When we started using DuckDB more heavily at Altertable, we quickly discovered the NetQuack (community-maintained) extension for URL parsing. While functional, its performance was... disappointing. Extracting domains from URLs was taking 37 seconds for 30,000 URLs. For a database that prides itself on analytical performance, this felt wrong.
After digging into the code, the culprits were obvious: heavy reliance on regular expressions, and a database round-trip for every TLD lookup.
What Was Wrong
The original NetQuack implementation had two major performance killers. First, every URL component extraction used regular expressions, with the pattern recompiled on each call.
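In sketch form, the old pattern looked something like this (our reconstruction for illustration, not the exact NetQuack source):

```cpp
#include <regex>
#include <string>

// Illustrative reconstruction of the regex-based approach. The regex
// is built on every call, so each row pays for pattern compilation
// as well as matching, and the result is copied into a new string.
static std::string ExtractHost(const std::string &url) {
    std::regex host_re(R"(^[a-zA-Z][a-zA-Z0-9+.-]*://([^/:?#]+))");
    std::smatch match;
    if (std::regex_search(url, match, host_re)) {
        return match[1].str(); // allocates a temporary string per row
    }
    return "";
}
```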
Second, TLD extraction required querying a database table populated from the Mozilla Public Suffix List (PSL).
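The per-row lookup had roughly this shape (again a reconstruction; the table name `public_suffix_list` and the helper are our illustration, not NetQuack's actual identifiers):

```cpp
#include "duckdb.hpp"

#include <string>

// Reconstruction of the original pattern: each TLD extraction runs a
// SQL query against a table previously populated from the PSL, paying
// for parsing, planning, execution, and possibly disk I/O per row.
static bool IsPublicSuffixViaTable(duckdb::Connection &con,
                                   const std::string &candidate) {
    auto result = con.Query(
        "SELECT 1 FROM public_suffix_list WHERE suffix = '" + candidate + "'");
    return !result->HasError() && result->RowCount() > 0;
}
```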
This meant every domain extraction triggered multiple database queries. For analytical workloads processing thousands of URLs, this was a disaster.
The ClickHouse-Inspired Solution
We completely rewrote the extension using character-by-character parsing instead of regex.
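A minimal sketch of the idea, using a host extractor as the example (simplified; the production code covers more components and edge cases such as userinfo and IPv6 literals):

```cpp
#include <string_view>

// Single-pass host extraction: no regex, no allocation. The return
// value is a zero-copy view into the caller's buffer.
static std::string_view ExtractHost(std::string_view url) {
    // Skip the scheme separator if one is present.
    size_t scheme_end = url.find("://");
    size_t start = (scheme_end == std::string_view::npos) ? 0 : scheme_end + 3;
    // The host ends at the first '/', ':', '?', or '#'.
    size_t end = start;
    while (end < url.size() && url[end] != '/' && url[end] != ':' &&
           url[end] != '?' && url[end] != '#') {
        end++;
    }
    return url.substr(start, end - start);
}
```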
This approach uses `std::string_view` to return slices of the input rather than copies: one pass per URL, no pattern compilation, and no temporary allocations.
Then, the bigger win was eliminating database dependencies. We use a build-time generator to turn the Mozilla Public Suffix List into a perfect hash table baked into the extension.
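To show the idea at toy scale, here is a hand-rolled perfect hash over five example suffixes. The real table is emitted by a generator from the full PSL (ClickHouse uses GNU gperf for its TLD tables; we assume a generator in that spirit here):

```cpp
#include <string_view>

// Toy perfect hash over five suffixes. A generator searches for a
// hash function and table size that map every key to a unique slot;
// (length + first byte) mod 13 happens to be collision-free here.
static constexpr int TABLE_SIZE = 13;
static constexpr std::string_view TABLE[TABLE_SIZE] = {
    "co.uk", "", "", "io", "", "", "",
    "", "", "net", "org", "com", ""};

static constexpr unsigned Hash(std::string_view key) {
    return (static_cast<unsigned>(key.size()) +
            static_cast<unsigned char>(key[0])) % TABLE_SIZE;
}

static bool IsPublicSuffix(std::string_view s) {
    // One hash, one comparison: no I/O, no allocation.
    return !s.empty() && TABLE[Hash(s)] == s;
}
```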
This creates a collision-free hash function compiled directly into the extension binary. Now TLD lookups are O(1) memory operations instead of database queries.
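Domain extraction then just walks the host's labels against that table. A simplified sketch reusing `IsPublicSuffix` from above (the real PSL also has wildcard and exception rules, which we ignore here):

```cpp
// Find the registrable domain: scan labels left to right and stop at
// the first (i.e., longest) suffix found in the table, then back up
// one label. Simplified: no wildcard/exception rule handling.
static std::string_view ExtractDomain(std::string_view host) {
    size_t label_start = 0;
    while (label_start < host.size()) {
        if (IsPublicSuffix(host.substr(label_start))) {
            if (label_start == 0) return host; // host itself is a suffix
            size_t prev = (label_start >= 2)
                              ? host.rfind('.', label_start - 2)
                              : std::string_view::npos;
            size_t start = (prev == std::string_view::npos) ? 0 : prev + 1;
            return host.substr(start); // registrable domain
        }
        size_t dot = host.find('.', label_start);
        if (dot == std::string_view::npos) break;
        label_start = dot + 1; // drop the leftmost label and retry
    }
    return host; // no known suffix
}
```

For `news.bbc.co.uk` this tests `news.bbc.co.uk`, then `bbc.co.uk`, then matches `co.uk` and returns `bbc.co.uk`.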
The perfect hash table adds only ~500KB to the binary size but eliminates runtime PSL downloads, database table storage, query execution overhead, and memory allocations for temporary strings.
The Results
We benchmarked both versions against 30,000 URLs. The improvements were staggering:
| Function | Before | After | Improvement |
|---|---|---|---|
| `extract_schema` | 0.069s | 0.010s | 6.9x faster |
| `extract_host` | 0.086s | 0.008s | 10.7x faster |
| `extract_path` | 0.533s | 0.009s | 59.2x faster |
| `extract_query_string` | 0.127s | 0.008s | 15.8x faster |
| `extract_port` | 0.202s | 0.004s | 50.5x faster |
| `extract_domain` | 37.503s | 0.012s | 3125.2x faster |
| `extract_subdomain` | 35.831s | 0.010s | 3583.1x faster |
| `extract_tld` | 36.003s | 0.009s | 4000.3x faster |
| `extract_extension` | 0.329s | 0.007s | 47.0x faster |
The TLD-dependent functions saw the most dramatic improvements because they eliminated both regex parsing AND database queries. The architecture change was fundamental:
- Before: URL → regex parse → PSL table query (disk I/O) → result
- After: URL → single-pass character scan → in-memory perfect-hash lookup → result
The old implementation paid for regex compilation and matching on every URL, plus O(log n) database lookups with disk I/O. The new implementation makes a single O(n) pass over each URL's characters, with O(1) in-memory perfect-hash lookups.
And despite the complete rewrite, all existing tests pass and edge cases are handled correctly.
Key Takeaways
Profile before optimizing. The 37-second domain extraction time immediately pointed to the real bottlenecks. Always measure first.
Eliminate I/O in hot paths. Database queries in URL parsing functions were the primary performance killer. Moving to compile-time data structures eliminated this entirely.
Character parsing beats regex for simple patterns. For structured data like URLs, character-by-character parsing is often faster and more predictable than regex.
Perfect hashing works great for static data. When you have a known, static dataset (like TLDs), perfect hash functions provide optimal lookup performance.
Zero-copy string operations matter. Using `std::string_view` slices instead of allocating temporary strings keeps the hot path allocation-free.
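For instance, the difference in a slicing helper (a generic illustration, not NetQuack's actual code):

```cpp
#include <string>
#include <string_view>

// Allocating: each call may heap-allocate a copy of the slice.
std::string SliceCopy(const std::string &url, size_t pos, size_t len) {
    return url.substr(pos, len);
}

// Zero-copy: returns a pointer + length into the original buffer,
// valid for as long as the underlying URL string stays alive.
std::string_view SliceView(std::string_view url, size_t pos, size_t len) {
    return url.substr(pos, len);
}
```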
Try It Yourself
The optimized NetQuack extension is currently under review in Pull Request #13. While we wait for the maintainer to review and hopefully merge these changes, you can:
- Test the optimized version by building from our branch.
- Run the benchmarks yourself to see the performance improvements.
- Check out the source code on GitHub to see the detailed implementation.
This optimization journey reminded us why performance matters in analytical databases. When you're processing millions of URLs, the difference between 37 seconds and 0.012 seconds isn't just a nice-to-have—it's the difference between interactive analysis and waiting for coffee to brew.
Sometimes the best optimization is throwing out the old approach entirely and starting fresh with the right architectural principles. In this case, ClickHouse showed us the way.