
go-useragent

Why?

User-Agent parsing is much more primitive than you think. If you look under the hood of most common implementations, especially the existing solutions in the Go ecosystem, there is a lot of brute forcing involved.

More specifically, it’s mostly regular expressions. Hundreds and hundreds of them evaluated against every incoming string until a match is discovered. It feels… wasteful. Surely there is a better way?

I decided to build go-useragent to provide a high-performance alternative to existing libraries that were slower or more memory intensive. Personally, this was a perfect opportunity to learn about the Go runtime at a much lower level and see how far I could push things in terms of performance.

The Idea

Rather than evaluating the same complex regex logic over and over again, I wondered if we could move all that heavy lifting into a one-time startup cost.

Researching more on this topic, I stumbled on an article by Raygun. They had sped up their ingestion pipeline 45x using a trie (or prefix tree). By pre-populating a tree with known browsers during initialization, they could parse a string by reading it character by character in a single pass.

Uncompressed Trie


Compressing Tries

Raygun further optimized their approach by compressing their tree into a Radix Trie, where they squash non-branching paths (like m-o-z-i-l-l-a) into single nodes to save on dictionary lookup costs.

Radix Trie


However, I opted against that. While compression saved memory and improved performance for them, it introduces heavy branching logic to check substrings. In my profiling experiments, the overhead of managing this complexity consistently yielded slower implementations. In contrast, keeping the trie uncompressed allowed for a tighter, more predictable parser loop.

Maps in Go

A naive implementation of a trie in Go may rely on maps. However, Go’s built-in maps are heavy. Initializing a single map adds an overhead of ~160 bytes to hold the hashes, keys, and values. This overhead scales poorly when allocating millions of tiny, sparsely populated nodes.

Map-Based Node

[Diagram: a map-based node pairs a char rune with a children map[rune]*node; each lookup hashes the rune into a ~160B bucket structure holding the keys ('o', 'a') and child pointers.]

Since User-Agent data naturally doesn’t branch much, I opted to implement a slice-backed variant of the trie.

Instead of a heavy map, each node contains a 4-byte rune and a 24-byte slice header (pointer, length, capacity). With alignment padding, the entire node packs into 32 bytes.

Furthermore, storing the children as a slice of values ([]node) means the child nodes are allocated as a single, contiguous block of memory. Looping over a small, contiguous slice exploits CPU cache locality, which leads to significantly faster parsing than computing a hash and chasing pointers through the heap to resolve a map bucket.

Slice-Based Node

[Diagram: a slice-based node pairs a char rune with a children []node slice; the children, node 'o' (32B), node 'a' (32B), and so on, sit in one contiguous block addressed by pointer plus offset.]

This single switch cut the parser’s base memory footprint from ~19MB down to ~7MB and pushed performance even further despite requiring more arithmetic comparisons per node.

Other Notable Optimizations

There were a slew of other minor optimizations I made, such as:

  • State Machines: Rewriting the core loop as a strict finite state machine, allowing the compiler to generate optimal jump tables and reducing branch prediction misses.
  • Forking Standard Library: Functions like unicode.IsLetter decode strings into rune types for full UTF-8 support. However, as we only care about ASCII, I wrote leaner variants of the standard library functions to reduce branching even further.
  • Struct Alignment: Go pads struct fields similar to C. Optimally sorting the fields allows us to pack the tree’s memory footprint as tightly as possible.

Status

go-useragent is in a stable, “finished” state. It is actively used in production traffic for my analytics project, Medama Analytics.

As for the results, the benchmarks show roughly 3-4x the performance of other solutions, with zero per-parse memory allocations to pressure the GC.