Mental Detritus: Grokking Solr Trie Fields

I've been trying to wrap my head around Solr's trie field types for the past week, and finally made a break through.

Trie

The first thing you need to understand is the idea of a trie data type. Here's a basic outline of the data structure. Let's say you want to index the following list of words.

bad

bag

bar

bin

bit

We can arrange these words in a tree structure as follows.

b--a--d

| |

| |--g

| |

| \--r

\--i--n

\--t

To reconstruct one of the words you start at the root node of the tree, and work your way towards a leaf node, keeping track of the letters you encounter along the way. Depending on what you're using the trie data structure for, you may store some piece of data in one of the leaf nodes. If you want more information about the trie data structure, I will refer you to the Wikipedia page, which is fairly complete and easy to understand.

Solr's version of the trie

Another term for a trie is a "prefix tree". Solr uses this idea of prefixes to index numbers so that it can perform range queries efficiently. Just like we can organize words into tries, we can also organize numbers into tries.

Let's say I want to index the integer 3735928559. For clarity, let's rewrite that in hexadecimal, 0xDEADBEEF. When we index this using a TrieIntField, Solr stores the integer four times at different levels of precision.

0xDE000000

0xDEAD0000

0xDEADBE00

0xDEADBEEF

What Solr is doing here is constructing numbers with different length prefixes. This would be equivalent to a trie with this structure.

DE--AD--BE--EF

The reason that this allows for fast range queries is because of what the prefixes represent. The prefixes represent a range of values. It might be better to think of them indexed like this, instead.

0xDExxxxxx

0xDEADxxxx

0xDEADBExx

0xDEADBEEF

Each "x" represents an unset digit. That means the entry 0xDEADxxxx represents every number from 0xDEAD0000 to 0xDEADFFFF. You can get a better feel for this if you play around in the analysis section of the Solr admin console.

Precision Step

The option to set the precision step was the part that I understood the least. The available documentation is rather dense and unhelpful. The precision step lets you tune your index, trading range query speed for index size. A smaller precision step will result in a larger index and faster range queries. A larger precision step will result in a smaller index and slower range queries.

In the example above, I was using a precision step of 8, the default. What the precision step means is how many bits get pruned off the end of the number. Let's see what would happen if we indexed 0xDEADBEEF with a precision step of 12.

0xDExxxxxx

0xDEADBxxx

0xDEADBEEF

And here with a precision step of 4.

0xDxxxxxxx

0xDExxxxxx

0xDEAxxxxx

0xDEADxxxx

0xDEADBxxx

0xDEADBExx

0xDEADBEEx

0xDEADBEEF

As you can see, compared to the default precision step of 8, a precision step of 4 doubled the number of entries in the index. The way it speeds up range searches is by allowing better granularity. If I wanted to search for the documents matching the range 0xDEADBEE0 to 0xDEADBEEF with the default precision step, I would have to check all 16 records in the index and merge the results. With the precision step of 4, I can check the one record for 0xDEADBEEx and get the results I want.

That's a bit of a cherry picked example, but arbitrary range queries will be faster. How that works is left as an exercise for the reader.

May 5, 2015 - I originally thought that the trie data type was not supported directly by Lucene, but this is incorrect. Tokenization of numeric types into the various prefixes is the default behavior of Lucene when indexing numbers.

Mental Detritus

Thursday, January 3, 2013

Grokking Solr Trie Fields

Trie

Solr's version of the trie

Precision Step

4 comments: