Making Boolean Queries Faster

At work today, we were looking to optimize boolean queries of the type:

f1:v1^b1 OR f1:v2^b2 OR f1:v3^b3 ...


f1:v1^b1 AND f1:v2^b2 AND f1:v3^b3 ...

where f1 is a field of some structured data (e.g. Analyzed=No, tf=No, norm=No)

We see that this type of query has these patterns:

  • all values are in the same field
  • all clauses have the same operator joining them

When the number of clause is large, BooleanQuery can be rather slow.

As of Lucene 2.4, Lucene query api has adopted DocIdSet, and DocIdSetIterator abstractions, which opened doors for various query optimizations. (IMHO, this is one of the best improvements in Lucene from an API stand point.)

For our project, we have quite a few OR-clauses, and the search performance was pretty bad.

Instead, when the index loads, we load into memory a data structure very much like the FieldCache.StringIndex. In a 5 million index shard, for a field containing N values, memory footprint is approximately 20MB + N Strings.

Given K clauses in our OR query, we build a bit-set with K bits turned on, each bit corresponding to an index into String array. (max bit is N)

We then built a DocIdSetIterator that iterates the order array and checks to see if the order[i]-bit is turned on in the bit-set.

The resulting performance is quite encouraging: As the number of clauses increases, the iteration time is capped. From our benchmark, when we get to 10+ clauses, the performance gain is approximately 8x!

Here is a detailed write-up:

This is an example of how good API design can open doors to things such as what we were able to do.

Kudos to the Lucene team for the DocIdSet api!