Storing UID with Column Stride Fields

In an earlier post, I discussed a technique to store an application provided UID for a document using Payloads that allows for fast loading and lookup via a Lucene internal docid. I reran the test program on my new laptop, and for 2M docs with randomly generated UIDs of type long, loading into a lookup array took 90ms.

With the beta release of Lucene 4.0.0, a new feature called Column Stride Fields was introduced. Column Stride Fields provide a forward lookup from a Lucene internal docid to a typed value, which makes storing an UID of type long a perfect use-case for it.

I modified my test program to store the UID instead of Payloads, into a Column Stride Field:

// indexing
for (long id : uniqIds){
   Document doc = new Document();
   Field fld = new LongDocValuesField("_ID", id);
   doc.add(fld);
   writer.addDocument(doc);
}

// loading
AtomicReader reader = ...

DocValues docVals = reader.docValues("_ID");
long[] uidArray = (long[]) docVals.getSource().getArray();

As you can see, the code is much simpler: No need to construct a fake token stream and no need to encode into and decode from a byte array from a long etc. Intuition says this would be faster too since we know the type ahead of the time and conversion between long and byte[] can be avoided. So I re-timed the program, and loading the array took 43ms. That is more than a 200% speed-up!

Details of Column Stride Fields can be found here. Kudos to Simon Willnauer for implementing this!