Useless Inc. - Scripting Lucene with Python

Scripting Lucene with Python

Among a lot of other tools and technologies we use Lucene here at Delver. Lucene is a very mature, high-performance and full-featured information retrieval library, which in simple terms means: it allows you to add search functionality to your applications with ease. I could spend an entire day talking about Lucene, but instead I'd like to show you how to write scripts for Lucene with Python on Windows (and thus differs from Sujit Pal's excellent instructions on the same subject matter).

First, make sure you have your Python environment all set up -- at the time of writing this I use the Python 2.5.1 Windows installer. Next you'll need Python bindings for the library, so go and grab a copy of PyLucene; to make a long story short I suggest you use the JCC version for Windows which is downloadable here. Grab it and install it, no fussing necessary. Finally you'll need the Java VM DLL in your path, and this depends on which type of JRE you're using:

If you're using the Java JDK, add the mental equivalent of C:\Program Files\Java\jdk1.6.0_03\jre\bin\client to your path;
If you're using a JRE, add C:\Program Files\Java\jre1.6.0_03\bin\client to your path.

(As an aside, you should probably be using %JAVA_HOME% and %JRE_HOME% and indirecting through those.)

Now you can quickly whip down scripts in Python, like this one which took about two minutes to write:

#!/usr/bin/python
#
# extract.py -- Extracts term from an index
# Tomer Gabel, Delver, 2008
#
# Usage: extract.py <field_name> <index_url>
#

import sys
import string
import lucene
from lucene import IndexReader, StandardAnalyzer, FSDirectory

def usage():
	print "Usage:\n"
	print sys.argv[ 0 ] + " <field_name> <index_url>"
	sys.exit( -1 )

def main():
	if ( len( sys.argv ) < 3 ):
		usage()

	lucene.initVM( lucene.CLASSPATH )

	term = sys.argv[ 1 ]
	index_location = sys.argv[ 2 ]
	
	reader = IndexReader.open( FSDirectory.getDirectory( index_location, False ) )
	try:
		for i in range( reader.maxDoc() ):
			if ( not reader.isDeleted( i ) ):
				doc = reader.document( i )
				print doc.get( term )
	finally:
		reader.close()

main()

Note the VM initialization line at the beginning -- for the JCC version of PyLucene this is a must, but even like so using the library is extremely painless. Kudos to the developers responsible!

Sunday, 24 February 2008 12:26:41 (Jerusalem Standard Time, UTC+02:00)

-
Development