Wednesday, July 8, 2009

Clucene-Redland: semantic queries with Clucene

I've been hanging out in #strigi a bit lately, and one thing I picked up from Jos van den Oever is that Clucene could do with RDF search capabilities. Clucene is fairly well suited to handling tuple searches: an RDF resource can be represented as a document. Each document can contain as many fields as required, and more importantly they don't need to be predefined as in a database table. Values can only be stored as strings, but there are lots of tools in Clucene to treat the values as integers or dates.

As a proof of concept, I have written a storage back-end for the Redland librdf library. This was very easy to get running, but at this stage it is not complete. Currently the storage is read-only and uses the index created by Strigi (which has RDF-like structures for indexed documents).

At this stage, the code I've written can, for example, run this query:

PREFIX www:
PREFIX fdo:
PREFIX fd:
PREFIX strigi:
SELECT ?url, ?parent_type WHERE {
?x fd:fileExtension "pdf" .
?x fd:url ?url .
?x fdo:isPartOf ?parent .
?y fd:url ?parent .
?y www:type ?parent_type
}


Implementation

To support Redland's RDF queries, there are 7 possible combinations of a subject, predicate, object search that need to be supported. Here is how I would implement them (I have only implemented a few at this point, just enough to run a query like the SPARQL query above).

  1. S P O

    • This tests whether a certain document contains a certain key/value pair. The implementation just loads the document, gets the field, compares its value against the expected one, and returns the matching statement if it exists, and nothing otherwise.

  2. S P ?

    • This is a request for the value of a certain document's field. The implementation just loads the document and returns a list made from the field->getValues() array.

  3. S ? ?

    • Similar to S P ?, but goes through each field in the document->getFields() vector (or uses the DocumentFieldEnumeration in Clucene <= 0.9.21).

  4. S ? O

    • This is a request for a list of fields containing a certain value. The implementation just loads the document and goes through each field searching for the requested value.

  5. ? P O

    • This is a request for documents containing a certain key/value pair. This is a typical Clucene query, but in order to respect case sensitivity, the current implementation uses a TermDocs enumerator, which gives a list of documents containing a given term (a Clucene field/value pair).

  6. ? ? O

    • Same as ? P O, but the query is repeated for every available field in Clucene (the list of fields can be retrieved using the IndexReader->getFieldNames function).

  7. ? P ?

    • This is a request for all values in a field and the documents which contain them. The implementation would use a Clucene TermEnum. The TermEnum is an iterator over all the terms in the index (ordered by key, then value). So to list all the terms with the given predicate we would first skip to the first term containing our predicate (by passing the term to the skip function), then iterate through the TermEnum until the term's field changes. The next step is to take each of these terms and do the same as was done in ? P O.
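As a rough illustration of four of the patterns above, here is a minimal sketch in standard C++ only. It does not use the CLucene API: ToyIndex, DocId, Term and the match* functions are hypothetical names invented for this example, and documents are identified by plain integers as a simplifying assumption. The ordered map stands in for the term dictionary (ordered by field, then value), and its lower_bound call plays the role of skipping to the first term of a predicate.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an index; none of these types are CLucene's own.
using DocId = int;
using Term = std::pair<std::string, std::string>;  // (field, value) = (predicate, object)

struct ToyIndex {
    // Stored fields per document: subject -> (predicate -> values).
    std::map<DocId, std::multimap<std::string, std::string>> docs;
    // Term dictionary ordered by field, then value, with the documents for each term.
    std::map<Term, std::vector<DocId>> terms;

    void add(DocId doc, const std::string& field, const std::string& value) {
        docs[doc].emplace(field, value);
        terms[Term(field, value)].push_back(doc);
    }
};

// S P O: does the document hold this exact field/value pair?
bool matchSPO(const ToyIndex& idx, DocId s, const std::string& p, const std::string& o) {
    auto doc = idx.docs.find(s);
    if (doc == idx.docs.end()) return false;
    auto range = doc->second.equal_range(p);
    for (auto it = range.first; it != range.second; ++it)
        if (it->second == o) return true;
    return false;
}

// S P ?: all values of one field of one document.
std::vector<std::string> matchSP(const ToyIndex& idx, DocId s, const std::string& p) {
    std::vector<std::string> values;
    auto doc = idx.docs.find(s);
    if (doc == idx.docs.end()) return values;
    auto range = doc->second.equal_range(p);
    for (auto it = range.first; it != range.second; ++it) values.push_back(it->second);
    return values;
}

// ? P O: the documents listed for a single (field, value) term.
std::vector<DocId> matchPO(const ToyIndex& idx, const std::string& p, const std::string& o) {
    auto it = idx.terms.find(Term(p, o));
    return it == idx.terms.end() ? std::vector<DocId>() : it->second;
}

// ? P ?: seek to the first term of the field, then walk until the field changes.
std::vector<std::pair<std::string, DocId>> matchP(const ToyIndex& idx, const std::string& p) {
    std::vector<std::pair<std::string, DocId>> out;
    for (auto it = idx.terms.lower_bound(Term(p, ""));
         it != idx.terms.end() && it->first.first == p; ++it)
        for (DocId d : it->second) out.emplace_back(it->first.second, d);
    return out;
}
```

The remaining patterns follow the same shape: S ? ? and S ? O loop over every field of one document, and ? ? O repeats matchPO for each known field name.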

Problems to be solved:

Contexts

I haven't implemented this yet, but the implementation could just use a Clucene filter (basically a bitset marking the matching documents). The filter could be cached for performance. Each of the S/P/O search implementations above would then have to be modified to check every document it is about to return against the bitset first.
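The bitset check could look something like the sketch below. It is a model in standard C++, not CLucene's filter API: ContextBits and filterByContext are hypothetical names, and the assumption is that document numbers index directly into the bitset.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a filter: bit i is set when document i
// belongs to the requested context.
using ContextBits = std::vector<bool>;

// Keep only the candidate documents whose bit is set in the context bitset.
// Documents outside the bitset's range are treated as not in the context.
std::vector<int> filterByContext(const std::vector<int>& candidates,
                                 const ContextBits& inContext) {
    std::vector<int> kept;
    for (int doc : candidates) {
        if (doc < 0) continue;
        const std::size_t i = static_cast<std::size_t>(doc);
        if (i < inContext.size() && inContext[i]) kept.push_back(doc);
    }
    return kept;
}
```

Because the bitset depends only on the context, it could be built once per context and reused across searches, which is where the caching would pay off.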

Data types

I haven't given much thought to how to handle the different data types in RDF. Clucene can only store string terms, but there are a number of tools in Clucene for storing integers and dates so that search comparisons can be made (this is done by padding the values).
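To show why padding makes string comparisons behave like numeric ones, here is a minimal sketch. The padNumber function is a hypothetical helper written for this example, not one of the Clucene tools, and it only covers non-negative 32-bit integers; negative numbers and dates would need their own encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encode a non-negative integer as a fixed-width, zero-padded decimal string
// so that comparing the strings gives the same order as comparing the numbers.
// The default width of 10 fits the largest 32-bit value (4294967295).
std::string padNumber(std::uint32_t value, std::size_t width = 10) {
    const std::string digits = std::to_string(value);
    if (digits.size() >= width) return digits;
    return std::string(width - digits.size(), '0') + digits;
}
```

Without padding, "9" sorts after "10" as a string; with padding, "0000000009" sorts before "0000000010", so range comparisons over string terms give the expected result.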

Case sensitivity

I haven't given much thought to case sensitivity either.

A possible solution to the data type and case sensitivity problems above may be to 'encode' the case sensitivity and data type in the field name (for example, something like http://url#^^datatype@case). If this was done, a customised Clucene MultiFieldQueryParser would probably be necessary to intelligently handle field types when a query is made.
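One way such an encoding might be packed and unpacked is sketched below. Everything here is an assumption made for illustration: the FieldSpec struct, the encodeField and decodeField functions, and the tag values such as "int" or "cs" are hypothetical, and the layout simply follows the predicate^^datatype@case shape suggested above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical description of an encoded field name.
struct FieldSpec {
    std::string predicate;  // the predicate URI
    std::string datatype;   // illustrative tag, e.g. "int"
    std::string caseMode;   // illustrative tag, e.g. "cs" for case sensitive
};

// Build a field name of the form predicate^^datatype@case.
std::string encodeField(const FieldSpec& f) {
    return f.predicate + "^^" + f.datatype + "@" + f.caseMode;
}

// Split a field name back into its parts. The markers are searched from the
// end, so a predicate URI may itself contain '@'. Returns false when the name
// does not carry both markers.
bool decodeField(const std::string& name, FieldSpec& out) {
    const std::size_t at = name.rfind('@');
    if (at == std::string::npos) return false;
    const std::size_t carets = name.rfind("^^", at);
    if (carets == std::string::npos) return false;
    out.predicate = name.substr(0, carets);
    out.datatype = name.substr(carets + 2, at - (carets + 2));
    out.caseMode = name.substr(at + 1);
    return true;
}
```

A customised query parser could call something like decodeField on each field name to decide whether to pad a number or lower-case a value before building the term.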

The power of search

Once there is a Clucene backend capable of doing SPARQL queries, the obvious next step would be to extend search capabilities to the queries, much like what is done by the Sesame2 lucene sail. This would allow for something like:

PREFIX search:
SELECT ?x ?score ?snippet WHERE {?x search:matches ?match.
?match search:query "ben";
search:score ?score;
search:snippet ?snippet. }

Getting CLucene-redland

Currently the code resides in the Clucene git repository, in the 'clucene-redland' branch. Git details can be found at https://sourceforge.net/scm/?type=git&group_id=80013. The actual project code is in the /src/contribs/clucene-redland directory.

NOTE: I haven't had a chance to figure out why (I have a baby coming in the next few weeks!!!), but the code doesn't currently compile with the default Ubuntu redland-dev, raptor-dev and rasqal-dev packages. I was working with the Redland source code directly so I could trace into the code, and when I tried the default Ubuntu packages, it was segfaulting.

NOTE: Currently the clucene-redland code doesn't build as a real Redland module, so the calling code must call the initialise function before loading the Clucene storage object. Loading the Clucene storage is done something like this:

librdf_storage_clucene_initialise(world);
storage = librdf_new_storage(world, "clucene", "test", "dir='strigi/index/dir'");

Changing it to a real Redland module would be trivial.

Future

I don't have much time to develop this further (again, having a baby soon!!). This was done as a proof of concept and with the hope that someone else will pick it up (I of course would like to see more interest in CLucene in general). I suggest anyone interested come onto the CLucene mailing list (https://lists.sourceforge.net/lists/listinfo/clucene-developers) to discuss and/or post a comment here.


Ben van Klinken

Original Author of Clucene, ustramooner@users.sourceforge.net