Sunday, May 20, 2012

Variables in HyperGraphDB Queries

Introduction
A feature that has been requested on a few occasions is the ability to parametarize HyperGraphDB queries with variables, sort of like the JDBC PreparedStatement allows. That feature has now been implemented with the introduction of named variables. Named variables can be used in place of a regular condition parameters and can contain arbitrary values, whatever is expected by the condition. The benefits are twofold: increase performance by avoiding to recompile queries where only a few condition parameters change and a cleaner code with less query duplication. The performance increase of precompiled queries is quite tangible and shouldn't be neglected. Nearly all predefined query conditions support variables, except for the TypePlusCondition which is expanded into a type disjunction during query compilation and therefore cannot be easily precompiled into a parametarized form. It must also be noted that some optimizations made during the compilation phase, in particular related to index usage, can't be performed when a condition is parametarized. A simple example of this limitation is a parameterized TypeCondition - indices are defined per type, so if the type is not known during query analysis, no index will be used.

The API
The API is designed to fit the existing query condition expressions. A variable is created by calling the hg.var method and the result of that method passed as a condition parameter. For example:

HGQuery<HGHandle> q = hg.make(HGHandle.class, graph).compile(
    hg.and(hg.type(typeHandle),hg.incident(hg.var("targetOfInterest"))));
This will create a query that looks for links of specific type and that point to a given atom, parameterized by the variable "targetOfInterest". Before running the query, one must set a value of the variable:

q.var("targetOfInterest", targetHandle);
q.execute();

It is also possible to specify an initial value of a variable as the second of the hg.var method:


q = hg.make(HGHandle.class, graph).compile(
    hg.and(hg.type(typeHandle),hg.incident(hg.var("targetOfInterest", initialTarget))));
q.execute();
Note the new hg.make static method: it takes the result type and the graph instance and returns a query object that doesn't have a condition attached to it yet. To attach a condition, one must call the HGQuery.compile method immediately after make. This version of make establishes a variable context attached to the newly constructed query. All query conditions constructed after the call to make and before the call to compile will implicitly assume variables belong to that context. That is why to avoid surpises you should call compile right after make.

Once a query is thus created, one can execute it many times by assigning new values to the variables through the hg.var method as shown above.

The Implementation
The requirement of calling HGQuery.compile right of the make method comes from the fact that the existing API constructs conditions in a static environment. The condition is fully constructed before it is passed to the compile method. So to make a variable context available during the condition construction process (the evaluation of the query condition expression), we do something relatively unorthodox here and attach that context (essentially a name/value map) as a thread local variable in the make method. The alternative would have been creating a traversal of the condition expression that collects all variables used within them. Either that traversal process would have to know about all condition types, or the HGQueryCondition would have had to be enhanced with some extra interface for "composite conditions" and "variable used within a condition". I didn't like either possibility. Having a two step process of calling hg.make(...).compile(....) seems clean and concise enough.

Variables are implemented through a couple of new interfaces: org.hypergraphdb.util.Ref, org.hypergraphdb.util.Var. When a condition is created with a constant value (instead of a variable), the org.hypergraphdb.util.Constant implementation of the Ref interface is used. All conditions now contain Ref<T> private member variable where before they contained a member variable of type T.  I won't say more about those reference abstractions, except one should avoid using them, unless one is implementing a custom condition...they are considered internal.

On a side note, this idea of creating a reference abstraction essentially implements the age-old strategy of introducing another level of indirection in order to solve a design problem. I've used this concept in many places now, including for things like caching and dealing with complex scopes in bean containers (Spring), so I spend some time trying to create a more general library around this. Originally I intended to create this library as a separate project and make use of it in HyperGraphDB. But I didn't come up with something satisfactory enough, so instead I created the couple of interfaces mentioned above. However, eventually those interfaces and classes may be replaced by a separate library and ultimately removed from the HyperGraphDB codebase.