[Cubicweb] On advanced full text search in CubicWeb

Florent Cayré florent at secondweb.fr
Fri Jun 4 10:09:47 CEST 2010

Hi there,

Full text search in CubicWeb is great (notably the fulltext_container
functionality), but it lacks some advanced features that would make it a
very powerful tool. In decreasing order of interest, as far as I am
concerned, these are:

* results pertinence ordering
* advanced content parsing
* advanced requests

Pertinence ordering

What should be done here is :

* allow a rank function to be used in an ORDERBY clause; we should
probably provide a single RQL function that maps to the "standard" rank
function of the DBMS if it exists, and applies no specific order
otherwise (think of sqlite for example)
* allow advanced use of pertinence computation when the DBMS in use
supports it. PostgreSQL has advanced features in this area which
really make a difference, and we should be able to use them. We could
then use weights to increase the pertinence of one attribute or
relation of an entity relative to its other attributes/relations (a
typical use is to rank a blog article's title above its content), and
to increase the pertinence of one entity type relative to the other
entity types (e.g. if we estimate that users are more likely to search
for an event in a city than for people living in a city, we could give
the event entity type a higher weight than the people entity type,
assuming fulltext_container is set on the city relations so that event
and people are the containers). PostgreSQL supports this, and I think
we could add RQL support for it by adding optional weight
properties/methods to the business entities (defaulting to uniform
"0.5" weights) ::

 from cubicweb.entities import AnyEntity

 class Event(AnyEntity):
     __regid__ = 'Event'

     # proposed: weight of this entity type among the other entity types
     fulltext_weight = 1.

     # proposed: weight of one attribute/relation within this entity
     def fulltext_rel_weight(self, rtype, role='subject'):
         if str(rtype) == 'title':
             return 1.
         return super(Event, self).fulltext_rel_weight(rtype, role)

 class Person(AnyEntity):
     __regid__ = 'Person'

     fulltext_weight = 0.8

     def fulltext_rel_weight(self, rtype, role='subject'):
         if str(rtype) == 'name':
             return 1.
         return super(Person, self).fulltext_rel_weight(rtype, role)

Note we can weight attributes and relations the same way.
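
To check how the proposed defaults and overrides would interact without a
CubicWeb instance at hand, here is a small self-contained sketch.
WeightedEntity is a hypothetical stand-in for AnyEntity that carries the
proposed uniform 0.5 defaults; neither it nor the weight attributes exist
in CubicWeb today.

```python
class WeightedEntity(object):
    """Hypothetical stand-in for AnyEntity with the proposed defaults."""

    # proposed default weight of an entity type
    fulltext_weight = 0.5

    def fulltext_rel_weight(self, rtype, role='subject'):
        # proposed default weight of any attribute/relation
        return 0.5


class Event(WeightedEntity):
    fulltext_weight = 1.0

    def fulltext_rel_weight(self, rtype, role='subject'):
        # only the title is boosted, everything else keeps the default
        if str(rtype) == 'title':
            return 1.0
        return super(Event, self).fulltext_rel_weight(rtype, role)


event = Event()
weights = dict((rtype, event.fulltext_rel_weight(rtype))
               for rtype in ('title', 'description'))
```

With these definitions, only 'title' gets the boosted weight and
'description' falls back to the default.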

See http://www.postgresql.org/docs/8.3/static/textsearch-controls.html
for more information on how to implement this using PostgreSQL,
notably the setweight function.
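
As a rough illustration of what the SQL generation behind such weights
might look like, here is a minimal sketch. The helper names, the
thresholds used to map a weight between 0 and 1 onto PostgreSQL's four
labels (A to D) and the backend names are all assumptions made for this
example; only setweight, to_tsvector, plainto_tsquery and ts_rank are
actual PostgreSQL functions. Column names are assumed to be trusted
identifiers.

```python
def weight_label(weight):
    """Map a weight in [0, 1] to a PostgreSQL tsvector label.

    The thresholds are arbitrary and only serve the example.
    """
    if weight >= 0.9:
        return 'A'
    if weight >= 0.6:
        return 'B'
    if weight >= 0.3:
        return 'C'
    return 'D'


def weighted_tsvector_sql(columns):
    """Build a SQL expression concatenating one weighted tsvector per column.

    `columns` maps a (trusted) column name to its weight.
    """
    parts = []
    for column, weight in sorted(columns.items()):
        parts.append("setweight(to_tsvector(coalesce(%s, '')), '%s')"
                     % (column, weight_label(weight)))
    return ' || '.join(parts)


def rank_order_clause(backend, vector_column):
    """Return an ORDER BY clause using the backend's rank function, if any.

    Backends without a rank function (e.g. sqlite) get no specific order.
    The %(query)s placeholder is left for the database driver to fill in.
    """
    if backend == 'postgres':
        return ("ORDER BY ts_rank(%s, plainto_tsquery(%%(query)s)) DESC"
                % vector_column)
    return ''


indexed = weighted_tsvector_sql({'title': 1.0, 'content': 0.5})
```

Here `indexed` holds the expression that would be stored in the index, and
rank_order_clause('sqlite', ...) returns an empty string, which matches
the "no specific order otherwise" fallback suggested above.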

Advanced content parsing / advanced requests (using boolean operators, etc.)

I haven't spent much time looking into this, so a lot of work remains
to be done; I am only giving my first impressions here.

In the text parsing area too, some DBMSs do a great job, and PostgreSQL
is among them. Making these advanced functionalities accessible (see
the parsers and dictionaries chapters of
http://www.postgresql.org/docs/8.3/static/textsearch.html) is a first
achievable goal, before possibly thinking of reimplementing them to
ensure DBMS portability: we could then experiment a bit with these
tools and see whether they are useful enough to reimplement in the
logilab.database code base. Today, CubicWeb supplies a basic parser
whose role is also to ensure that basic systems like sqlite remain
usable, and as such this software layer must be preserved. It would be
great, however, to be able to use a custom parsing system. With
PostgreSQL again, this could be fairly simple, because we can pass an
optional configuration name to the full text search functions, which
can be a very powerful tool. The only problem is that we cannot use
this functionality today, because the CW default parsing mechanism
cannot be bypassed.
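
To make the "optional configuration name" point concrete, here is a
minimal sketch of how a search condition could be built with or without
a configuration. The function name and the pyformat-style placeholders
are assumptions made for this example; to_tsvector and plainto_tsquery
do accept an optional leading configuration argument in PostgreSQL.

```python
def fulltext_condition(column, text, config=None):
    """Return (sql, params) matching `column` against `text`.

    `column` is assumed to be a trusted identifier. When `config` is
    given, it is passed to both PostgreSQL functions as a regconfig.
    """
    if config is None:
        sql = "to_tsvector(%s) @@ plainto_tsquery(%%(text)s)" % column
        return sql, {'text': text}
    sql = ("to_tsvector(%%(config)s::regconfig, %s) "
           "@@ plainto_tsquery(%%(config)s::regconfig, %%(text)s)" % column)
    return sql, {'text': text, 'config': config}


default_sql, default_params = fulltext_condition('content', 'blog title')
french_sql, french_params = fulltext_condition('content', 'titre', 'french')
```

The same configuration is used on both sides of the match so that the
document and the query are parsed consistently.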

Regarding advanced requests, I am really unsure if more than a basic
all AND or all OR feature would really be necessary : do users really
use complex requests? I doubt it, and so does
which favours the usage of facets (already implemented in CW).
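
A basic "all AND" or "all OR" feature could be as small as the following
sketch, which turns the words typed by the user into a string suitable
for PostgreSQL's to_tsquery. The function name and the `mode` parameter
are assumptions made for this example.

```python
import re


def simple_tsquery(user_input, mode='and'):
    """Join the words of `user_input` with the tsquery AND or OR operator."""
    if mode not in ('and', 'or'):
        raise ValueError("mode must be 'and' or 'or'")
    # keep word characters only, so no tsquery operator can be injected
    words = re.findall(r'\w+', user_input.lower(), re.UNICODE)
    operator = ' & ' if mode == 'and' else ' | '
    return operator.join(words)


all_and = simple_tsquery('CubicWeb full text')
all_or = simple_tsquery('CubicWeb full text', mode='or')
```

Anything beyond these two modes (nested expressions, negation, phrases)
would need a real query parser, which is exactly the complexity
questioned above.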

I really need help to make these ideas on full text search a reality
in CW, but I am available to help specify or code some parts, given
some direction. Any help from Logilab and others would be greatly
appreciated.

Florent, SecondWeb.
