[Cubicweb] Multisource in CW

Vincent Michel vincent.michel at logilab.fr
Mon May 21 17:56:52 CEST 2012


Hi list,

I've been playing for a while with a multisource API in CubicWeb that fulfills 
some needs of one of my projects.

In a nutshell, I want to be able to dynamically perform join queries on 
remote instances, without any local copy of the data, and with as few 
references as possible stored in the local database.

I've tried to make a little use-case in order to illustrate the expected 
behavior of CubicWeb. 

Feedback is more than welcome!




STORING REFERENCES
------------------

I have in my instance an entity class "Article", with a relation 
"contains_reference" that stores some URIs to external databases (e.g. 
http://dbpedia.org/foobar).

For this, I have modified the entities table by dropping the NOT NULL 
constraints on "source" and "asource", and I have added a column "exturi 
VARCHAR(256)". As I don't want to store external references in a specific 
entity type with its own table, I have introduced a new base entity type 
"Thing".

The classical API is still available:

                        relate(1234, "contains_reference", 4567)

but now, one can execute:

                        relate(1234, "contains_reference", "http://dbpedia.org/foobar")


This will create an entry in the "entities" table (if one does not already 
exist) with the following data:

EID       8910
type      Thing
source    NULL
asource   NULL
mtime     XX:YY:ZZ
extid     NULL
exturi    http://dbpedia.org/foobar

and it will insert the following row into the table 
"contains_reference_relation":

                        1234        8910

This behavior relies on very slight code modifications in the functions 
"related()" and "add_info()". CubicWeb's relation management stays 
unchanged.
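
As a rough illustration of the get-or-create behavior described above, here 
is a minimal sketch using sqlite3 and a simplified stand-in for the 
"entities" table. The helper get_or_create_thing(), the eid_from/eid_to 
column names and a relate() that takes a connection are assumptions made for 
the example, not the actual CubicWeb code:

```python
import sqlite3
import time


def get_or_create_thing(conn, uri):
    """Return the eid of the 'Thing' row holding this URI, creating it if needed."""
    row = conn.execute(
        "SELECT eid FROM entities WHERE exturi = ?", (uri,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO entities (type, source, asource, mtime, extid, exturi) "
        "VALUES ('Thing', NULL, NULL, ?, NULL, ?)",
        (time.strftime("%H:%M:%S"), uri),
    )
    return cur.lastrowid


def relate(conn, subject_eid, rtype, obj):
    """Accept either an eid (int) or a URI (str) as the object of the relation."""
    object_eid = get_or_create_thing(conn, obj) if isinstance(obj, str) else obj
    # rtype is interpolated into the table name, so it must be a trusted name.
    conn.execute(
        f"INSERT INTO {rtype}_relation (eid_from, eid_to) VALUES (?, ?)",
        (subject_eid, object_eid),
    )
    return object_eid


# Simplified schema (hypothetical): source/asource are nullable, exturi is new.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (eid INTEGER PRIMARY KEY, type TEXT NOT NULL, "
    "source TEXT, asource TEXT, mtime TEXT, extid TEXT, exturi VARCHAR(256))"
)
conn.execute(
    "CREATE TABLE contains_reference_relation (eid_from INTEGER, eid_to INTEGER)"
)
```

With this sketch, relating two articles to the same URI reuses a single 
"Thing" row, and passing an integer keeps the classical eid-based behavior.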

The main idea here is to keep as little information as possible in the 
database. The reference is a URI, which is universal and does not rely on a 
specific eid in a distant instance.



USING REFERENCES
----------------

Now, some small modifications (repository.py, querier.py) make it possible to 
run the following queries (the API may/will change...):



rset = rql('Any X, Y WHERE X contains_reference Y')

([871, "http://dbpedia.org/XXX"] (('Article', 'Thing'))
[871, "http://dbpedia.org/AAA"] (('Article', 'Thing'))
[872, "http://dbpedia.org/ZZZ"] (('Article', 'Thing'))
...
[953, "http://dbpedia.org/YYY"] (('Article', 'Thing'))
])




When the external reference is queried, we give the URI rather than the eid as 
it is more informative.

But now we can also join with distant databases, using the following syntax:

' |<appid>-<variable used for join> <DISTANT QUERY>'


For example:

rset = rql('Any X, L, D WHERE X contains_reference Y|dbpedia-Y Y label L, Y depiction D')

([871, "foo.png"] (('Article', 'Thing'))
[871, "bar.png"] (('Article', 'Thing'))
[872, None] (('Article', 'Thing'))
...
[953, "foobar.png"] (('Article', 'Thing'))
])
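
To make the syntax a bit more concrete, here is a small sketch of how such a 
query string could be split into its local part, the appid, the join 
variable and the distant query. The names split_query() and SplitQuery, and 
the assumptions that appids contain only letters, digits, dots, underscores 
and hyphens and that variables are upper-case, are mine for the example. The 
real querier may well do this differently:

```python
import re
from typing import NamedTuple, Optional


class SplitQuery(NamedTuple):
    local_rql: str
    appid: Optional[str]
    join_var: Optional[str]
    distant_rql: Optional[str]


# '<local query>|<appid>-<VARIABLE> <distant query>'
_JOIN_RE = re.compile(
    r"^(?P<local>.*?)\|(?P<appid>[\w.-]+)-(?P<var>[A-Z][A-Z0-9]*)\s+(?P<distant>.+)$",
    re.DOTALL,
)


def split_query(query):
    """Split a query on the '|<appid>-<variable>' marker, if there is one."""
    match = _JOIN_RE.match(query)
    if match is None:
        # No join marker: the whole string is a plain local query.
        return SplitQuery(query.strip(), None, None, None)
    return SplitQuery(
        match["local"].strip(),
        match["appid"],
        match["var"],
        match["distant"].strip(),
    )


parts = split_query(
    "Any X, L, D WHERE X contains_reference Y|dbpedia-Y Y label L, Y depiction D"
)
```

Here parts.local_rql is the query to run locally, and parts.distant_rql is 
what would be sent to the instance named by parts.appid.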


The code for the join is kind of ugly, but it is pure Python (no temporary 
table). Comments on the expected behavior are welcome to help me clarify this 
part!
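
To help the discussion, here is a minimal sketch of what a pure-Python join 
on the URI could look like. It is not the code I actually use: the function 
name join_on_uri(), its parameters and the sample rows are made up for the 
example. The keep_missing flag switches between returning None for missing 
distant data and dropping the row:

```python
from collections import defaultdict


def join_on_uri(local_rows, uri_index, distant_rows, n_distant, keep_missing=True):
    """Join local rows with distant rows on the URI at local_row[uri_index].

    distant_rows are tuples (uri, value1, ..., valueN) with N == n_distant.
    """
    # Index the distant rows by URI so each local row is a dictionary lookup.
    index = defaultdict(list)
    for uri, *values in distant_rows:
        index[uri].append(values)

    joined = []
    for row in local_rows:
        matches = index.get(row[uri_index])
        if matches:
            joined.extend(list(row) + values for values in matches)
        elif keep_missing:
            # No distant data for this URI: pad with None instead of dropping.
            joined.append(list(row) + [None] * n_distant)
    return joined


# Made-up sample data in the spirit of the examples above.
local = [[871, "http://dbpedia.org/XXX"], [872, "http://dbpedia.org/ZZZ"]]
distant = [("http://dbpedia.org/XXX", "foo.png")]
rows = join_on_uri(local, 1, distant, n_distant=1)
```

With keep_missing=True the row for 872 is kept and padded with None. With 
keep_missing=False it would be dropped from the result.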


Information from DBpedia, GeoNames, etc. can now be shared across 
instances, and the queries remain valid even if the internal eids of these 
databases change.



OPEN QUESTIONS
--------------

- Which parts of the old multisource / current multisource could be reused?

- The join is currently done in pure Python: would it be worthwhile to do 
it in a SQL table instead (the table could be persistent, to avoid repeated 
create table/drop table)?

- Which API / which syntax should be used?

- Do we want the system to perform the join transparently, i.e. use a 
dictionary {base url: appid} to automatically determine which instance is 
relevant for which entities? Or do we let the user control the join through 
an explicit API?

- For now, the connection to the distant database uses the appid. Perhaps it 
could use the base-url of the instance instead?

- If the information is missing in the distant database, should we remove the 
corresponding line from the rset, return None for this information, or do 
something else?

- How can we construct the description of the rset?

- How can we plug this rset into standard views?

- Which use-cases would be interesting for YOU?
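
Regarding the {base url: appid} dictionary mentioned in the questions above, 
the lookup itself could be as simple as the following sketch. The mapping 
content and the function name appid_for_uri() are purely illustrative. 
Nothing like this exists in my code yet:

```python
# Hypothetical mapping from instance base URLs to appids.
BASE_URLS = {
    "http://dbpedia.org/": "dbpedia",
    "http://sws.geonames.org/": "geonames",
}


def appid_for_uri(uri, base_urls=BASE_URLS):
    """Return the appid whose base URL is the longest prefix of uri, or None."""
    candidates = [base for base in base_urls if uri.startswith(base)]
    if not candidates:
        return None
    # Prefer the most specific (longest) matching base URL.
    return base_urls[max(candidates, key=len)]
```

A URI that matches no base URL gives None, which leaves room for falling 
back to an explicit user-specified join.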



Best,


Vincent

