[Cubicweb] CW Multisource
vincent.michel at logilab.fr
Fri Jun 15 10:02:46 CEST 2012
Last week, Sylvain, Adrien and I discussed the best way to
implement a multisource approach in CubicWeb that might fulfill some specific
- join on URI (or cwuri) rather than eid (as it is currently done in pyro
- minimal (or even no) data from the remote instances is stored
in the local "entities" table.
- the join is specified directly in the RQL query.
For now, the conclusions are:
- Storing the URI in the database using the ExternalUri entity type. So, to
store "X is same as 'http://dbpedia.org/XXX'", we create an ExternalUri
entity with uri='http://dbpedia.org/XXX' and use this entity for relations.
This stays closer to the current behavior of CW. The main problem is the
possibly huge number of entities to store (and thus the growth of the
entities, is_instance_of_relation, is_relation, created_by_relation...
tables). But the management of metadata for huge databases should be
solved elsewhere.
- Implementing a "FROM" in RQL (staying as close as possible to the
SPARQL "FROM" specs), so that someone could execute:
    Any P, F, D WHERE P is Person, P firstname F, P sameas Y WITH D BEING
    (Any D FROM "http://dbpedia.org" WHERE X is People, X birthdate D)
The local instance executes "Any P, F, D WHERE P is Person,
P firstname F, P sameas Y", and retrieves the URIs of all the Y entities.
It then uses these URIs in the join with the distant instance, based on the
cwuri of the distant entities, for now using something like
"...., X cwuri IN (...)"
The local instance is not aware of the schema of the distant instances.
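The two-step execution described above can be sketched in plain Python. This is only an illustration under assumptions, not CubicWeb code: the helper names (build_remote_rql, join_on_uri), the extra "X cwuri U" selection used to carry the join key back, and the sample rows are all hypothetical, and rows with no remote match are simply dropped here, which is one possible choice among others.

```python
# Hypothetical sketch: none of these helpers exist in CubicWeb.

def build_remote_rql(uris):
    """Build the remote query, restricted to the URIs found locally."""
    unique = sorted(set(uris))
    if any('"' in uri for uri in unique):
        raise ValueError("this sketch does not handle quotes in URIs")
    in_list = ", ".join('"%s"' % uri for uri in unique)
    # Assumption: the cwuri is also selected (as U) so that the local
    # side has a key to join on.
    return ("Any U, D WHERE X is People, X birthdate D, "
            "X cwuri U, X cwuri IN (%s)" % in_list)


def join_on_uri(local_rows, remote_rows):
    """Join (P, F, uri) local rows with (uri, D) remote rows on the URI."""
    birthdate_by_uri = dict(remote_rows)
    return [(p, f, birthdate_by_uri[uri])
            for p, f, uri in local_rows
            if uri in birthdate_by_uri]


# Made-up result of the local part: person eid, firstname, URI of Y.
local_rows = [
    (1, "Ada", "http://dbpedia.org/A"),
    (2, "Alan", "http://dbpedia.org/B"),
]
remote_rql = build_remote_rql(uri for _, _, uri in local_rows)

# Made-up answer of the distant instance: it only knows one of the URIs.
remote_rows = [("http://dbpedia.org/A", "1815-12-10")]
rows = join_on_uri(local_rows, remote_rows)

print(remote_rql)
print(rows)
```

Only the URIs cross the wire; the local instance never needs the distant schema, it just forwards the remote part of the query with the restriction appended.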
The goal of the "FROM" clause is to:
- explicitly define the remote source of the data.
- clearly define in the RQL syntax tree which parts should be executed
locally and which parts should be executed remotely. This differs from what
is done today in CW, where the (remote) sub-requests are performed before the
main (local) request.
- identify the remote parts of the query, so that the RQL parser does not
attempt type analysis on these sub-requests. This would lead to something like
"Unknown" in the corresponding rset's description cell.
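As a rough illustration of the last two points, a sub-query node could carry its source, and the description of the result could fall back to "Unknown" for variables that come from a remote part. The SubQuery class and the describe function below are hypothetical and do not reflect the actual RQL syntax tree classes.

```python
# Hypothetical sketch: not the real RQL syntax tree.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubQuery:
    rql: str
    variables: List[str]          # variables exported by this sub-query
    source: Optional[str] = None  # None means "executed locally"

    @property
    def is_remote(self):
        return self.source is not None


def describe(selected, local_types, subqueries):
    """Return one type name per selected variable.

    Variables coming from a remote sub-query are not analysed and are
    reported as "Unknown".
    """
    remote_vars = {var
                   for sq in subqueries if sq.is_remote
                   for var in sq.variables}
    return ["Unknown" if var in remote_vars else local_types[var]
            for var in selected]


remote_part = SubQuery(
    rql="Any D WHERE X is People, X birthdate D",
    variables=["D"],
    source="http://dbpedia.org",
)
description = describe(
    selected=["P", "F", "D"],
    local_types={"P": "Person", "F": "String"},
    subqueries=[remote_part],
)
print(description)
```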
I will try to implement this and test it. For now, there will be some
limitations (e.g. we can only remotely fetch an entity with its attributes),
but further improvements may include:
- defining the expected behavior in some particular cases
(e.g. what is the rset if Y does not exist in "http://dbpedia.org"
or if D is None?)
- some cache management (in memory or in database), which should not be too
persistent (i.e. it should be cleared after each "stop", for example).
- aggregation management
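For the cache, a minimal in-memory version could look like the sketch below. It is an assumption about how such a cache might behave, not an existing CubicWeb component: the RemoteResultCache class, its fetch callback and the fake_fetch stand-in are all hypothetical, and clear() is meant to be called when the instance stops.

```python
# Hypothetical sketch of a non-persistent cache for remote results.

class RemoteResultCache:
    def __init__(self, fetch):
        self._fetch = fetch      # callable(source, rql) -> list of rows
        self._rows = {}          # (source, rql) -> cached rows
        self.remote_calls = 0    # number of real remote executions

    def execute(self, source, rql):
        """Return cached rows, querying the remote source only on a miss."""
        key = (source, rql)
        if key not in self._rows:
            self.remote_calls += 1
            self._rows[key] = self._fetch(source, rql)
        return self._rows[key]

    def clear(self):
        """Forget everything, e.g. when the instance is stopped."""
        self._rows.clear()


def fake_fetch(source, rql):
    # Stand-in for the real remote call; returns made-up rows.
    return [("http://dbpedia.org/A", "1815-12-10")]


cache = RemoteResultCache(fake_fetch)
query = "Any U, D WHERE X is People, X birthdate D, X cwuri U"
first = cache.execute("http://dbpedia.org", query)
second = cache.execute("http://dbpedia.org", query)  # served from memory
calls_before_clear = cache.remote_calls

cache.clear()
cache.execute("http://dbpedia.org", query)           # fetched again
calls_after_clear = cache.remote_calls
```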
I hope that it is clear enough, do not hesitate if you have any
Does the syntax seem clear to you? Does it seem interesting for some of your