[Cubicweb] CW Multisource

Vincent Michel vincent.michel at logilab.fr
Fri Jun 15 10:02:46 CEST 2012


Hi List,


Last week Sylvain, Adrien and I discussed the best way to implement a
multi-source approach in CubicWeb that would provide some specific
features:

 - join on URI (or cw_uri) rather than on eid (as is currently done in the
   pyro source).

 - little or no data from the remote instances is stored in the local
   "entities" table.

 - the join is specified directly in the RQL query.

 
For now, the conclusions are:

 - Store the URI in the database using the ExternalUri entity type. So, to
   store "X is same as 'http://dbpedia.org/XXX'", we create an ExternalUri
   entity with uri='http://dbpedia.org/XXX' and use this entity for relations.
   This keeps us closer to the current behavior of CW. The main problem is
   the potentially huge number of entities to store (and thus the growth of
   the entities, is_instance_of_relation, is_relation, created_by_relation...
   tables). But the issue of metadata management for huge databases should
   be solved elsewhere.

 - Implement a "FROM" clause in RQL (staying as close as possible to the
   SPARQL "FROM" specs), so that someone could execute:

     Any P, F, D WHERE P is Person, P firstname F, P sameas Y WITH D BEING
     (Any D FROM "http://dbpedia.org" WHERE X is People, X birthdate D)

   The local instance executes "Any P, F, D WHERE P is Person,
   P firstname F, P sameas Y" and retrieves the URIs of all the Y entities.
   It then uses these URIs in the join with the remote instance, based on
   the cwuri of the remote entities, for now using something like
   "...., X cwuri IN (...)"

   The local instance is not aware of the schema of the remote instances.
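
The two-step evaluation described above can be sketched in plain Python.
This is only an illustration under stated assumptions, not CubicWeb code:
run_local, run_remote, build_remote_rql, the sample rows and the URIs are
all hypothetical, and the sameas relation is read as going from the person
P to the URI entity Y. The remote query string merely follows the
"..., X cwuri IN (...)" form mentioned above.

```python
# Hypothetical in-memory stand-ins; none of these names are CubicWeb API.

def run_local():
    """Stand-in for the local part of the query.

    Returns (person eid, firstname, URI of the sameas entity) rows.
    """
    return [
        (1, "Alice", "http://dbpedia.org/A"),
        (2, "Bob", "http://dbpedia.org/B"),
        (3, "Carol", "http://dbpedia.org/C"),
    ]


# Made-up remote content: cwuri -> birthdate (may be None).
REMOTE_DATA = {
    "http://dbpedia.org/A": "1970-01-01",
    "http://dbpedia.org/B": None,
}


def build_remote_rql(uris):
    """Build the remote query text, restricted to the given URIs."""
    quoted = ", ".join('"%s"' % uri for uri in sorted(uris))
    return ("Any U, D WHERE X is People, X birthdate D, "
            "X cwuri U, X cwuri IN (%s)" % quoted)


def run_remote(uris):
    """Stand-in for the remote instance: (cwuri, birthdate) rows."""
    return [(uri, REMOTE_DATA[uri]) for uri in sorted(uris)
            if uri in REMOTE_DATA]


def federated_join():
    """Run the local part, then join with the remote rows on the URI."""
    local_rows = run_local()
    uris = {uri for _eid, _firstname, uri in local_rows}
    remote = dict(run_remote(uris))
    # Rows without a remote match are dropped here (inner join); this is
    # one possible behavior, not a decided one.
    return [(eid, firstname, remote[uri])
            for eid, firstname, uri in local_rows if uri in remote]
```

Only the URIs cross the boundary between the two instances, so the local
side never needs to know what "People" or "birthdate" mean remotely.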


The goal of the "FROM" clause is to:

 - explicitly define the remote source of the data.

 - clearly define in the RQL syntax tree which parts should be executed
   locally and which parts should be executed remotely. This differs from
   what is done today in CW, where the (remote) sub-requests are performed
   before the main (local) request.

 - identify the remote parts of the query, so that the RQL parser does not
   attempt type analysis on these sub-requests. This would lead to something
   like "Unknown" in the corresponding cell of the rset's description.
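
As a rough illustration of these goals, the sketch below separates the
local and remote parts of a query string and fills the description row.
It is a toy based on regular expressions, not the RQL parser or its syntax
tree; split_query, describe and the returned dictionary keys are
hypothetical names, and the query reads the sameas relation as going from
the person P to Y.

```python
import re

# Toy patterns for the proposed syntax; the real work would happen in the
# RQL syntax tree, not on strings.
SUBQUERY_RE = re.compile(
    r'\s+WITH\s+(?P<vars>.+?)\s+BEING\s+\((?P<body>.*)\)\s*$', re.DOTALL)
FROM_RE = re.compile(r'\s+FROM\s+"(?P<source>[^"]+)"')

QUERY = ('Any P, F, D WHERE P is Person, P firstname F, P sameas Y '
         'WITH D BEING '
         '(Any D FROM "http://dbpedia.org" WHERE X is People, X birthdate D)')


def split_query(rql):
    """Return the local part, the remote part and the remote source."""
    local_only = {"local": rql.strip(), "remote": None,
                  "source": None, "remote_vars": []}
    sub = SUBQUERY_RE.search(rql)
    if sub is None:
        return local_only
    body = sub.group("body")
    source = FROM_RE.search(body)
    if source is None:
        # A subquery without FROM stays entirely local.
        return local_only
    remote = (body[:source.start()] + body[source.end():]).strip()
    remote_vars = [var.strip() for var in sub.group("vars").split(",")]
    return {"local": rql[:sub.start()].strip(), "remote": remote,
            "source": source.group("source"), "remote_vars": remote_vars}


def describe(selected, local_types, remote_vars):
    """Build one description row; remote variables get "Unknown"."""
    return ["Unknown" if var in remote_vars else local_types[var]
            for var in selected]
```

For QUERY, split_query gives "http://dbpedia.org" as the source and
"Any D WHERE X is People, X birthdate D" as the remote part, and describe
marks only D as "Unknown".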


I will try to implement this and test it. For now, there will be some
limitations (e.g. we can only remotely fetch an entity with its attributes),
but further improvements may include:

 - defining the expected behavior in some particular cases
   (e.g. what is the rset if Y does not exist in "http://dbpedia.org"
   or if D is None?)

 - some cache management (in memory or in database) that should not be too
   persistent (e.g. it should be cleared after each "stop").

 - aggregation management

I hope this is clear enough. Do not hesitate to reply if you have any
comments, questions or issues.
Does the syntax seem clear to you? Would it be useful for some of your
applications?


Best,

Vincent
  
