[Cubicweb] Multisource in CW

Vincent Michel vincent.michel at logilab.fr
Fri May 25 09:14:08 CEST 2012


On Thursday 24 May 2012 18:19:19 Sylvain Thénault wrote:
> On 24 mai 12:09, Nicolas Chauvat wrote:
> > Hi All,
> > 
> > On Thu, May 24, 2012 at 09:26:54AM +0200, Vincent Michel wrote:
> > > - we do not include the schema at all, and let the user deal with the
> > > remote schema within the RQL request. I think that this is an
> > > interesting option if we consider that this multisource is dedicated
> > > to quick and on-the-fly joins to remote instances (with schemas that
> > > may change...), and that we do not want to migrate the local
> > > instance.
> > 
> > The key point is that in order to use CW's "base ui framework" that
> > generates a large part of the UI, we need a datamodel/schema. Quick
> > joins and on-the-fly queries without previous knowledge of the schema
> > mean that you have to code a lot by hand in the view that fetches
> > this external data.
> 
> That's a basic question: should we consider that having to define
> a schema for the external source is a problem or not? CW without schema
> information doesn't sound like CW anymore, so I hope the answer is no :)

I don't know... :) All the instances have their own schema (this is still
CW!), but I'm not sure we want the local instance to be aware of the
schemas of the remote instances.
The problem I see here is that the schema in the local database may be huge
(4 or 5 schemas grouped), and that duplicating every schema change of a
remote instance in the local instance will be painful.

But I agree that knowing the schema may be really helpful.
I don't know if it is possible to dynamically reload the schema on demand into 
the local instance:

 1 - the local instance has to execute a query on a remote instance.

 2 - the local instance retrieves the schema of the remote instance (pickle ?)
     and stores/updates it in a dynamic schema.
 
 3 - the query is performed with knowledge about all the schemas.

Nothing is stored in the local database. I don't know whether such a
behavior would be useful.
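
To make the three steps above more concrete, here is a minimal sketch in
Python. It is not CubicWeb code: DynamicSchemaCache, fetch_schema and the
dict-based schema representation are all hypothetical names chosen for
illustration, and how the remote schema is actually transferred (pickle or
otherwise) is left to the fetch_schema callable.

```python
import time


class DynamicSchemaCache:
    """Keeps remote schemas in memory only; nothing goes to the local database."""

    def __init__(self, fetch_schema, max_age=300.0, clock=time.monotonic):
        # fetch_schema is a hypothetical callable:
        # instance URI -> {entity type: iterable of attribute names}
        self._fetch_schema = fetch_schema
        self._max_age = max_age  # seconds before a cached schema is fetched again
        self._clock = clock
        self._entries = {}  # instance URI -> (fetch time, schema)

    def schema_for(self, instance_uri):
        """Step 2: retrieve the remote schema, or reuse a fresh cached copy."""
        now = self._clock()
        entry = self._entries.get(instance_uri)
        if entry is None or now - entry[0] > self._max_age:
            entry = (now, self._fetch_schema(instance_uri))
            self._entries[instance_uri] = entry
        return entry[1]

    def combined(self, local_schema, instance_uris):
        """Step 3: a merged view of the local and remote schemas for one query."""
        # Copy the local schema so that it is never modified in place.
        merged = {etype: set(attrs) for etype, attrs in local_schema.items()}
        for uri in instance_uris:
            for etype, attrs in self.schema_for(uri).items():
                merged.setdefault(etype, set()).update(attrs)
        return merged
```

The max_age parameter is only one possible way to decide when the remote
schema should be reloaded; reloading on every query would also fit the
description above.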

> 
> > > I don't have a source Geoname, I have an instance with a Geoname cube
> > > and Geoname data. Thus, it may be fetched from a URL (but for now, it
> > > is only in_memory connections). It may even be possible to think of a
> > > future improvement that allows querying SPARQL endpoints or JSONp
> > > endpoints, rather than CubicWeb instances.
> 
> So I guess you're doing things for experimental purposes, since once you
> have a geoname CW instance, you can use a regular pyrorql source, right?
> 

As explained by Nicolas, I don't have a reference to a specific eid in a 
specific instance, but rather a URI that may not correspond to 
http://baseurl:port/eid

The second point is that the remote instances may not be accessible by Pyro. 
For now I use in_memory connections, but as mentioned earlier, it may be 
more interesting to use web requests instead. This would allow some CW 
reference databases (dbpedia, geonames, ...) to be stored somewhere and 
queried using http://...?rql=...
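
As a small illustration of the web-request idea, the sketch below builds
such a URL with the Python standard library. The function name
build_rql_url and the example base URL are assumptions made for this
example only; the single thing taken from the discussion is that the query
is passed in an "rql" parameter.

```python
from urllib.parse import urlencode


def build_rql_url(base_url, rql):
    """Return base_url with the RQL query attached as the `rql` parameter."""
    # urlencode escapes spaces, commas and quotes so that the query
    # survives the trip through the URL unchanged.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode({"rql": rql})


# Hypothetical remote instance and query, for illustration only.
url = build_rql_url("http://example.org/view", "Any X WHERE X is Location")
```

Fetching and decoding the response is left out on purpose, since the format
returned by the remote instance is not settled in this thread.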


> > Allowing the query planner to mix and match CubicWeb, SPARQL and
> > JSONp... sounds interesting, but difficult.
> 
> Writing SPARQL/JSONp source shouldn't be that hard.

Good news :p

> 
> I still suspect Vincent's problem lies in the way the multi-sources query
> planner is implemented currently, eg :
> 
>  "Any X,XA WHERE Y linked_to X, X attribute A"
> 
> where X could come from an external source (but not Y) is currently
> executed with the following steps:
> 
> 1. fetch "Any X,XA WHERE X attribute A" from the external source and store
>    results in a temporary table, along with records for this query from the
>    system source
> 
> 2. execute "Any X,XA WHERE Y linked_to X, X attribute A" on the system
> source using the temporary table for X/XA
> 
> while we could want:
> 
> 1. execute "Any X WHERE Y linked_to X" on the system source
> 
> 2. retrieve XA from external sources for each X returned by the previous
> step
> 
> 3. build rset for X,XA
> 
> which would be practical in situations where the source is e.g. geonames
> or dbpedia while the former isn't.


Yes, this is the idea. The second idea is that the query planner should 
understand that in "Any X WHERE Y linked_to X" it should use the URI to 
join the two instances.
Is such a behavior doable (or is the query planner not flexible enough)?
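
To illustrate the kind of plan I have in mind, here is a rough sketch in
Python of the three steps Sylvain describes, with the URI used as the join
key. This is not how the CubicWeb query planner is written; join_on_uri and
fetch_attribute are hypothetical names, and the external source is reduced
to a simple callable.

```python
def join_on_uri(local_uris, fetch_attribute):
    """Build a result set of (URI, attribute) rows.

    local_uris: URIs of the X entities returned by the system source for
        "Any X WHERE Y linked_to X" (step 1).
    fetch_attribute: hypothetical callable asking the external source for
        the attribute of one URI (step 2); may return None if unknown.
    """
    fetched = {}  # URI -> attribute, so each URI is requested only once
    rset = []
    for uri in local_uris:
        if uri not in fetched:
            fetched[uri] = fetch_attribute(uri)
        # Step 3: one row per local result, in the original order.
        rset.append((uri, fetched[uri]))
    return rset
```

The point of this ordering is that the external source is only asked about
the URIs the local query actually returned, instead of copying all of its
matching records into a temporary table first.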


