[Cubicweb] Multisource in CW

Sylvain Thénault sylvain.thenault at logilab.fr
Fri May 25 09:53:19 CEST 2012


On 25 May 09:14, Vincent Michel wrote:
> On Thursday 24 May 2012 18:19:19 Sylvain Thénault wrote:
> > That's a basic question: should we consider that having to define
> > a schema for the external source is a problem or not? CW without schema
> > information doesn't sound like CW anymore, so I hope the answer is no :)
> 
> I don't know... :) All the instances have their own schema (this is still CW!)
> but I'm not sure we want the local instance to be aware of the schemas of
> the remote instances.
> The problem I can see here is that the schema in the local database may
> be huge (4 or 5 schemas grouped), and that any change to the schema of a
> remote instance would be painful to duplicate in the local instance.
> 
> But I agree that knowing the schema may be really helpful.
> I don't know if it is possible to dynamically reload the schema on demand into 
> the local instance:
> 
>  1 - the local instance needs to execute a query on a remote instance.
> 
>  2 - the local instance retrieves the schema of the remote instance (pickle ?)
>      and stores/updates it in a dynamic schema.
>  
>  3 - the query is performed with knowledge of all the schemas.
> 
> Nothing is stored in the local database. I don't know if such behavior would
> be interesting.
 
IMO this is interesting but should be left for later enhancements. It would
be useful for automatic UI generation (and would require the external
source to be schema aware anyway), but in real applications we know that
automatic UI generation is usually not enough to build a nice interface.
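
For the record, the three steps listed above could be sketched roughly as
follows. This is plain Python, not CubicWeb code: `DynamicSchemaCache`,
`fetch_schema`, `fake_fetch` and the dict-based schema are hypothetical names
made up for illustration, and the remote call is replaced by a stand-in.

```python
import pickle


class DynamicSchemaCache:
    """Hypothetical in-memory holder for remote schemas; nothing is persisted."""

    def __init__(self, fetch_schema):
        # fetch_schema(source_uri) -> bytes is an assumed callable that
        # returns the pickled schema of the remote instance.
        self._fetch_schema = fetch_schema
        self._schemas = {}

    def schema_for(self, source_uri, refresh=False):
        # Step 2: retrieve the remote schema on demand and keep it in memory.
        if refresh or source_uri not in self._schemas:
            payload = self._fetch_schema(source_uri)
            self._schemas[source_uri] = pickle.loads(payload)
        return self._schemas[source_uri]

    def merged(self, local_schema, source_uris):
        # Step 3: the query runs with knowledge of all the schemas.
        merged = dict(local_schema)
        for uri in source_uris:
            merged.update(self.schema_for(uri))
        return merged


def fake_fetch(source_uri):
    # Stand-in for the remote call; returns a toy schema as a pickled dict.
    return pickle.dumps({"Location": ["name", "latitude", "longitude"]})


cache = DynamicSchemaCache(fake_fetch)
schema = cache.merged({"Person": ["name"]}, ["http://example.org/geonames"])
print(sorted(schema))
```

Note that unpickling data received from another host is only safe when that
host is trusted, so a real implementation would probably want another
serialization format.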

> > > > I don't have a source Geoname, I have an instance with a Geoname cube
> > > > and Geoname data. Thus, it may be fetched from a URL (but for now, it
> > > > is only in_memory connections). It may even be possible to think of a
> > > > future improvement that allows querying SPARQL or JSONP endpoints
> > > > rather than CubicWeb instances.
> > 
> > So I guess you're doing things for experimental purposes, since once you
> > have a geoname CW instance, you can use a regular pyrorql source, right?
> 
> As explained by Nicolas, I don't have a reference to a specific eid in a
> specific instance, but rather a URI that may not correspond to
> http://baseurl:port/eid

I understand that. I'm just saying that, as I understand it, this is a POC
meant to imagine that the external source *isn't* a CW instance, and I'm
asking for confirmation.

> The second point is that the remote instances may not be accessible by Pyro.
> For now I use in_memory connections, but as mentioned earlier, it may be
> more interesting to use web requests instead, which would allow having some
> CW reference databases (dbpedia, geonames, ...) stored somewhere and
> connecting to them using http://...?rql=...

I thought the goal was to directly reach dbpedia / geoname, not a CW instance
holding their data (in which case I still fail to see the problem).

> > I still suspect Vincent's problem lies in the way the multi-sources query
> > planner is currently implemented, e.g.:
> > 
> >  "Any X,XA WHERE Y linked_to X, X attribute A"
> > 
> > where X could come from an external source (but not Y) is currently
> > executed with the following steps:
> > 
> > 1. fetch "Any X,XA WHERE X attribute A" from the external source and store
> >    results in a temporary table, along with records for this query from the
> >    system source
> > 
> > 2. execute "Any X,XA WHERE Y linked_to X, X attribute A" on the system
> > source using the temporary table for X/XA
> > 
> > while we could want:
> > 
> > 1. execute "Any X WHERE Y linked_to X" on the system source
> > 
> > 2. retrieve XA from external sources for each X returned by the previous
> > step
> > 
> > 3. build rset for X,XA
> > 
> > which would be practical in situations where the source is e.g. geonames
> > or dbpedia, while the former plan isn't.
> 
> 
> Yes, this is the idea. The second idea is that the query planner should 
> understand that in "Any X WHERE Y linked_to X" it should use the URI to join 
> between the two instances.
> Is it feasible to implement such behavior (or is the query planner not
> flexible enough)?

I don't get your point here. Think of the local eid as an implementation
detail that exists for performance reasons (we tried using strings as keys a
while ago; moving to ints had a huge performance impact). It's the source's job
to translate the eid into the URI and then to query whatever source is needed
to do the join.
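
To make that concrete, here is a minimal sketch of the alternative plan
discussed above, with the eid-to-URI translation kept inside the source. It is
plain Python and not the CubicWeb source or planner API: `ExternalSource`,
`fetch_attribute`, `execute_plan` and the example.org URIs are all
hypothetical, and the remote service is replaced by a dict.

```python
class ExternalSource:
    """Hypothetical external source; the planner only ever sees local eids."""

    def __init__(self, eid_to_uri, remote_data):
        self._eid_to_uri = eid_to_uri    # local int eid -> remote URI
        self._remote_data = remote_data  # stands in for geonames / dbpedia

    def fetch_attribute(self, eid):
        # The source translates the eid into the URI, then queries the
        # remote side with it.
        uri = self._eid_to_uri[eid]
        return self._remote_data[uri]


def execute_plan(linked_eids, source):
    # linked_eids is assumed to be the result of step 1, i.e. of
    # "Any X WHERE Y linked_to X" executed on the system source.
    # Step 2: retrieve XA from the external source for each X.
    # Step 3: build the rset for X,XA.
    return [(eid, source.fetch_attribute(eid)) for eid in linked_eids]


source = ExternalSource(
    eid_to_uri={12: "http://example.org/geo/paris",
                15: "http://example.org/geo/toulouse"},
    remote_data={"http://example.org/geo/paris": "Paris",
                 "http://example.org/geo/toulouse": "Toulouse"},
)
rset = execute_plan([12, 15], source)
print(rset)
```

The join itself only manipulates ints; URIs show up solely inside the source.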
 
-- 
Sylvain Thénault, LOGILAB, Paris (01.45.32.03.12) - Toulouse (09.54.03.55.76)
Python, Debian, Agile methods training:  http://www.logilab.fr/formations
Custom software development:             http://www.logilab.fr/services
CubicWeb, the semantic web framework:    http://www.cubicweb.org

