[Cubicweb] Multisource in CW

Vincent Michel vincent.michel at logilab.fr
Thu May 24 09:26:54 CEST 2012

Hi all, 

Thanks for the feedback!

On Wednesday 23 May 2012 12:41:22 Nicolas Chauvat wrote:
> On Wed, May 23, 2012 at 10:24:32AM +0200, Sylvain Thénault wrote:
> > > rset = rql('Any X, L, D WHERE X contains_reference Y|dbpedia-Y Y label
> > > L, Y depiction D')
> > 
> > Have you actually implemented this?
> Yes.

Yes, the ("surprising" :) ) 'relate' API has been implemented.
The main idea behind this API is to delegate to CubicWeb the creation of the 
empty entities that are defined only by a URI (and relations based on URIs 
rather than EIDs seem interesting from a semantic point of view).
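
To make the idea concrete, here is a minimal sketch of relating entities by 
URI while letting the store create the empty placeholder entities. It is not 
the actual CubicWeb 'relate' implementation: UriStore, get_or_create and the 
signature of relate are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UriStore:
    """Toy in-memory stand-in for the local instance (hypothetical)."""

    eid_by_uri: dict = field(default_factory=dict)
    relations: set = field(default_factory=set)
    _next_eid: int = 1

    def get_or_create(self, uri):
        """Return the eid of the placeholder entity for `uri`, creating it if needed."""
        if uri not in self.eid_by_uri:
            self.eid_by_uri[uri] = self._next_eid
            self._next_eid += 1
        return self.eid_by_uri[uri]

    def relate(self, subject_uri, rtype, object_uri):
        """Record a relation between two entities identified only by their URIs."""
        subj = self.get_or_create(subject_uri)
        obj = self.get_or_create(object_uri)
        self.relations.add((subj, rtype, obj))
        return subj, obj
```

The caller never handles EIDs: two documents that reference the same URI end 
up pointing at the same placeholder entity, whatever its internal identifier.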

> > Are 'label' and 'depiction' defined in the schema of 'Thing'?
> No.

No. For now, the local instance does not know the schema of the remote 
instance. This is perhaps a critical point to discuss: do we want the local 
instance to be aware of the remote schema?
For me, there are two possibilities:

- we do not include the schema at all, and let the user deal with the remote 
schema within the RQL query. I think this is an interesting option if 
we consider that this multisource is dedicated to quick, on-the-fly joins 
with remote instances (whose schemas may change...), and that we do not 
want to migrate the local instance.

- we include the schema of the remote instance WITHOUT creating tables. 
Indeed, storing the remote entities in the local database may be interesting 
for a few thousand entities, but with Dbpedia or Geonames, one may pollute 
the local instance with hundreds of thousands of entities. Including the 
schema of the remote instance may be interesting in order to delegate the 
interpretation of the RQL to CubicWeb (however, if we have two remote 
instances with similar schemas but different data, we may want to specify 
which remote instance to use, and then the interpretation of the RQL will no 
longer be useful).

> > > Information of Dbpedia, Geonames, etc... can now be mutualized across
> > > instances, and, even if the internal eids of these databases changed,
> > > the queries are still valid.
> > 
> > * the source abstraction has been introduced to be able to code
> >   application independently from its data sources. And this is imo
> >   valuable and kept in mind, even if we may need specific api/rql
> >   syntax to allow application specific optimization
> Yes.

This is another point that I want to discuss: do we want to store the 
information about the remote instance in the local instance, or do we delegate 
this definition to the RQL query? IMO, Nicolas's idea is closer to an 
on-the-fly behavior:

Any X,L,D WHERE X contains_reference Y WITH Y,L,D
BEING (Any Y,L,D FROM dbpedia WHERE Y label L, Y depiction D)

and with the source given as a URI: appid://dbpedia or http://... or cwsource://dbpedia
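
As an illustration of how such a source reference could be interpreted, the 
sketch below splits it into a scheme and a source name using only the Python 
standard library. The set of accepted schemes simply mirrors the three 
examples above, and parse_source is a hypothetical helper, not an existing 
CubicWeb function.

```python
from urllib.parse import urlsplit

# Schemes taken from the examples above; what each one should mean
# (in-memory connection, HTTP access, registered source...) is left open.
KNOWN_SCHEMES = {"appid", "http", "https", "cwsource"}


def parse_source(uri):
    """Split a source reference into (scheme, name); hypothetical helper."""
    parts = urlsplit(uri)
    if parts.scheme not in KNOWN_SCHEMES:
        raise ValueError("unsupported source scheme: %r" % parts.scheme)
    if not parts.netloc:
        raise ValueError("missing source name in %r" % uri)
    return parts.scheme, parts.netloc
```

The querier would then choose how to reach the remote instance from the 
scheme alone, without anything being declared in the local instance.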

> > * I'm not sure we need all that specific stuff and not reusing existing
> >   abstractions:
> >
> >   - provided you've a e.g. geoname source which is able to fetch
> >     attributes from an url
> In Vincent's demo, it was an instance of cubicweb running a geonames
> cubes and loaded with geonames data from the dump downloaded on their
> website.

I don't have a Geonames source; I have an instance with a Geonames cube and 
Geonames data. Thus, it may be fetched from a URL (but for now, only 
in_memory connections are supported). A future improvement could even allow 
querying SPARQL endpoints or JSONP endpoints rather than CubicWeb instances. 
In a nutshell, it is perhaps better to keep the reference to a CubicWeb 
instance as weak as possible.

> >   - no data stored in entity type tables
> > 
> > ...
> Could it be interesting to allow any entity to be related to a Thing
> (defined by a URL) and have some kind of Datafeed fetch the
> information in the background and make a local copy (reading the
> schema of the remote instance and creating cw_* tables when needed) ?

Making a local copy will be painful as soon as we use huge remote 
instances. Moreover, it will depend on the schema of the remote instance.
E.g. if, in the remote Dbpedia instance, the attribute is renamed from 
"depiction" to "thumbnail", IMO it is easier to change "depiction" to 
"thumbnail" in the few RQL queries that use it than to migrate a 
(possibly huge...) SQL table to rename one attribute.

However, two things may be useful:

- a local caching system that avoids performing the same or similar 
queries multiple times (but which is not stored in the database).

- allowing any entity to be related to a Thing (defined by a URL).
Thus, any relation may have a Thing as its object.
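
The first point could look like the following sketch: an in-memory cache keyed 
by source and query text, with a time-to-live so that stale remote data 
eventually expires. QueryCache, its fetch method and the execute callback are 
all hypothetical, and treating "similar" queries as queries that differ only 
in whitespace is an assumption made here for simplicity.

```python
import time


class QueryCache:
    """In-memory cache of remote query results; nothing is written to the database."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl = ttl          # lifetime of an entry, in seconds
        self.clock = clock      # injectable clock, convenient for testing
        self._entries = {}      # (source, normalized rql) -> (timestamp, result)

    @staticmethod
    def _key(source, rql):
        # Collapse whitespace so trivially different spellings share one entry.
        return source, " ".join(rql.split())

    def fetch(self, source, rql, execute):
        """Return the cached result, or call execute(source, rql) and cache it."""
        key = self._key(source, rql)
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        result = execute(source, rql)
        self._entries[key] = (now, result)
        return result
```

Since the cache lives only in the process memory, restarting the instance 
simply empties it and no migration is ever needed.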

Thanks again for the comments! I will take a look at the RQL querier to see 
what I can do with it!


