[Cubicweb] Multisource in CW
vincent.michel at logilab.fr
Thu May 24 09:26:54 CEST 2012
Thanks for the feedback!
On Wednesday 23 May 2012 12:41:22 Nicolas Chauvat wrote:
> On Wed, May 23, 2012 at 10:24:32AM +0200, Sylvain Thénault wrote:
> > > rset = rql('Any X, L, D WHERE X contains_reference Y|dbpedia-Y Y label
> > > L, Y depiction D')
> > Have you actually implemented this?
Yes, the ("surprising" :) ) 'relate' api has been implemented.
The main idea behind this API is to delegate to CubicWeb the creation of the
empty entities that are only defined by a URI (and relations based on URIs
rather than EIDs seem interesting from a semantic point of view).
> > Are 'label' and 'depiction' defined in the schema of 'Thing'?
No. For now, the local instance does not know the schema of the remote
instance. This is perhaps a critical point to discuss: do we want the local
instance to be aware of the remote schema ?
For me, there are two possibilities:
- we do not include at all the schema, and let the user deal with the remote
schema within the RQL request. I think that this is an interesting option if
we consider that this multisource is dedicated to quick and on-the-fly joins
to remote instances (with schemas that may change...), and that we do not
want to migrate the local instance.
- we include the schema of the remote instance WITHOUT creating tables.
Indeed, storing the remote entities in the local database may be interesting
for a few thousand entities, but with Dbpedia or Geonames, one may pollute
the local instance with hundreds of thousands of entities. Including the
schema of the remote instance may be interesting to delegate to CubicWeb the
interpretation of the RQL (however, if we have two remote instances with
similar schemas but without the same data, we may want to specify the remote
instance that we want to use, and thus, the interpretation of the RQL will not
be useful anymore).
> > > Information of Dbpedia, Geonames, etc... can now be mutualized across
> > > instances, and, even if the internal eids of these databases changed,
> > > the queries are still valid.
> > * the source abstraction has been introduced to be able to code
> > application independantly from its data sources. And this is imo
> > valuable and kept in mind, even if we may need specific api/rql syntax
> > to allow application specific optimization
This is another point that I want to discuss: do we want to store the information
about the remote instance in the local instance, or do we delegate this
definition within the RQL query ? IMO, the idea of Nicolas is closer to a on-
Any X,L,D WHERE X contains_reference Y WITH Y,L,D
BEING (ANY L,D FROM dbpedia WHERE Y label L, Y depiction D )
and with a URN: appid://dbpedia or http://... or cwsource://dbpedia
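To make the three kinds of reference above a bit more concrete, here is a
minimal sketch of how they could be told apart before the subquery is sent.
The function name and the returned tuples are assumptions made up for this
mail; none of this is an existing CubicWeb API.

```python
from urllib.parse import urlsplit


def describe_source(ref):
    """Classify a source reference by its URL scheme (hypothetical helper).

    Returns a (kind, target) tuple:
    - ("appid", name) or ("cwsource", name) for a named instance/source,
    - ("remote-url", ref) for a plain http(s) address.
    """
    parts = urlsplit(ref)
    if parts.scheme in ("http", "https"):
        # Keep the whole URL: the remote instance is addressed directly.
        return ("remote-url", ref)
    if parts.scheme in ("appid", "cwsource"):
        # The part after '//' is taken as the instance or source name.
        return (parts.scheme, parts.netloc)
    raise ValueError("unsupported source reference: %r" % ref)
```

With this, "appid://dbpedia" and "cwsource://dbpedia" both resolve to the name
"dbpedia", and only the lookup mechanism differs.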
> > * I'm not sure we need all that specific stuff and not reusing existing
> > abstractions:
> > - provided you've a e.g. geoname source which is able to fetch
> > attributes from an url
> In Vincent's demo, it was an instance of cubicweb running a geonames
> cubes and loaded with geonames data from the dump downloaded on their
I don't have a Geonames source, I have an instance with a Geonames cube and
Geonames data. Thus, it may be fetched from a URL (but for now, it is only
in_memory connections). It may even be possible to think of a future improvement
that allows querying SPARQL endpoints or JSONp endpoints, rather than CubicWeb
In a nutshell, it is perhaps better to keep a reference to a CubicWeb instance
as weak as possible.
> > - no data stored in entity type tables
> > ...
> Could it be interesting to allow any entity to be related to a Thing
> (defined by a URL) and have some kind of Datafeed fetch the
> information in the background and make a local copy (reading the
> schema of the remote instance and creating cw_* tables when needed) ?
Making a local copy will be painful as soon as we use huge remote
instances. Moreover, it will depend on the schema of the distant instance.
E.g. if in the remote Dbpedia instance, the attribute is changed from
"depiction" to "thumbnail", IMO it is easier to change "depiction" to
"thumbnail" in the few RQL queries that use it, rather than migrate a
(possibly huge...) SQL table to rename one attribute.
However, two things may be useful:
- a local cache handling system that avoids performing multiple similar
queries (but which is not stored in the database).
- to allow any entity to be related to a Thing (defined by a URL).
Thus, any relation may have Thing as object.
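For the cache, a minimal sketch of what I have in mind could look like the
following. It only handles strictly identical queries, keeps everything in
memory, and drops results after a delay. The class name, the "fetch" callable
and the default delay are assumptions for illustration, not existing CubicWeb
code.

```python
import time


class RemoteQueryCache:
    """In-memory cache of remote query results (nothing goes to the database)."""

    def __init__(self, fetch, ttl=300.0, clock=time.monotonic):
        self._fetch = fetch    # hypothetical callable: (source, rql) -> rows
        self._ttl = ttl        # seconds a cached result stays valid
        self._clock = clock    # injectable clock, handy for testing
        self._entries = {}     # (source, rql) -> (expiry time, rows)

    def execute(self, source, rql):
        """Return cached rows for (source, rql), querying the remote if needed."""
        key = (source, rql)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        # Missing or expired: ask the remote instance and remember the answer.
        rows = self._fetch(source, rql)
        self._entries[key] = (now + self._ttl, rows)
        return rows
```

The source name is part of the key, so the same RQL sent to two remote
instances with similar schemas but different data is cached separately.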
Thanks again for the comments! I will take a look at the RQL querier to see
what I can do with it!