[Cubicweb] CWEP 0003 - Clause FROM

Vincent Michel vincent.michel at logilab.fr
Mon Feb 10 16:42:13 CET 2014

Hi List,

Here are some details on the "Clause FROM" which aims at allowing 
multisources queries
in RQL/CW.

Feedbacks/possible use cases welcome !



  CWEP-0003 Clause FROM

:champion: Vincent Michel
:reviewers: COPIL
:last update: 2014/02/05
:status: draft


A recurent need in some CubicWeb projects is to be able to query distant 
databases within an RQL request,
in order to enrich dynamically a query with data from instances or services.

In the following, we call:

* *local instance* the CubicWeb instance that is the entry point of an 
   and where the RQL queries start;

* *remote service* another CW instance (or more generally a service as 
we might
   want to query a sparql endpoint or a web service), that may contain 
useful information
   for the *local instance*.

Use cases

Exemples of use cases:

* Example 1 - a *local instance* contains data about medical studies, 
where patients
   are linked to diseases only defined by an URI::

        P is Patient, P related_diseases D, D is Disease, D uri 

   A *remote service* contains detailed data about diseases, e.g. the 
mortality rate
   of diseases. We would like to filter the patients of the *local 
instance* based on the mortality rate
   retrieved from the distance instance::

       Any P WHERE P related_diseases D
                   WITH D BEING (Any X WHERE X mortality_rate > 0.5)
                          FROM "http://baseurl"

* Example 2 - a *local instance* contains information about authors, and 
we want to query the thumbnails
   of the authors with a name as "Victor %"::

       Any T WHERE A name N, A name LIKE "Victor %"
                   WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
                          FROM "http://baseurl"

* Example 3 - a *local instance* contains information about news 
articles, with links to identified objects
   in the text::

      X is NewsArticle, X recognized_entities E, E uri 

   We want to filter the news articles based thata are about any 
President of the USA, retrieving
   the data from the Dbpedia sparqlendpoint::

       select distinct ?c where
       {?c dcterms:subject 


       Any X WHERE X is NewsArticle, X recognized_entities E
                   WITH E FROM "http://dbpedia.org/sparql?q=select 
distinct ?c where
                               {?c dcterms:subject

* Example 4 - Same case as example 3, but we could use a more specific API::

       Any X, E WHERE X is NewsArticle, X recognized_entities E
                    WITH E FROM "http://some.site.com/api/presidents"


API Specifications

 From these examples, we can define the following specifications for the 

* Retrieve data from CW instance using a RQL query defined in the 
`BEING` clause
   and a base url defined in the `FROM` clause (example 1);

* Retrieve data from Sparql endpoint using a SPARQL query defined as a 
string in the `BEING` clause
   and a base url defined in the `FROM` clause (example 3);

* Be able to join the data in the *local instance* (examples 1 and 3) or 
in the *remote service*
   (example 2). Indeed, in the example 2, we do not want to get ALL the 
thumbnails for all the authors
   from the distante instance, but we would rather only send all the 
names that verify "Victor %"
   to the *remote service*.

* Sources may be defined:

   * by an url, e.g. "http://www.cubicweb.org" or 

   * by an appid, if running on the same machine, e.g. "dbpedia", 
     In this case, this could be the name of specific CWSource.

Implementation specifications

The following specifications come from different use cases:

* The request must be **fully dynamic**. We don't want to store any data 
of a *remote service* on the *local instance* (for memory sake, or 
because we want a query to reflect the last changes in a *remote service*).

* The *local* and *remote* instances should be fully separated. We don't 
want any dependancies between them, in terms of code (e.g. in the 
schema) or availability of the instances (i.e. if a *remote service* is 
down, the *locale instance* should still works).

* The *local instance* will only work with base types for data from 
*remote services*, i.e. the notions of Etype is not transferable (at 
least, the eid will be integer). I.e. if the description of the rset on 
the distante instance is::

    [['Project'], ['Person'], ['Float'], ['String'],]

The description that will be used by the locale instance is:

    [['Int'], ['Int'], ['Float'], ['String'],]

It could be interesting to think about a potential (and NOT mandatory) 
cache on some requests (Postgres temporary table, more high-level cache 
on HTTP requests, ...)
Foreign Data Wrapper are not directly usable for this CWEP, as they 
required the definition of a specific table schema, and we do not have 
it before executing the query, especially in the case of SPARQL query/ API.
Moreover, this may be too intrusive on the locale instance in terms of 

Data join

Two possible joins are possible, on attributes or on etypes:

* the join on attributes is straitghforward::

       Any T WHERE A name N, A name LIKE "Victor %"
                   WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
                          FROM "http://baseurl"

   In this case, we join on the value of N. For now, this is not 
supported by RQL
   (see "Known limitations" below).

* the join on entities is a syntatic sugar based  on URI.
   It could be interesting to automatically join the entities based on 
their `cwuri` if the variable
   use to join is an etype::

       Any T WHERE A name "Victor %", A same_as Y
                   WITH T BEING (Any T WHERE Y thumbnails T)
                          FROM "http://baseurl"

   In this case, we send to the *remote service* all the uris of 
entities similar
   to entities having a name matching "Victor %". The uris will be used 
in the *remote* query.
   This could be rewriten as::

       Any T WHERE A name "Victor %", A same_as X, X cwuri U
                   WITH T BEING (Any T WHERE Y cwuri U, Y thumbnails T)
                          FROM "http://baseurl"

Proposition of RQL

Simple cases (pure remote query)::

      Any N WITH N BEING (Any X WHERE X is Project, X name N) FROM 

      Any X WITH X BEING (Any X WHERE X is Dbpediapage) FROM "dbpedia"

Join in query::

      Any X,CC WHERE X same_as Y WITH CC BEING (Any CC WHERE Y country 
CC) FROM "geonames"

      Any N, M WITH N BEING (Any X WHERE X is Project, X name N)
                      FROM "http://www.cubicweb.org"
                    M BEING (Any X WHERE X is Project, X name N)
                      FROM "http://www.cubicweb.org"

Remote join in query::

       Any T WHERE A name N, A name "Victor %"
                   WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
                          FROM "http://baseurl" JOINBY N

       Any T WHERE A name "Victor %", A same_as Y
                   WITH T BEING (Any T WHERE Y thumbnails T)
                          FROM "http://baseurl" JOINBY T


In PostgreSQL, it is possible to execute::

(VALUES('cubicweb-mobile'), ('cubicweb-workcase'),
('pyreverse'), ('yams'))
                       AS _X (c0)) AS _T0;

Thus it is possible to let CubicWeb doing all the join logic by simply 
replacing the function
that generates the SQL statement for the subqueries.


A proof-of-concept implementation is available here:

* http://hg.logilab.org/users/vmichel/cubicweb

* http://hg.logilab.org/users/vmichel/rql

For RQL, it implements:

* the grammar/tests for ``FROM`` in RQL;

* the logic of annotations/checking of the RQL syntax tree;

For CubicWeb, it implements:

* the execution of the remote query and the join with the local query;

Known limitations

The `WITH` subqueries in CubicWeb create a new variables scope, which 
the join on *remote services*.
This is currently the most important limitation, as the join on *remote 
services* is one of the biggest use case of the `FROM` clause.

Possible improvements

In the specific case of a *remote service* based on RQL, it could be 
interesting to implement a specific API used by the *local* querier to 
retrieve statistics on the *remote* execution plan, thus allowing to 
decide if the join by be done on the *local* instance or the *remote* one.


Some exemples and references:

* Foreign Data Wrapper::

       (username text, pw text, uid int, gid int,
       comments text, homedir text, shell text)
       SERVER file_fdw
       OPTIONS (format 'csv', delimiter ':',
       filename '/etc/passwd');

   For more details, see:

   * http://multicorn.org/foreign-data-wrappers/#idrss-foreign-data-wrapper

   * http://people.planetpostgresql.org/andrew/uploads/fdw2.pdf

* SPARQL Service::

       PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
       SELECT ?name
       FROM <http://example.org/myfoaf.rdf>
          <http://example.org/myfoaf/I> foaf:knows ?person .
          SERVICE <http://people.example.org/sparql> {
            ?person foaf:name ?name . }

    For more details, see::

    * http://www.w3.org/TR/sparql11-federated-query/

More information about the Cubicweb mailing list