[Cubicweb] CWEP 0003 - Clause FROM
Vincent Michel
vincent.michel at logilab.fr
Mon Feb 10 16:42:13 CET 2014
Hi List,
Here are some details on the "Clause FROM" which aims at allowing
multisources queries
in RQL/CW.
Feedbacks/possible use cases welcome !
Best,
Vincent
=======================
CWEP-0003 Clause FROM
=======================
:champion: Vincent Michel
:reviewers: COPIL
:last update: 2014/02/05
:status: draft
Introduction
============
A recurent need in some CubicWeb projects is to be able to query distant
databases within an RQL request,
in order to enrich dynamically a query with data from instances or services.
In the following, we call:
* *local instance* the CubicWeb instance that is the entry point of an
application,
and where the RQL queries start;
* *remote service* another CW instance (or more generally a service as
we might
want to query a sparql endpoint or a web service), that may contain
useful information
for the *local instance*.
Use cases
~~~~~~~~~
Exemples of use cases:
* Example 1 - a *local instance* contains data about medical studies,
where patients
are linked to diseases only defined by an URI::
P is Patient, P related_diseases D, D is Disease, D uri
"http://dbpedia.org/resource/Epilepsy"
A *remote service* contains detailed data about diseases, e.g. the
mortality rate
of diseases. We would like to filter the patients of the *local
instance* based on the mortality rate
retrieved from the distance instance::
Any P WHERE P related_diseases D
WITH D BEING (Any X WHERE X mortality_rate > 0.5)
FROM "http://baseurl"
* Example 2 - a *local instance* contains information about authors, and
we want to query the thumbnails
of the authors with a name as "Victor %"::
Any T WHERE A name N, A name LIKE "Victor %"
WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
FROM "http://baseurl"
* Example 3 - a *local instance* contains information about news
articles, with links to identified objects
in the text::
X is NewsArticle, X recognized_entities E, E uri
"http://dbpedia.org/resource/Barack_Obama"
We want to filter the news articles based thata are about any
President of the USA, retrieving
the data from the Dbpedia sparqlendpoint::
select distinct ?c where
{?c dcterms:subject
<http://dbpedia.org/resource/Category:Presidents_of_the_United_States>}
i.e.::
Any X WHERE X is NewsArticle, X recognized_entities E
WITH E FROM "http://dbpedia.org/sparql?q=select
distinct ?c where
{?c dcterms:subject
<http://dbpedia.org/resource/Category:Presidents_of_the_United_States>}"
* Example 4 - Same case as example 3, but we could use a more specific API::
Any X, E WHERE X is NewsArticle, X recognized_entities E
WITH E FROM "http://some.site.com/api/presidents"
Specifications
==============
API Specifications
~~~~~~~~~~~~~~~~~~
From these examples, we can define the following specifications for the
API:
* Retrieve data from CW instance using a RQL query defined in the
`BEING` clause
and a base url defined in the `FROM` clause (example 1);
* Retrieve data from Sparql endpoint using a SPARQL query defined as a
string in the `BEING` clause
and a base url defined in the `FROM` clause (example 3);
* Be able to join the data in the *local instance* (examples 1 and 3) or
in the *remote service*
(example 2). Indeed, in the example 2, we do not want to get ALL the
thumbnails for all the authors
from the distante instance, but we would rather only send all the
names that verify "Victor %"
to the *remote service*.
* Sources may be defined:
* by an url, e.g. "http://www.cubicweb.org" or
"http://www.dbpedia.org/sparql";
* by an appid, if running on the same machine, e.g. "dbpedia",
"geonames".
In this case, this could be the name of specific CWSource.
Implementation specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following specifications come from different use cases:
* The request must be **fully dynamic**. We don't want to store any data
of a *remote service* on the *local instance* (for memory sake, or
because we want a query to reflect the last changes in a *remote service*).
* The *local* and *remote* instances should be fully separated. We don't
want any dependancies between them, in terms of code (e.g. in the
schema) or availability of the instances (i.e. if a *remote service* is
down, the *locale instance* should still works).
* The *local instance* will only work with base types for data from
*remote services*, i.e. the notions of Etype is not transferable (at
least, the eid will be integer). I.e. if the description of the rset on
the distante instance is::
[['Project'], ['Person'], ['Float'], ['String'],]
The description that will be used by the locale instance is:
[['Int'], ['Int'], ['Float'], ['String'],]
It could be interesting to think about a potential (and NOT mandatory)
cache on some requests (Postgres temporary table, more high-level cache
on HTTP requests, ...)
Foreign Data Wrapper are not directly usable for this CWEP, as they
required the definition of a specific table schema, and we do not have
it before executing the query, especially in the case of SPARQL query/ API.
Moreover, this may be too intrusive on the locale instance in terms of
memory.
Data join
~~~~~~~~~
Two possible joins are possible, on attributes or on etypes:
* the join on attributes is straitghforward::
Any T WHERE A name N, A name LIKE "Victor %"
WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
FROM "http://baseurl"
In this case, we join on the value of N. For now, this is not
supported by RQL
(see "Known limitations" below).
* the join on entities is a syntatic sugar based on URI.
It could be interesting to automatically join the entities based on
their `cwuri` if the variable
use to join is an etype::
Any T WHERE A name "Victor %", A same_as Y
WITH T BEING (Any T WHERE Y thumbnails T)
FROM "http://baseurl"
In this case, we send to the *remote service* all the uris of
entities similar
to entities having a name matching "Victor %". The uris will be used
in the *remote* query.
This could be rewriten as::
Any T WHERE A name "Victor %", A same_as X, X cwuri U
WITH T BEING (Any T WHERE Y cwuri U, Y thumbnails T)
FROM "http://baseurl"
Proposition of RQL
~~~~~~~~~~~~~~~~~~
Simple cases (pure remote query)::
Any N WITH N BEING (Any X WHERE X is Project, X name N) FROM
"http://www.cubicweb.org"
Any X WITH X BEING (Any X WHERE X is Dbpediapage) FROM "dbpedia"
Join in query::
Any X,CC WHERE X same_as Y WITH CC BEING (Any CC WHERE Y country
CC) FROM "geonames"
Any N, M WITH N BEING (Any X WHERE X is Project, X name N)
FROM "http://www.cubicweb.org"
M BEING (Any X WHERE X is Project, X name N)
FROM "http://www.cubicweb.org"
Remote join in query::
Any T WHERE A name N, A name "Victor %"
WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
FROM "http://baseurl" JOINBY N
Any T WHERE A name "Victor %", A same_as Y
WITH T BEING (Any T WHERE Y thumbnails T)
FROM "http://baseurl" JOINBY T
Implementation
==============
In PostgreSQL, it is possible to execute::
SELECT _T0.C0 FROM (SELECT _X.c0 AS C0 FROM
(VALUES('cubicweb-mobile'), ('cubicweb-workcase'),
...
('pyqonsole'),
('pyreverse'), ('yams'))
AS _X (c0)) AS _T0;
Thus it is possible to let CubicWeb doing all the join logic by simply
replacing the function
that generates the SQL statement for the subqueries.
POC
~~~
A proof-of-concept implementation is available here:
* http://hg.logilab.org/users/vmichel/cubicweb
* http://hg.logilab.org/users/vmichel/rql
For RQL, it implements:
* the grammar/tests for ``FROM`` in RQL;
* the logic of annotations/checking of the RQL syntax tree;
For CubicWeb, it implements:
* the execution of the remote query and the join with the local query;
Known limitations
~~~~~~~~~~~~~~~~~
The `WITH` subqueries in CubicWeb create a new variables scope, which
the join on *remote services*.
This is currently the most important limitation, as the join on *remote
services* is one of the biggest use case of the `FROM` clause.
Possible improvements
~~~~~~~~~~~~~~~~~~~~~
In the specific case of a *remote service* based on RQL, it could be
interesting to implement a specific API used by the *local* querier to
retrieve statistics on the *remote* execution plan, thus allowing to
decide if the join by be done on the *local* instance or the *remote* one.
References
~~~~~~~~~~
Some exemples and references:
* Foreign Data Wrapper::
CREATE FOREIGN TABLE pw
(username text, pw text, uid int, gid int,
comments text, homedir text, shell text)
SERVER file_fdw
OPTIONS (format 'csv', delimiter ':',
filename '/etc/passwd');
For more details, see:
* http://multicorn.org/foreign-data-wrappers/#idrss-foreign-data-wrapper
* http://people.planetpostgresql.org/andrew/uploads/fdw2.pdf
* SPARQL Service::
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
FROM <http://example.org/myfoaf.rdf>
WHERE
{
<http://example.org/myfoaf/I> foaf:knows ?person .
SERVICE <http://people.example.org/sparql> {
?person foaf:name ?name . }
}
For more details, see::
* http://www.w3.org/TR/sparql11-federated-query/
More information about the Cubicweb
mailing list