[Cubicweb] experimenting with a DataFeed entity

Florent Cayré florent at secondweb.fr
Mon Feb 15 09:44:17 CET 2010


Hi Nicolas,

this experiment is a good start towards using LinkedData effectively,
which is a challenge because:

* we need to reference entities in someone else's database and use them
efficiently (there are of course performance issues => caching is needed);

* we want to keep data as up to date as possible, ideally with our copy
identical to the original (=> caching becomes a problem).

As far as I understand, you are addressing the performance issues with
caching, and keeping the data up to date through regular polling.
I would propose two other *complementary* approaches to the data
freshness issue:

* web hooks (http://www.webhooks.org/): in the long term, we need a way to
be notified when the remote data changes; the problem is that this requires
a standard, so it is probably not a short-term solution, although I think it
is necessary to promote LinkedData usage;

* browser data freshness check: each time we use a cached copy of remote
data, we could ask the user's browser (through a JavaScript snippet) to check
data freshness by querying the original entity's last modification date (or
similar) on the original website, and use this information to refresh our
copy if needed. We just need an efficient way to query an entity's
modification date.
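
To make the freshness comparison concrete, here is a minimal sketch in
Python of the decision itself (the browser-side query is left out). It
assumes the original site exposes an HTTP Last-Modified header, which is only
one possible way to publish a modification date; the function name is_stale
and the sample dates are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_stale(cached_at, last_modified_header):
    """Return True if the origin changed after our copy was made.

    cached_at: timezone-aware datetime of our last retrieval.
    last_modified_header: value of an HTTP Last-Modified header
    (assumption: the origin publishes one).
    """
    remote_modified = parsedate_to_datetime(last_modified_header)
    return remote_modified > cached_at


# Hypothetical example values.
cached_at = datetime(2010, 2, 10, 12, 0, tzinfo=timezone.utc)
print(is_stale(cached_at, "Mon, 15 Feb 2010 08:44:17 GMT"))
print(is_stale(cached_at, "Mon, 08 Feb 2010 08:00:00 GMT"))
```

The first call reports a stale copy, the second a fresh one.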

What do you think?
Florent.

2010/2/10 Nicolas Chauvat <nicolas.chauvat at logilab.fr>

> Hi List,
>
> Here is what I did last week as an experiment in continuous data
> integration. The basic goal is to have in my app some web data
> that is published and continuously updated by a third-party web
> site.
>
> I defined a DataFeed entity::
>
>  from yams.buildobjs import EntityType, String, Datetime
>
>  class DataFeed(EntityType):
>      title = String()
>      url = String(required=True)
>      parser = String(required=True)
>      refresh_time = String()
>      latest_retrieval = Datetime()
>
> Then I have a script named update-feeds.py that I run by hand for now:
>
>  $ cubicweb-ctl shell myapp update-feeds.py
>
> It fetches all the DataFeeds and looks up the parser in a table of
> functions defined in the same script. Then it runs the given parser on
> the given url.
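
The lookup described above could be sketched as follows. This is not the
actual update-feeds.py: the parser functions, the PARSERS table, run_feed and
the dict-based feed are hypothetical stand-ins for the DataFeed entities and
the real parsers.

```python
def parse_rss(url):
    """Hypothetical stand-in for an RSS parser; returns (uri, attributes) pairs."""
    return [(url + "#item1", {"title": "example item"})]


def parse_html(url):
    """Hypothetical stand-in for an HTML scraper; returns (uri, attributes) pairs."""
    return [(url + "#ad1", {"title": "example ad"})]


# Table mapping the value stored in DataFeed.parser to a function.
PARSERS = {"rss": parse_rss, "html": parse_html}


def run_feed(feed):
    """Look up the parser named by the feed and run it on the feed's url."""
    try:
        parser = PARSERS[feed["parser"]]
    except KeyError:
        raise ValueError("unknown parser: %r" % feed["parser"])
    return list(parser(feed["url"]))


feed = {"url": "http://example.org/feed", "parser": "rss"}
print(run_feed(feed))
```

Storing only the parser name in the entity keeps the schema simple; an
unknown name fails loudly instead of silently skipping the feed.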
>
> The objects output by the parser all have a URI that is used as a key.
>
> For example, I parse a web site with ads and my app has in its schema
> the entity ClassifiedAd and the relation "ClassifiedAd same_as
> ExternalUri".
>
> When the update script runs, if the object output by the parser and
> identified by its URI already exists, it is updated; otherwise it is
> created.
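
The update-or-create step keyed on the URI could look like the sketch below.
A plain dict stands in for the application's database, and the upsert
function and sample URI are hypothetical; the real script would go through
the ClassifiedAd / ExternalUri entities instead.

```python
def upsert(store, uri, attributes):
    """Update the entity identified by uri, or create it if absent.

    store is a plain dict standing in for the application's database.
    Returns "updated" or "created".
    """
    if uri in store:
        store[uri].update(attributes)
        return "updated"
    store[uri] = dict(attributes)
    return "created"


store = {}
uri = "http://example.org/ads/1"  # hypothetical URI
print(upsert(store, uri, {"title": "bike", "price": 100}))
print(upsert(store, uri, {"price": 90}))
print(store[uri])
```

Running the same feed twice is then harmless: the second pass only refreshes
the attributes of entities that already exist.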
>
> Once the update script is done, my app has a local copy of the remote
> data.
>
> At the moment I have two parsers. In one case I am parsing an RSS feed
> and in the other case I am web-scraping an HTML page.
>
> Several improvements are underway:
>
> * the DataFeed entity has a refresh_time and latest_retrieval info
> that my update script does not use yet. The idea is that the script
> would only try to update the feed if (current_time > latest_retrieval
> + refresh_time).
>
> * the parser is ad hoc. For each feed I would need a new parser, whereas
> I would like to use cubicweb.xy mappings to be able to read RDF data
> easily (although I have not yet played with including RDF data).
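
The refresh condition from the first item above could be sketched like this.
It assumes refresh_time has already been converted to a timedelta (the schema
stores it as a String, and its format is not specified here), and that a feed
never retrieved has latest_retrieval set to None; needs_update is a
hypothetical name.

```python
from datetime import datetime, timedelta


def needs_update(latest_retrieval, refresh_time, now):
    """Return True if the feed is due for a new retrieval.

    latest_retrieval: datetime of the last fetch, or None if never fetched.
    refresh_time: timedelta between two fetches (assumed already parsed).
    now: current time, passed in so the check is easy to test.
    """
    if latest_retrieval is None:
        return True
    return now > latest_retrieval + refresh_time


last = datetime(2010, 2, 10, 9, 0)
hourly = timedelta(hours=1)
print(needs_update(last, hourly, datetime(2010, 2, 10, 9, 30)))
print(needs_update(last, hourly, datetime(2010, 2, 10, 10, 30)))
```
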
>
> I will be publishing these cubes before the end of this week.
>
> Any thoughts?
>
> --
> Nicolas Chauvat
>
> logilab.fr - services en informatique scientifique et gestion de
> connaissances


