[Cubicweb] experimenting with a DataFeed entity

Nicolas Chauvat nicolas.chauvat at logilab.fr
Wed Feb 10 16:22:42 CET 2010

Hi List,

Here is what I did last week as an experiment in continuous data
integration. The basic goal is that I want to have in my app some web
that is being published and continuously updated by a third party web

I defined a DataFeed entity::

  from yams.buildobjs import EntityType, String, Datetime

  class DataFeed(EntityType):
      title = String()
      url = String(required=True)
      parser = String(required=True)
      refresh_time = String()
      latest_retrieval = Datetime()

Then I have a script named update-feeds.py that I run by hand for now::

  $ cubicweb-ctl shell myapp update-feeds.py

It fetches all the DataFeeds and looks up the parser in a table of
functions defined in the same script. Then it runs the given parser on
the given url.
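
The lookup described above can be sketched roughly as follows. This is
only an illustration, not the actual update-feeds.py: the parser
functions, the PARSERS table, the run_feeds helper and the example
URLs are all hypothetical, and feeds are represented as plain
(url, parser_name) tuples instead of DataFeed entities.

```python
# Hypothetical stand-ins: the real parsers are not shown in this post.
def parse_rss(url):
    """Placeholder parser: returns one fake item keyed by its URI."""
    return [{"uri": url + "#item1", "title": "rss item"}]


def parse_html_ads(url):
    """Placeholder parser for a scraped HTML page."""
    return [{"uri": url + "#ad1", "title": "scraped ad"}]


# Table mapping the string stored in DataFeed.parser to a function.
PARSERS = {"rss": parse_rss, "html-ads": parse_html_ads}


def run_feeds(feeds):
    """Run each feed's parser on its url.

    `feeds` is a list of (url, parser_name) tuples (an assumption made
    to keep the sketch independent of CubicWeb).
    """
    items = []
    for url, parser_name in feeds:
        try:
            parser = PARSERS[parser_name]
        except KeyError:
            raise ValueError("unknown parser %r for feed %s" % (parser_name, url))
        items.extend(parser(url))
    return items
```

Keeping the parser name as a string in the entity means a new source
only needs a new function and one more entry in the table.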

The objects output by the parser all have a URI that is used as a key.

For example, I parse a web site with ads and my app has in its schema
the entity ClassifiedAd and the relation "ClassifiedAd same_as

When the update script runs, if the object output by the parser and
identified by its URI already exists, it is updated; otherwise it is
created.
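
A minimal sketch of this update-or-create step, assuming parsed
objects are plain dicts with a "uri" key and using an in-memory dict
as a stand-in for the application's database (the upsert function and
the store layout are hypothetical, not the actual script):

```python
def upsert(store, item):
    """Update the object identified by item["uri"], or create it.

    `store` maps URI -> attribute dict; it stands in for the real
    database in this sketch.
    """
    uri = item["uri"]
    if uri in store:
        # Known URI: refresh the attributes of the existing object.
        store[uri].update(item)
        return "updated"
    # Unknown URI: create a new local copy.
    store[uri] = dict(item)
    return "created"
```

Running the same feed twice then leaves a single object per URI, with
the attributes from the latest run.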

Once the update script is done, my app has a local copy of the remote

At the moment I have two parsers. In one case I am parsing an RSS feed
and in the other case I am web-scraping an HTML web page.
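
For the RSS case, a parser could look something like the sketch below.
It is not the parser I actually use: it assumes a plain RSS 2.0
document passed in as a string, takes each item's <link> as its URI,
and relies only on the standard library. The function name and the
sample document are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return one dict per <item>, keyed by the item's link as URI."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            # Without a link there is no URI to use as a key: skip it.
            continue
        title = item.findtext("title") or ""
        items.append({"uri": link.strip(), "title": title.strip()})
    return items


# Made-up sample feed, for illustration only.
SAMPLE = """<rss version="2.0"><channel><title>Ads</title>
<item><title>Bike for sale</title><link>http://example.org/ads/1</link></item>
<item><title>No link</title></item>
</channel></rss>"""
```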

Several improvements are underway:

* the DataFeed entity has a refresh_time and latest_retrieval info
that my update script does not use yet. The idea is that the script
would only try to update the feed if (current_time > latest_retrieval
+ refresh_time).

* the parser is ad-hoc. For each feed I would need a new parser, whereas
I would like to use cubicweb.xy mappings to be able to read RDF data
easily (although I have not yet played with including RDF data).
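
The refresh test from the first point could be sketched as below. The
schema stores refresh_time as a String without fixing a format, so
this sketch assumes it holds a number of seconds; the needs_refresh
name and the handling of empty values are assumptions too.

```python
from datetime import datetime, timedelta


def needs_refresh(latest_retrieval, refresh_time, now=None):
    """True if current_time > latest_retrieval + refresh_time.

    Assumes `refresh_time` is a string holding a number of seconds.
    A feed that was never retrieved, or that has no refresh_time,
    is always refreshed.
    """
    if now is None:
        now = datetime.now()
    if latest_retrieval is None or not refresh_time:
        return True
    interval = timedelta(seconds=int(refresh_time))
    return now > latest_retrieval + interval
```

The update script would then skip any feed for which this returns
False, and set latest_retrieval after a successful run.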

I will be publishing these cubes before the end of this week.

Any thoughts?

Nicolas Chauvat

logilab.fr - services en informatique scientifique et gestion de connaissances  
