[Cubicweb] Enhancing `bulk` writes in CubicWeb

Aurélien Campéas aurelien.campeas at logilab.fr
Fri May 2 15:05:26 CEST 2014


Enhancing `bulk` writes in CubicWeb
=====================================

I recently had to tackle the long-running problem of performing *fast*
database insertions *with* cubicweb.

As everyone knows, writing to a cubicweb repository using the standard
APIs (rql, orm) is quite slow, because:

* writes *always* happen one by one (with one exception), even when
  superficially looking vectorizable (e.g. "SET X foo Y WHERE ...")

* hooks are invoked for each write, and it is not possible to finely
  tune what hooks are allowed to run

Imho there should be a way to have all the niceties with better
performances. There are several ways towards this. Let's examinate
them.


Dataimport
----------

Historically, we have cubicweb.dataimport, and the dataio cube (which
expands on the dataimport API). The stated goal of this API is to
provide fast database insertions for a cubicweb instance.

API
...

Dataimport has a `Store` object, which provides the data abstration
layer allowing to push entities and relations into the repository.

Let's review the store API, provided by the base class `ObjectStore`:

* `commit` -> commit the current transaction
* `flush` -> push temporary storage to the db

* `rql` -> shortcut to session.execute

* `create_entity` -> ask an entity insertion
* `relate` -> ask a relation insertion

This is declined with the `RQLObjectStore` (which does what you
expect), the `NoHookRQLObjectStore` (which would be like the former,
except run with all hooks disabled).

Then comes the `SQLGenObjectStore`, which extends the base ObjectStore
API with:

* `create/drop_indexes`

The `dataio` cube provides a MassiveObjectStore which conforms to the
base API (?) and extends it with knobs to help squeeze out the maximum
performance. It is mostly useful when running offline.

So let's not talk too much about it.


Limitations
...........

Shouldn't this address most people's needs ? I contend that getting
performance at the detriment of correctness (no hooks, no rql, ...) is
not an interesting trade-off for all users.

We still need to use hooks (e.g. integrity hooks) and be more
efficient than baseline (bare cubicweb).



XLimport and the FlushController
---------------------------------

I developped for a customer an Excel sheet-to-database
importer/updater that initially relied on the basic cubicweb
APIs. Then the customer wanted things to go faster, but not being
incorrect (the data comes from the outside, and the excel -> schema
mapping is not completely trivial, integrity checks must be
performed. Also relation-weaving hooks cannot be escaped).

The security needs not be checked because all of this happens within
the scope of a container. So we basically check once the permission at
container level and then proceed for whatever will land inside.

API
...

Hence we have a so-called `FlushController` object (I'd be happy with
a better name, too ...) somewhat similar to the `Store` object, with
respect to its role.

The API is as such:

* `insert_entities` (etype, entitiesdicts, postprocessentity)
* `insert_relations` (rtype, fromtoentities)
* `run_deferred_hooks` (errorslist)

There is also an upcoming `delete_entities` ...

Implementation tricks:

* whole eid range reservation (really in cw 3.19),
* executemany (available at the python dbapi cursor level),
* a whole new `hooks runner`.


insert_entities
...............

* set up the standard cubicweb metadata

* prepare cubicweb meta-data tables (entities, `is` relation,
  `is_instance_of` relation)

* create an entity and prefill `cw_attr_cache` (for hooks use, later)

* add it to the session cache (for hooks use, later)

* provide default values for unvalued attributes

* call `before_add_entity` and `before_add_relation` hooks (for
  entities and their inlined relations)

* call "insertmany" (entity attributes and metadata tables)

* call postprocessentity callbacks on entities

* call `after_add_entity` and `after_add_relation` hooks for entities
  and their inlined relations


insert relations
................


* call `before_add_relation` hooks

* call "insertmany" (eid from/to in the relation table)

* update the entities related cache (mandatory to run the hooks
  afterwards without having to sprinkle the code with
  `cw_clear_caches` calls and touching the db for things we just
  inserted)

* call the `after_add_relation` hooks


The CubicWeb hooks runner
.........................


Rebuying some debt
~~~~~~~~~~~~~~~~~~

The first problem with cubicweb/server/hook.py is that it is barely
possible to understand the code. A serious refactoring is overdue.

We certainly want to separate entity hooks from relation hooks, to
avoid parts of the above confusion (hooks being undertyped for no
reason). Also we would give their entry points a more meaningful
signature::

 class EntityHook(BaseHook):

     def __call__(self, event, entity):
         pass

 class RelationHook(BaseHook):

     def __call__(self, event, eidfrom, rtype, eidto):
         pass

To sum up: we want the code to be properly TYPED and LEXICALLY SCOPED.

Notation-wise, we might even want something simpler::

 @entityhook(__select__ = is_instance('Elephant') & ~king(),
             events = ('after_add_entity', 'after_update_entity'))
 def babar_validation(session, event, entity):
     pass

That would be equivalent to::

  class BabarValidation(EntityHook):
  __regid__ = 'babar_validation'
  __select__ = EntityHook.__select__ & is_instance('Elephant') & ~king()
  events = ('after_add_entity', 'after_update_entity')

  def __call__(self, event, entity):
      pass


Fact is, it does not make a lot of sense for hooks to compete ... For
one __regid__, we might want that only one hook exists, and not play
the _select_best game.


Hooks can be deactivated only by 'category'. However, categories are
not tags (one hook can belong to just one category), and we often need
to deactivate down to the __regid__.



Going forward
~~~~~~~~~~~~~

The new hooks runner for faster insertions has the following API:

* `HooksRunner` (logger, session, disabled_regids,
                 deferred_entity_hooks, deferred_relation_hooks)

* `call_etype_hooks` (event, etype, entities, inlinedrtypes)
* `call_rtype_hooks` (event, rtype, relations)

It will prune hooks whenever possible for the duration of the
`session` (read: transaction), not run the disabled regids, not run
the deferred regids.

For the deferred regids, only entities and relations are
collected. These objects can be used later to make the relevant hooks:

* run in a subsequent transaction (makes sense for notifications of
  FTI), or

* run, but in a different form (e.g. some of the hooks/op code is
  folded into a single method)


Other considerations
--------------------

* `SET X foo Y WHERE ....` is executed entity by entity.

This used to be unavoidable because of the multi-source architecture.
I think we could fold this in the following phases:

* read: 'Any X WHERE ...' + WRITE security interpolated => 1
  query/rset

* run the `before_update` hooks

* one db query to WRITE things

* run the `after_update` hooks





More information about the Cubicweb mailing list