Luke

Reputation: 1377

How to improve performance of a script operating on large amount of data?

My machine learning script produces a lot of data (millions of BTrees contained in one root BTree) and stores it in ZODB's FileStorage, mainly because all of it wouldn't fit in RAM. The script also frequently modifies previously added data.

When I increased the complexity of the problem, and thus the amount of data that needs to be stored, I noticed performance issues: the script now computes, on average, two to ten times slower (the only thing that changed is the amount of data to be stored and later retrieved for modification).

I tried setting cache_size to various values between 1000 and 50000. To be honest, the differences in speed were negligible.

I thought of switching to RelStorage, but unfortunately the docs only mention how to configure it with frameworks such as Zope or Plone. I'm using ZODB on its own.

I wonder if RelStorage would be faster in my case.

Here's how I currently set up the ZODB connection:

import ZODB
connection = ZODB.connection('zodb.fs', ...)
dbroot = connection.root()
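
For completeness, here is roughly the long-hand equivalent I use when experimenting with cache_size (a sketch; the value shown is just one of those I tried):

from ZODB import DB
from ZODB.FileStorage import FileStorage

storage = FileStorage('zodb.fs')
db = DB(storage, cache_size=50000)  # max number of objects kept in each connection's cache
connection = db.open()
dbroot = connection.root()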

It's clear to me that ZODB is currently the bottleneck of my script. I'm looking for advice on how I could solve this problem.

I chose ZODB because I thought that a NoSQL database would better fit my case, and I liked the idea of an interface similar to Python's dict.


Code and data structures:


After 550000 games len(dbroot.actions_values) is 6018450.


According to iotop, IO operations take 90% of the time.

Upvotes: 1

Views: 547

Answers (2)

Matt Hamilton

Reputation: 815

Just to be clear here, which BTree class are you actually using? An OOBTree?

A few points about those BTrees:

1) Each BTree is composed of a number of Buckets. Each Bucket holds a certain number of items before being split. I can't remember how many items they currently hold, but I did once try tweaking the C code and recompiling to hold a larger number, as the value was chosen nearly two decades ago.

2) It is sometimes possible to construct very unbalanced BTrees. E.g. if you add values in sorted order (such as a timestamp that only ever increases), you will end up with a tree that is O(n) to search. There was a script written by the folks at Jarn a number of years ago that could rebalance the BTrees in Zope's Catalog, which might be adaptable for you.

3) Rather than an OOBTree, you can use an OOBucket instead. This ends up being just a single pickle in the ZODB, so it may end up too big for your use case, but if you are doing all the writes in a single transaction then it may be faster (at the expense of having to re-write the entire Bucket on every update).
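
A minimal sketch of the difference (assuming the BTrees package that ships with ZODB; the key/value pairs are placeholders):

from BTrees.OOBTree import OOBTree, OOBucket

# An OOBTree spreads its items across many Buckets, each stored as a
# separate persistent record in the ZODB.
tree = OOBTree()
tree[('state', 'action')] = 0.5

# An OOBucket is one persistent record: loaded and written as a whole,
# so every update rewrites the entire pickle.
bucket = OOBucket()
bucket[('state', 'action')] = 0.5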

-Matt

Upvotes: 1

Mikko Ohtamaa

Reputation: 83488

Using any (other) database would probably not help, as they are subject to the same disk IO and memory limitations as ZODB. If you manage to offload computations to the database engine itself (e.g. PostgreSQL with SQL scripts) it might help, as the database engine would have more information to make intelligent choices about how to execute the code, but there is nothing magical here, and the same things can most likely be done with ZODB quite easily.

Some ideas of what can be done:

  • Have indexes of the data instead of loading full objects (the equivalent of an SQL "full table scan"). Keep intelligently preprocessed copies of the data: indexes, sums, partials.

  • Make the objects themselves smaller (Python classes have the __slots__ trick)

  • Use transactions in an intelligent fashion. Don't try to process all the data in a single big chunk (see the sketch after this list).

  • Parallel processing - use all CPU cores instead of a single-threaded approach

  • Don't use BTrees - maybe there is something more efficient for your use case
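
For the transaction point, a minimal sketch of what I mean (game_results and dbroot.actions_values are stand-ins for your own names, connection is the one from your setup code, and the batch size is arbitrary):

import transaction

BATCH = 10000
for i, result in enumerate(game_results):
    dbroot.actions_values[result.key] = result.value
    if (i + 1) % BATCH == 0:
        # Commit periodically so modified objects are written out and can be
        # evicted from the connection cache instead of piling up in RAM.
        transaction.commit()
        connection.cacheMinimize()
transaction.commit()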

Having some code samples from your script, actual RAM and Data.fs sizes, etc. would help to give further ideas.

Upvotes: 2
