Luke

Reputation: 1377

How to improve performance of a script operating on large amount of data?

My machine learning script produces a lot of data (millions of BTrees contained in one root BTree) and stores it in ZODB's FileStorage, mainly because all of it wouldn't fit in RAM. The script also frequently modifies previously added data.

When I increased the complexity of the problem, and thus the amount of data that needs to be stored, I noticed performance issues: the script now computes, on average, two to ten times slower (the only thing that changed is the amount of data to be stored and later retrieved for modification).

I tried setting cache_size to various values between 1000 and 50000. To be honest, the differences in speed were negligible.

I thought of switching to RelStorage, but unfortunately the docs only mention how to configure it with frameworks such as Zope or Plone. I'm using ZODB on its own.

I wonder if RelStorage would be faster in my case.

Here's how I currently set up the ZODB connection:

import ZODB
connection = ZODB.connection('zodb.fs', ...)
dbroot = connection.root()
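
For completeness, here is roughly the long-hand equivalent I use when experimenting with cache_size (a sketch; the value shown is just one of those I tried):

from ZODB import DB
from ZODB.FileStorage import FileStorage

storage = FileStorage('zodb.fs')
db = DB(storage, cache_size=50000)  # max number of objects kept in each connection's cache
connection = db.open()
dbroot = connection.root()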

It's clear to me that ZODB is currently the bottleneck of my script. I'm looking for advice on how I could solve this problem.

I chose ZODB because I thought that a NoSQL database would better fit my case, and I liked the idea of an interface similar to Python's dict.


Code and data structures:


After 550000 games len(dbroot.actions_values) is 6018450.


According to iotop, IO operations take 90% of the time.

Upvotes: 1

Views: 547

Answers (2)

Matt Hamilton

Reputation: 815

Just to be clear here, which BTree class are you actually using? An OOBTree?

A few points about those BTrees:

1) Each BTree is composed of a number of Buckets. Each Bucket holds a certain number of items before being split. I can't remember how many items they currently hold, but I did once try tweaking the C code and recompiling to hold a larger number, as the value was chosen nearly two decades ago.

2) It is sometimes possible to construct very unbalanced BTrees. E.g. if you add values in sorted order (such as a timestamp that only ever increases), you will end up with a tree that is O(n) to search. There was a script written by the folks at Jarn a number of years ago that could rebalance the BTrees in Zope's Catalog, which might be adaptable for you.

3) Rather than an OOBTree, you can use an OOBucket instead. This ends up being just a single pickle in the ZODB, so it may end up too big for your use case, but if you are doing all the writes in a single transaction then it may be faster (at the expense of having to re-write the entire Bucket on every update).
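
A minimal sketch of the difference (assuming the BTrees package that ships with ZODB; the key/value pairs are placeholders):

from BTrees.OOBTree import OOBTree, OOBucket

# An OOBTree spreads its items across many Buckets, each stored as a
# separate persistent record in the ZODB.
tree = OOBTree()
tree[('state', 'action')] = 0.5

# An OOBucket is one persistent record: loaded and written as a whole,
# so every update rewrites the entire pickle.
bucket = OOBucket()
bucket[('state', 'action')] = 0.5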

-Matt

Upvotes: 1

Mikko Ohtamaa

Reputation: 83488

Using any (other) database would probably not help, as they are subject to the same disk IO and memory limitations as ZODB. If you manage to offload computations to the database engine itself (e.g. PostgreSQL with SQL scripts) it might help, as the database engine would have more information to make intelligent choices about how to execute the code, but there is nothing magical here, and the same things can most likely be done with ZODB quite easily.

Some ideas of what can be done:

  • Have indexes of the data instead of loading full objects (the equivalent of an SQL "full table scan"). Keep intelligently preprocessed copies of the data: indexes, sums, partials.

  • Make the objects themselves smaller (Python classes have the __slots__ trick)

  • Use transactions in an intelligent fashion. Don't try to process all the data in a single big chunk (see the sketch after this list).

  • Parallel processing - use all CPU cores instead of a single-threaded approach

  • Don't use BTrees - maybe there is something more efficient for your use case
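
For the transaction point, a minimal sketch of what I mean (game_results and dbroot.actions_values are stand-ins for your own names, connection is the one from your setup code, and the batch size is arbitrary):

import transaction

BATCH = 10000
for i, result in enumerate(game_results):
    dbroot.actions_values[result.key] = result.value
    if (i + 1) % BATCH == 0:
        # Commit periodically so modified objects are written out and can be
        # evicted from the connection cache instead of piling up in RAM.
        transaction.commit()
        connection.cacheMinimize()
transaction.commit()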

Having some code samples from your script, actual RAM and Data.fs sizes, etc. would help to give further ideas.

Upvotes: 2
