piotre10
piotre10

Reputation: 71

How to store large amount of key value pairs efficently?

I'm writing a ML project where I need to acquire huge number of positions and their final outcomes (game of checkers). Position is represented as tuple of 32 integers between 0 and 4 (included) f. e:

pos = (0, 4, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 2, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0)

for each position I want to store:

result = white_wins - black_wins
num_games # total number of games

Because total number of positions is enourmus (I'm not even sure if it is managable). Certainly It is bounded from above by sth arround 2.3 x 10^22 and in reality total number of diffrent (possible to occure in game) positions is a lot smaller maybe even sth order of magnitude of 10^19 or less. At one site I found that there is 5 x 10^20 possible games (move orders) that game can take which would indicate that number of unique positions is drastically smaller, but I'm not sure if I used same set of rules etc.

It's worth noting that firstly game gets played and then based on move record each position is added in a loop.

From here I tried to different aproaches:

1. Storing data in python dict as:

dict = {pos1: [result1, num_games1], ...}

And adding each position with:

def add_pos(dict, pos, who_won): # who_won = 1 if white wins and -1 if black 0 if draw
    if pos in dict.keys():
        dict[pos][0] += who_won
        dict[pos][1] += 1
    else:
        dict[pos] = [who_won, 1]

Then storing it with pytorch to .pth file with:

torch.save(dict, f'saves/pos_dict.pth', pickle_protocol=HIGHEST_PROTOCOL)

2. Storing it in sqlite3 database

Created with:

self.cur.execute(""" CREATE TABLE IF NOT EXISTS pos_table (
                            pos TEXT PRIMARY KEY ,
                            result INTEGER NOT NULL,
                            num_games INTEGER NOT NULL,
                            eval REAL NOT NULL );""")

Adding position:

def add_pos(cur, pos, who_won):
    pos_str = str(pos)
    cur.execute("SELECT rowid FROM pos_table WHERE pos=?;",(temp,))
    row = self.cur.fetchone()

    if row is None:
        tuple_to_insert = (pos_str, who_won, 1, who_won)
        cur.execute("INSERT INTO pos_table VALUES (?, ?, ?, ?);", tuple_to_insert)
    else:
        cur.execute("""UPDATE pos_table 
                       SET result=result+(?),
                           num_games = num_games+1,
                           eval = result/CAST(num_games AS REAL)
                           WHERE pos LIKE ?;""", (who_won, pos_str))

Main problems:

in both cases I would gladly get rid off "is position already in db check" or somehow generalize it for f.e. merging to dictionaries but I wasn't able to come up with any alghorithm that would be faster.

Problem with first aproach is that although it's reasonably fast it eats up a lot of RAM, also saving and loading it to file gets painfully slow as number of unique positions gets around 5 000 000.

Only problem I find with SQL is that it gets really slow really fast (around 500 000 - 700 000 unique positions) which makes it borderline useless at 1 000 000 or more rows

Fixes:

Easy and probably necessary fix is to pick some features of position f.e. number of pawns, queens etc. and therefore reduce number of distinguishable positions.

I heard about shelves although rejected it as unlikely to solve my problem (did not tested it)

Therefore I have few questions:

  1. Is there way to improve data-storage system in order to increase number of unique positions that can be stored?

  2. Approximetly how many distiguishable positions (after reducing to set of features) I would be able to handle efficently? - it's very important that finding value associated with key does not take a lot of time

  3. Is there way to improve "inserting alghorithm"?

Upvotes: 0

Views: 1077

Answers (1)

Shawn
Shawn

Reputation: 52589

For SQLite, some tips to improve performance:

  1. Since your primary key isn't an INTEGER, make it a WITHOUT ROWID table. This will reduce the number of lookups that have to be done in the database to retrieve the values associated with the key from 2 (First in an index to find the corresponding rowid, then in the table itself) to 1 (Directly looking up in the table).
self.cur.execute(""" CREATE TABLE IF NOT EXISTS pos_table (
                            pos TEXT PRIMARY KEY NOT NULL,
                            result INTEGER NOT NULL,
                            num_games INTEGER NOT NULL,
                            eval REAL NOT NULL ) WITHOUT ROWID;""")
  1. Use UPSERT syntax to need only a single query to insert or update.
def add_pos(cur, pos, who_won):
    pos_str = str(pos)
    cur.execute("""INSERT INTO pos_table(pos, result, num_games, eval)
                   VALUES (?1, ?2, 1, ?2)
                   ON CONFLICT(pos) DO UPDATE
                   SET result = result + ?2,
                       num_games = num_games + 1,
                       eval = result/CAST(num_games AS REAL);""", (pos_str, who_won))

You can also store the position in a more compact form than a stringified tuple to save space, by using 3 bits per element in an array of bytes (A total of 12 bytes for 32 elements). The bitstruct module makes this easy.

Change the type of the pos column to BLOB, and something like

import bitstruct
# ...
# Pack the positions into bytes
pos_str = bitstruct.pack("u3" * len(pos), *pos)
# Unpack back into a tuple
pos_tuple = bitstruct.unpack("u3" * 32, pos_str)

A non-sqlite alternative that you might explore is the shelve module, which provides persistent storage of key-value pairs:

import shelve
shelfdb=shelve.open("positions")

def add_pos(shelf, pos, who_won)
    pos_str=str(pos)
    if pos_str in shelf:
        tmp = shelf[pos_str]
        tmp[0] += who_won
        tmp[1] += 1
        shelf[pos_str] = tmp
    else:
        shelf[pos_str] = [who_won, 1]

# Close before exiting!
shelfdb.close()

Upvotes: 1

Related Questions