BlueTurtle

Reputation: 383

Converting large text/pgn files to JSON in Spark

I need to convert PGN files to JSON so that I can use Spark to turn them into Spark DataFrames and eventually build a graph. I have written a Python script that parses them into a DataFrame using Pandas, but it is just too slow: around 56 minutes for 170k games (my original guesstimate was 30 minutes, but after profiling I estimate 56). I also tried this repo: https://github.com/JonathanCauchi/PGN-to-JSON-Parser, which gave me the JSON file but took 69 minutes for the 170k games.

I can change the .pgn extension to .txt and it seems to work exactly the same, so I assume there is more tooling support for .txt-to-JSON conversion, but I am not sure.

I think Spark will be quicker than "normal" Python, but I am lost as to how to do the conversion. There is a sample game below, followed by a sketch of what I am imagining. There are 2 billion games in total, though, so neither of my current approaches works: it would take nearly 2 years with the PGN-to-JSON parser. Ideally I would go straight from .txt to Spark DataFrames and skip the JSON altogether.

[Event "Rated Classical game"]
[Site "https://lichess.org/j1dkb5dw"]
[White "BFG9k"]
[Black "mamalak"]
[Result "1-0"]
[UTCDate "2012.12.31"]
[UTCTime "23:01:03"]
[WhiteElo "1639"]
[BlackElo "1403"]
[WhiteRatingDiff "+5"]
[BlackRatingDiff "-8"]
[ECO "C00"]
[Opening "French Defense: Normal Variation"]
[TimeControl "600+8"]
[Termination "Normal"]

1. e4 e6 2. d4 b6 3. a3 Bb7 4. Nc3 Nh6 5. Bxh6 gxh6 6. Be2 Qg5 7. Bg4 h5 8. Nf3 Qg6 9. Nh4 Qg5 10. Bxh5 Qxh4 11. Qf3 Kd8 12. Qxf7 Nc6 13. Qe8# 1-0
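For the Spark side, this is roughly what I am imagining (untested, and the file path is a placeholder; I don't know whether this is even the right approach). Hadoop's TextInputFormat accepts a custom record delimiter, so each [Event tag could start a new record:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# One record per game: split the input at every "[Event " tag
games = sc.newAPIHadoopFile(
    "lichess.pgn",  # placeholder path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "[Event "},
).map(lambda kv: kv[1]).filter(lambda game: game.strip())  # keep game text, drop the empty first split

Each record could then be parsed into a Row and passed to spark.createDataFrame, if this works at all.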

Edit: Added a profile for 20k games.

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   26.828   26.828  395.848  395.848 /Users/danieljones/Documents – Daniel’s iMac/GitHub/ST446Project/ParsePGN.py:11(parse_pgn)
    20000    0.798    0.000  289.203    0.014 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/frame.py:7614(append)
    20000    0.098    0.000  199.489    0.010 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/reshape/concat.py:70(concat)
    20000    0.480    0.000  126.548    0.006 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/reshape/concat.py:295(__init__)
   100002    0.212    0.000  122.178    0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5199(_protect_consolidate)
    80002    0.076    0.000  122.177    0.002 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5210(_consolidate_inplace)
    40000    0.079    0.000  122.063    0.003 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5218(_consolidate)
    80002    0.170    0.000  121.830    0.002 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5213(f)
   100001    0.223    0.000   99.829    0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:986(_consolidate_inplace)
    59999    0.451    0.000   96.718    0.002 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:1898(_consolidate)
    80002    0.138    0.000   96.599    0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:970(consolidate)
    79999   52.602    0.001   91.913    0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:1915(_merge_blocks)
    20000    7.432    0.000   79.741    0.004 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/chess/pgn.py:1323(read_game)
    20000    0.361    0.000   72.843    0.004 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/reshape/concat.py:456(get_result)

I'm not sure if "cumtime" is the best column to sort by, but it seems the append step takes up a lot of the time.

Here is my script:

import chess.pgn
import pandas as pd


def parse_pgn(pgn):
    games = []
    i = 0
    edges_df = pd.DataFrame(columns=["Event", "Round", "WhitePlayer", "BlackPlayer",
                                     "Result", "BlackElo", "Opening", "TimeControl",
                                     "Date", "Time", "WhiteElo"])

    while i < 20000:
        game = chess.pgn.read_game(pgn)

        if game is not None:
            Event = game.headers["Event"]
            Round = game.headers["Round"]
            White_player = game.headers["White"]
            Black_player = game.headers["Black"]
            # Replace the raw result with the winner's name, or "Draw"
            Result = game.headers["Result"]
            if Result == "1-0":
                Result = White_player
            elif Result == "1/2-1/2":  # PGN marks draws as 1/2-1/2, not 0-0
                Result = "Draw"
            else:
                Result = Black_player
            BlackELO = game.headers["BlackElo"]
            Opening = game.headers["Opening"]
            TimeControl = game.headers["TimeControl"]
            UTCDate = game.headers["UTCDate"]
            UTCTime = game.headers["UTCTime"]
            WhiteELO = game.headers["WhiteElo"]
            # Row-by-row append is the step the profile shows to be slow
            edges_df = edges_df.append({"Event": Event,
                                        "Round": Round,
                                        "WhitePlayer": White_player,
                                        "BlackPlayer": Black_player,
                                        "Result": Result,
                                        "BlackElo": BlackELO,
                                        "Opening": Opening,
                                        "TimeControl": TimeControl,
                                        "Date": UTCDate,
                                        "Time": UTCTime,
                                        "WhiteElo": WhiteELO,
                                        }, ignore_index=True)
            games.append(game)  # keep the parsed game objects as well
            i += 1
        else:
            break  # end of file

    return edges_df

Edit 2: Changed the append method to a dictionary; 20k games now take 78 seconds. A lot of the remaining time seems to come from the chess package, for things like checking for legal moves and reading the board layout. None of this matters for my end goal, so I am wondering if I can move away from this package and split the file into separate games myself, maybe at [Event, as this is the start of each distinct game (rough sketch below).
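Something like this is what I have in mind (untested; parse_pgn_text and the regex are only illustrative, and it naively reads the whole file into memory):

import re

# Matches PGN header lines such as: [Event "Rated Classical game"]
HEADER_RE = re.compile(r'\[(\w+) "([^"]*)"\]')

def parse_pgn_text(path):
    games = []
    with open(path) as f:
        text = f.read()
    # Every game starts with an [Event ...] tag, so split the file there
    for chunk in text.split('[Event ')[1:]:
        # Re-attach the tag we split on, then collect all header key/value pairs
        headers = dict(HEADER_RE.findall('[Event ' + chunk))
        games.append(headers)
    return games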

Upvotes: 1

Views: 474

Answers (1)

Daweo

Reputation: 36550

Do not .append to a pandas.DataFrame in a loop if you want a short run time; you can read more about why here. Store your dicts in an iterable first and then create the pandas.DataFrame from it. I would use collections.deque (from the built-in collections module) for that, as it is designed to support fast appends. Let's compare the two approaches:

import collections

import pandas as pd


def func1():
    # append to the DataFrame inside the loop (slow)
    df = pd.DataFrame(columns=['x', 'y', 'z'])
    for i in range(1000):
        df = df.append({'x': i, 'y': i * 10, 'z': i * 100}, ignore_index=True)
    return df


def func2():
    # collect the dicts first, then build the DataFrame once (fast)
    store = collections.deque()
    for i in range(1000):
        store.append({'x': i, 'y': i * 10, 'z': i * 100})
    df = pd.DataFrame(store, columns=['x', 'y', 'z'])
    return df

These functions produce equal pandas.DataFrames. I compared them using the built-in timeit module in the following way

import timeit

print(timeit.timeit('func1()', number=10, globals={'func1': func1}))
print(timeit.timeit('func2()', number=10, globals={'func2': func2}))

and got the following outcome

18.370871236000312
0.02604325699940091

That is more than 500 times faster. Naturally your mileage may vary, but I suggest giving this optimization a try.
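Applied to your parse_pgn, that would mean collecting the header dicts in a deque inside the loop and building edges_df once at the end. A minimal sketch based on your posted code (the Result mapping and column renaming are omitted):

import collections

import chess.pgn
import pandas as pd

def parse_pgn(pgn, limit=20000):
    store = collections.deque()
    for _ in range(limit):
        game = chess.pgn.read_game(pgn)
        if game is None:
            break  # end of file
        store.append(dict(game.headers))  # cheap append, no DataFrame work yet
    return pd.DataFrame(store)  # build the DataFrame once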

Upvotes: 1
