Reputation: 383
I need to convert PGN files to JSON so that I can then use Spark to convert them to Spark DataFrames and eventually create a graph. I wrote a Python script that parses them into a DataFrame using pandas, but it is just too slow: around 56 minutes for 170k games (I originally guesstimated 30 minutes, but after profiling I estimate 56). I also tried this repo: https://github.com/JonathanCauchi/PGN-to-JSON-Parser, which gave me the JSON file but took 69 minutes for the 170k games.
I can change the PGN extension to .txt and it seems to work exactly the same, so I assume there is more support for .txt to JSON, but I am not sure.
I think Spark will be quicker than "normal" Python, but I am lost on how to do the conversion. There is a sample below. There are 2 billion games in total, though, so neither of my current approaches works: with the PGN-to-JSON-Parser it would take nearly 2 years. Going straight from .txt to Spark DataFrames and skipping the JSON altogether would be ideal; a rough sketch of what I imagine is after the sample.
[Event "Rated Classical game"]
[Site "https://lichess.org/j1dkb5dw"]
[White "BFG9k"]
[Black "mamalak"]
[Result "1-0"]
[UTCDate "2012.12.31"]
[UTCTime "23:01:03"]
[WhiteElo "1639"]
[BlackElo "1403"]
[WhiteRatingDiff "+5"]
[BlackRatingDiff "-8"]
[ECO "C00"]
[Opening "French Defense: Normal Variation"]
[TimeControl "600+8"]
[Termination "Normal"]
1. e4 e6 2. d4 b6 3. a3 Bb7 4. Nc3 Nh6 5. Bxh6 gxh6 6. Be2 Qg5 7. Bg4 h5 8. Nf3 Qg6 9. Nh4 Qg5 10. Bxh5 Qxh4 11. Qf3 Kd8 12. Qxf7 Nc6 13. Qe8# 1-0
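For what it is worth, here is a rough, untested sketch of the kind of thing I am imagining: read the raw text straight into Spark with a custom record delimiter so that each record is one game, then pull the header tags out with a regex. The newAPIHadoopFile / textinputformat.record.delimiter combination, the file path, and the field list are my assumptions, not something I have working.

import re

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Assumption: every game starts with [Event ", so using it as the record
# delimiter should give one game per record (minus the consumed delimiter).
games_rdd = sc.newAPIHadoopFile(
    "lichess.pgn",  # hypothetical path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": '[Event "'},
).map(lambda kv: kv[1])  # keep only the text, not the byte offset

HEADER_RE = re.compile(r'\[(\w+) "([^"]*)"\]')
FIELDS = ["Event", "White", "Black", "Result", "WhiteElo", "BlackElo",
          "Opening", "TimeControl", "UTCDate", "UTCTime"]

def to_row(game_text):
    # re-prepend the delimiter so the Event tag matches the regex too
    headers = dict(HEADER_RE.findall('[Event "' + game_text))
    return Row(**{f: headers.get(f) for f in FIELDS})

# drop the empty record before the first delimiter, then build the DataFrame
df = games_rdd.filter(lambda g: g.strip()).map(to_row).toDF()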
Edit: added a profile run for 20k games.
ncalls tottime percall cumtime percall filename:lineno(function)
1 26.828 26.828 395.848 395.848 /Users/danieljones/Documents – Daniel’s iMac/GitHub/ST446Project/ParsePGN.py:11(parse_pgn)
20000 0.798 0.000 289.203 0.014 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/frame.py:7614(append)
20000 0.098 0.000 199.489 0.010 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/reshape/concat.py:70(concat)
20000 0.480 0.000 126.548 0.006 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/reshape/concat.py:295(__init__)
100002 0.212 0.000 122.178 0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5199(_protect_consolidate)
80002 0.076 0.000 122.177 0.002 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5210(_consolidate_inplace)
40000 0.079 0.000 122.063 0.003 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5218(_consolidate)
80002 0.170 0.000 121.830 0.002 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5213(f)
100001 0.223 0.000 99.829 0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:986(_consolidate_inplace)
59999 0.451 0.000 96.718 0.002 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:1898(_consolidate)
80002 0.138 0.000 96.599 0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:970(consolidate)
79999 52.602 0.001 91.913 0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:1915(_merge_blocks)
20000 7.432 0.000 79.741 0.004 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/chess/pgn.py:1323(read_game)
20000 0.361 0.000 72.843 0.004 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/reshape/concat.py:456(get_result)
I'm not sure if "cumtime" is the best column to sort by, but it seems the append step takes up a lot of the time.
Here is my script:
import chess.pgn
import pandas as pd

def parse_pgn(pgn):
    games = []
    i = 0
    edges_df = pd.DataFrame(columns=["Event", "Round", "WhitePlayer", "BlackPlayer", "Result", "BlackElo",
                                     "Opening", "TimeControl", "Date", "Time", "WhiteElo"])
    while i < 20000:
        first_game = chess.pgn.read_game(pgn)
        if first_game is not None:
            Event = first_game.headers["Event"]
            Round = first_game.headers["Round"]
            White_player = first_game.headers["White"]
            Black_player = first_game.headers["Black"]
            # replace the result string with the winner's name (or "Draw")
            Result = first_game.headers["Result"]
            if Result == "1-0":
                Result = White_player
            elif Result == "1/2-1/2":  # draws are recorded as 1/2-1/2, not 0-0
                Result = "Draw"
            else:
                Result = Black_player
            BlackELO = first_game.headers["BlackElo"]
            Opening = first_game.headers["Opening"]
            TimeControl = first_game.headers["TimeControl"]
            UTCDate = first_game.headers["UTCDate"]
            UTCTime = first_game.headers["UTCTime"]
            WhiteELO = first_game.headers["WhiteElo"]
            edges_df = edges_df.append({"Event": Event,
                                        "Round": Round,
                                        "WhitePlayer": White_player,
                                        "BlackPlayer": Black_player,
                                        "Result": Result,
                                        "BlackElo": BlackELO,
                                        "Opening": Opening,
                                        "TimeControl": TimeControl,
                                        "Date": UTCDate,
                                        "Time": UTCTime,
                                        "WhiteElo": WhiteELO,  # key must match the column name
                                        }, ignore_index=True)
            games.append(first_game)
            i += 1
        else:
            break  # end of file; otherwise this loops forever
    return edges_df
Edit 2: Changed the append method to a dictionary; 20k games now take 78 seconds. A lot of the remaining time seems to be spent in the chess package, for example checking for legal moves and reading the board layout. None of this matters for my end goal, so I am wondering if I can move away from that package and split the file into separate games myself, perhaps on [Event, since that marks the start of each distinct game. A rough sketch of what I mean is below.
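Something along these lines is what I have in mind: skip the chess package entirely, split the raw text on [Event, and regex the headers out. The function name is mine and I have not benchmarked it; reading the whole file at once obviously won't scale to the full dataset.

import re

HEADER_RE = re.compile(r'\[(\w+) "([^"]*)"\]')

def split_games(path):
    """Yield one dict of header tags per game, without the chess package."""
    with open(path) as f:
        raw = f.read()  # fine for a benchmark; 2 billion games would need streaming
    # every game starts with [Event, so splitting on it separates the games
    for chunk in raw.split('[Event ')[1:]:
        # re-prepend the split token so the Event tag is captured too
        yield dict(HEADER_RE.findall('[Event ' + chunk))

pd.DataFrame(list(split_games("lichess.pgn"))) would then build the frame in one go.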
Upvotes: 1
Views: 474
Reputation: 36550
Do not .append to a pandas.DataFrame in a loop if you want a short run time; you might read more about this here. Instead, first store your dicts in an iterable and then create the pandas.DataFrame from it. I would use collections.deque (from the built-in collections module) for that, as it is designed to support high-speed .append. Let's compare these different ways:
import collections
import pandas as pd

def func1():
    df = pd.DataFrame(columns=['x', 'y', 'z'])
    for i in range(1000):
        df = df.append({'x': i, 'y': i * 10, 'z': i * 100}, ignore_index=True)
    return df

def func2():
    store = collections.deque()
    for i in range(1000):
        store.append({'x': i, 'y': i * 10, 'z': i * 100})
    df = pd.DataFrame(store, columns=['x', 'y', 'z'])
    return df
These functions produce equal pandas.DataFrames. I compared them with the built-in timeit module in the following way:
import timeit
print(timeit.timeit('func1()',number=10,globals={'func1':func1}))
print(timeit.timeit('func2()',number=10,globals={'func2':func2}))
and got the following outcome:
18.370871236000312
0.02604325699940091
That is more than 500 times faster. Naturally your mileage might vary, but I suggest giving this optimization a try.
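Applied to your parse_pgn, the same pattern would look roughly like this; a sketch that keeps your chess.pgn reading and header fields and only moves the DataFrame construction out of the loop:

import collections

import chess.pgn
import pandas as pd

def parse_pgn(pgn, limit=20000):
    store = collections.deque()
    while len(store) < limit:
        game = chess.pgn.read_game(pgn)
        if game is None:  # end of file
            break
        headers = game.headers
        store.append({"Event": headers["Event"],
                      "Round": headers["Round"],
                      "WhitePlayer": headers["White"],
                      "BlackPlayer": headers["Black"],
                      "Result": headers["Result"],
                      "BlackElo": headers["BlackElo"],
                      "Opening": headers["Opening"],
                      "TimeControl": headers["TimeControl"],
                      "Date": headers["UTCDate"],
                      "Time": headers["UTCTime"],
                      "WhiteElo": headers["WhiteElo"]})
    # build the DataFrame once, after the loop
    return pd.DataFrame(store)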
Upvotes: 1