Reputation: 66
I have a big database of several JSON files. They contain tweets, and each tweet looks like this:
{"object":"Message",
"action":"create",
"data":{"id":152376374,"body":"@PeakySellers right there with you brother. 368$","created_at":"2019-01-31T23:59:56Z",
"user":{"id":971815,"username":"mtothe5thpower",}'
}
One file has 3 million rows and its size is more than 5 GB. I use pandas to read the file, and that works well:
data2 = pd.read_table('file', sep="\n", header=None)
Now each row of the DataFrame holds one element (a tweet like the one above) as a string. To access the individual fields, I convert each element to a dictionary with the code below:
data2["dic"] = ''                              # column to hold the decoded dicts
for i, row in data2.itertuples():
    data2["dic"][i] = json.loads(data2[0][i])  # decode the JSON string in each row
While this code successfully converts each string to a dictionary, it is very slow. I think there should be a faster way for this task. Thank you in advance for any help or suggestions.
Upvotes: 1
Views: 419
Reputation: 195573
You could load the whole file at once using readlines(), join the lines into one big JSON array string, and decode it with a single call to json.loads().
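For example (a minimal sketch, assuming the path 'file' from the question and that every line is one complete JSON object, as in the sample above):

import json
import pandas as pd

# Read all lines of the newline-delimited JSON file into a list of strings
with open('file', 'r') as r:
    lines = r.readlines()

# Wrap the lines in brackets to form one JSON array and decode it in one call
records = json.loads('[' + ','.join(lines) + ']')

# Each element of `records` is now a dict, e.g. records[0]['data']['body']
data2 = pd.DataFrame({'dic': records})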
Benchmark (using a file with 100k JSON rows):
import json
import pandas as pd
from timeit import timeit

def f1():
    # original approach: decode the JSON string row by row
    data2 = pd.read_table('huge.txt', sep="\n", header=None)
    data2['dic'] = ''
    for i in range(len(data2[0])):
        data2["dic"][i] = json.loads(data2[0][i])
    return data2

def f2():
    # proposed approach: read all lines, join into one JSON array, decode once
    with open('huge.txt', 'r') as r:
        l = r.readlines()
    s = '[' + ','.join(l) + ']'
    data = json.loads(s)
    return pd.DataFrame({'dic': data})

t1 = timeit(lambda: f1(), number=1)
t2 = timeit(lambda: f2(), number=1)

print(t1)
print(t2)
Prints on my machine (AMD 2400G, Python 3.8):
102.88191884600383
0.30581573898962233
So this method is a lot faster (roughly 300x on this test).
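As a follow-up usage sketch (assuming the DataFrame returned by f2() above and the tweet structure from the question), the decoded dicts can then be accessed directly, or flattened into columns with pandas.json_normalize(), which is available as a top-level function in recent pandas versions:

df = f2()                          # one dict per row in the 'dic' column

# Access a single nested field of the first tweet
print(df['dic'][0]['data']['body'])

# Flatten the nested dicts into columns such as 'data.body' and 'data.user.username'
flat = pd.json_normalize(df['dic'].tolist())
print(flat.head())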
Upvotes: 1