Ali Ghalandari

Reputation: 66

Converting big String file to json in python using Pandas

I have a big database of several JSON files. The data are tweets, and each tweet looks like this:

{"object":"Message",
"action":"create",
"data":{"id":152376374,"body":"@PeakySellers right there with you brother. 368$","created_at":"2019-01-31T23:59:56Z",
"user":{"id":971815,"username":"mtothe5thpower",}'
}

and I have 3 million rows in one file, and its size is more than 5 GB. I use pandas to read the file and it works well: data2 = pd.read_table('file', sep="\n", header=None)
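To be concrete, the read step looks like this ('file' stands for the actual path); each row ends up as one unparsed JSON string:

import pandas as pd

# 'file' is a placeholder for the actual path; every line becomes one row in column 0
data2 = pd.read_table('file', sep="\n", header=None)

print(type(data2[0][0]))   # <class 'str'> - each row is still an unparsed JSON string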

Now I have a DataFrame where each row holds one element (a tweet like the one above), and its type is string. To use the data and access each field, I convert every element to a dictionary with the code below:

data2["dic"] = ""                              # new column to hold the parsed dictionaries
for i in range(len(data2[0])):
    data2["dic"][i] = json.loads(data2[0][i])  # parse one JSON string per row

While this code successfully converts each string to a dictionary, it is very slow. I think there should be a faster way for this task. Thank you in advance for any help or suggestions.

Upvotes: 1

Views: 419

Answers (1)

Andrej Kesely

Reputation: 195573

You could load the whole file at once using readlines(), then join the lines into one huge string and decode it via json.loads().

Benchmark (using a file with 100k JSON rows):

import json
import pandas as pd
from timeit import timeit

def f1():
    # original approach: read with pandas, then parse each row separately
    data2 = pd.read_table('huge.txt', sep="\n", header=None)

    data2['dic'] = ''

    for i in range(len(data2[0])):
        data2["dic"][i] = json.loads(data2[0][i])

    return data2

def f2():
    # read all lines at once, wrap them in [ ... ] so the whole file becomes
    # one JSON array, then parse everything with a single json.loads() call
    with open('huge.txt', 'r') as r:
        l = r.readlines()

    s = '[' + ','.join(l) + ']'

    data = json.loads(s)

    return pd.DataFrame({'dic': data})

t1 = timeit(lambda: f1(), number=1)
t2 = timeit(lambda: f2(), number=1)

print(t1)
print(t2)

Prints on my machine (AMD 2400G, Python 3.8):

102.88191884600383
0.30581573898962233

So this method seems a lot faster.
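To use it on the full file, the resulting frame can be queried directly, for example (assuming the one-object-per-line layout shown in the question):

df = f2()                      # parse the whole file with the join-and-loads approach
first = df['dic'][0]           # each cell is now a plain Python dict
print(first['data']['body'])   # e.g. access the tweet text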

Upvotes: 1
