How to read a large .jl (JSON Lines) file in Python

Question

I'm trying to read the following dataset and turn it into a pandas dataframe:
https://www.kaggle.com/marlesson/meli-data-challenge-2020

The file has one record per line, in the following format:

{'event_info': '...', 'event_timestamp': '...', 'event_type': '...'}
{'event_info': '...', 'event_timestamp': '...', 'event_type': '...'}
{'event_info': '...', 'event_timestamp': '...', 'event_type': '...'}

I've been trying the following, but it takes too long (more than 60 minutes):

import numpy as np
import pandas as pd
import fileinput
import json

%%time

df = pd.DataFrame()
with fileinput.input(files='/kaggle/input/meli-data-challenge-2020/train_dataset.jl') as file:
    for line in file:
        conv = json.loads(line)
        df = df.append(conv, ignore_index=True)
df.head()

This code reads the file line by line, parses each line from a JSON string into a dict, and then appends that dict to the dataframe as a new row.

Is there a faster way to load this dataset into a pandas dataframe?

Answers (1)
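
The loop in the question is probably slow because df.append returns a new copy of the whole dataframe on every call, so the total work grows roughly quadratically with the number of lines (DataFrame.append was also removed in pandas 2.0). A minimal sketch of one common alternative is to collect the parsed lines in a plain list and build the dataframe once at the end. The helper name read_jsonl and the sample values below are hypothetical, and the sketch assumes each line is valid JSON:

```python
import io
import json

import pandas as pd


def read_jsonl(source):
    """Parse one JSON object per line and build the DataFrame in a single step.

    `source` is any iterable of text lines, e.g. an open file handle.
    (Hypothetical helper, not part of pandas.)
    """
    records = []
    for line in source:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    # Constructing the frame once avoids re-copying it for every row.
    return pd.DataFrame(records)


# Tiny in-memory stand-in for the real file; the values are made up.
sample = io.StringIO(
    '{"event_info": 1, "event_timestamp": "t0", "event_type": "view"}\n'
    '{"event_info": "shoes", "event_timestamp": "t1", "event_type": "search"}\n'
)
df = read_jsonl(sample)
print(df.shape)
```

For the real dataset, the same function could be called on an open file handle, for example with open(path) as f: df = read_jsonl(f). pandas also ships a built-in reader for this format, pd.read_json(path, lines=True), and its chunksize argument returns an iterator of smaller dataframes if the whole file does not fit in memory comfortably.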