CIGEEK

Reputation: 41

Python: load huge CSV to PostgreSQL

Here is the program I am developing in Python:

Step 1 - We get a JSON file (size could be in GBs, e.g. 50 GB or more) from the source onto our server.

Step 2 - I load the JSON into a Pandas DataFrame: `df = pd.read_json(jsonfile)`

Step 3 - I write the DataFrame to a temporary CSV: `df.to_csv(temp_csvfile, ...)`

Step 4 - I use psycopg2 to open the PostgreSQL connection and cursor: `curr = conn.cursor()`

Step 5 - I read the CSV and load it using copy_from (a condensed sketch of these steps follows below):

    with open(temp_csvfile, 'r') as f:
        curr.copy_from(f, ...)
    conn.commit()
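For completeness, here is a condensed sketch of the steps above; the file names, table name, and connection string are placeholders:

```python
# Condensed sketch of steps 2-5; file names, table name, and DSN are placeholders.
import pandas as pd
import psycopg2

jsonfile = 'input.json'        # placeholder path
temp_csvfile = 'temp.csv'      # placeholder path

# Step 2: load the whole JSON file into a DataFrame (everything in memory)
df = pd.read_json(jsonfile)

# Step 3: dump the DataFrame to a temporary CSV without index or header,
# so the column order matches the target table
df.to_csv(temp_csvfile, index=False, header=False)

# Steps 4-5: connect and bulk-load the CSV with COPY
conn = psycopg2.connect("dbname=mydb user=myuser")   # placeholder DSN
curr = conn.cursor()
with open(temp_csvfile, 'r') as f:
    curr.copy_from(f, 'my_table', sep=',')            # placeholder table name
conn.commit()
conn.close()
```

As written, step 2 materialises the entire file in memory, which is what my questions below are about.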

I seek feedback on the points below:

a. Will loading the JSON into a Pandas DataFrame this way cause an out-of-memory issue if my system memory is smaller than the JSON file?

b. At step 5 I am again opening a file in read mode. Will the same issue arise here, since it might load the whole file into memory (am I missing anything)?

c. Is there a better way of doing this?

d. Can Dask be used here, since it supports reading data in chunks (I am not familiar with it)?

Please advise.

Upvotes: 0

Views: 2164

Answers (1)

Anand Sowmithiran

Reputation: 2920

You could split your input JSON file into many smaller files, and use the chunksize parameter while reading the file content into a pandas DataFrame. psycopg2's copy_from function also supports a buffer size parameter. Alternatively, you could use execute_batch() to insert batches of rows into your PostgreSQL table, as described in the article mentioned in the references below.
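For illustration, a minimal sketch of the chunked approach, assuming the input is newline-delimited JSON (`lines=True`); the file name, chunk size, table name, and connection string are placeholders:

```python
# Minimal sketch: stream the JSON in chunks and COPY each chunk into PostgreSQL.
# Assumes newline-delimited JSON; file name, table name, and DSN are placeholders.
import io

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")   # placeholder DSN
curr = conn.cursor()

# chunksize returns an iterator of DataFrames instead of one huge frame
reader = pd.read_json('input.json', lines=True, chunksize=100_000)

for chunk in reader:
    # Write each chunk to an in-memory CSV buffer and COPY it in,
    # so only one chunk is held in memory at a time.
    buf = io.StringIO()
    chunk.to_csv(buf, index=False, header=False)
    buf.seek(0)
    curr.copy_from(buf, 'my_table', sep=',')           # placeholder table name

conn.commit()
conn.close()
```

psycopg2.extras.execute_batch() with an INSERT statement, as in the linked article, is an alternative; COPY is usually faster for large bulk loads, while execute_batch avoids the intermediate CSV buffer.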

References:

  1. Loading a 20 GB JSON file in pandas
  2. Loading DataFrame data into a PostgreSQL table (article)
  3. Reading a large JSON file into pandas

Upvotes: 2
