Reputation: 41
Here is the program I am developing in Python:
Step 1 - We receive a JSON file (the size could be in the tens of GBs, e.g. 50 GB or more) from the source onto our server.
Step 2 - I load the JSON into a Pandas DataFrame using
df = pd.read_json(jsonfile)
Step 3 - I write the DataFrame out to a temporary CSV using df.to_csv(temp_csvfile, ..)
Step 4 - I use Python psycopg2 to open a PostgreSQL connection and cursor:
curr = conn.cursor()
Step 5 - I read the CSV and load it using copy_from:
with open(temp_csvfile, 'r') as f:
    curr.copy_from(f, ..)
conn.commit()
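For reference, here is a condensed sketch of the pipeline described above; the file paths, connection string, table name and separator are placeholders, not the actual values used.

```python
import pandas as pd
import psycopg2

jsonfile = "input.json"           # placeholder path
temp_csvfile = "temp_output.csv"  # placeholder path

# Step 2: load the whole JSON file into a DataFrame (everything in memory)
df = pd.read_json(jsonfile)

# Step 3: write the DataFrame to a temporary CSV without index or header
df.to_csv(temp_csvfile, index=False, header=False)

# Step 4: open the PostgreSQL connection and cursor (placeholder DSN)
conn = psycopg2.connect("dbname=mydb user=myuser")
curr = conn.cursor()

# Step 5: stream the CSV into the target table with COPY
with open(temp_csvfile, 'r') as f:
    curr.copy_from(f, 'my_table', sep=',')
conn.commit()
```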
I am seeking feedback on the following points:
a. Will this way of loading the JSON into a Pandas DataFrame cause an out-of-memory issue if my system memory is smaller than the JSON file?
b. In step 5 I open the file in read mode again; will the same issue arise there, since it might load the file into memory (am I missing anything here)?
c. Is there a better way of doing this?
d. Can Dask be used here, since it provides reading data in chunks? (I am not familiar with it.)
Please advise.
Upvotes: 0
Views: 2164
Reputation: 2920
You could split your input JSON file into many smaller files, and also use the chunksize parameter while reading the file contents into a pandas DataFrame. In addition, the psycopg2 copy_from function supports a buffer size parameter.
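A minimal sketch of that approach, assuming the input is (or can be converted to) line-delimited JSON so that pandas can stream it with chunksize; the connection string, table name, chunk size and buffer size below are placeholders:

```python
import io
import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
curr = conn.cursor()

# chunksize requires line-delimited JSON (lines=True); each chunk is a DataFrame
for chunk in pd.read_json("input.jsonl", lines=True, chunksize=100_000):
    buf = io.StringIO()
    chunk.to_csv(buf, index=False, header=False)
    buf.seek(0)
    # copy_from reads the buffer `size` bytes at a time rather than all at once
    curr.copy_from(buf, 'my_table', sep=',', size=8192)

conn.commit()
curr.close()
conn.close()
```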
In fact, you could use execute_batch() to insert batches of rows into your PostgreSQL table, as in the article mentioned in the references below.
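A sketch of the execute_batch() variant, again with placeholder connection details, table and column names:

```python
import pandas as pd
import psycopg2
from psycopg2.extras import execute_batch

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
curr = conn.cursor()

# placeholder table and columns
insert_sql = "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)"

for chunk in pd.read_json("input.jsonl", lines=True, chunksize=100_000):
    rows = list(chunk.itertuples(index=False, name=None))
    # execute_batch sends the rows in pages of page_size statements per round trip
    execute_batch(curr, insert_sql, rows, page_size=1000)

conn.commit()
```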
References :
Upvotes: 2