Reputation: 685
I want to append about 700 million rows and 2 columns to a database, using the code below:
import pandas as pd
from sqlalchemy import create_engine

disk_engine = create_engine('sqlite:///screen-user.db')
chunksize = 1000000
j = 0
for df in pd.read_csv('C:/Users/xxx/Desktop/jjj.tsv', chunksize=chunksize,
                      header=None, names=['screen', 'user'],
                      sep='\t', encoding='utf-8'):
    df.to_sql('data', disk_engine, if_exists='append')
    j += 1
    print(j * chunksize)  # rows written so far
It is taking a really long time (I estimate it would take days). Is there a more efficient way to do this? In R, I have been using the data.table package to load large data sets and it takes only about a minute. Is there a similar package in Python? As a tangential point, I also want to physically store this file on my Desktop. Right now, I am assuming 'data' is being stored as a temporary file. How would I do this?
Also, assuming I load the data into a database, I want the queries to execute in a minute or less. Here is some pseudocode of what I want to do using Python + SQL:
# load data (600 million rows * 2 columns) into a database
# def count(screens):
#     return the number of distinct users for a given set of screens
Essentially, I am returning the number of screens for a given set of users. Is the data too big for this task? I also want to merge this table with another table. Is there a reason why the fread function in R is much faster?
Upvotes: 1
Views: 886
Reputation: 210832
As @John Zwinck has already said, you should use the RDBMS's native tools for loading such an amount of data.
First of all, I think SQLite is not the right tool/DB for 700 million rows, especially if you want to join/merge this data afterwards.
Depending on what kind of processing you want to do with your data after loading, I would either use MySQL (it's free) or, if you can afford a cluster, Apache Spark SQL, and parallelize the processing of your data across multiple cluster nodes.
For loading your data into a MySQL DB, you can and should use the native LOAD DATA tool.
Here is a great article showing how to optimize the data load process for MySQL (covering different MySQL versions, MySQL options, and MySQL storage engines: MyISAM and InnoDB, etc.).
Conclusion: use native DB tools for loading large amounts of CSV/TSV data efficiently instead of pandas, especially if your data doesn't fit into memory and you want to process (join/merge/filter/etc.) your data after loading.
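A minimal sketch of such a load, assuming a table `data(screen, user)` has already been created and reusing the TSV path from your question (table and column names are illustrative):

```sql
-- Assumes the table already exists; names and path are illustrative.
LOAD DATA LOCAL INFILE 'C:/Users/xxx/Desktop/jjj.tsv'
INTO TABLE data
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(screen, user);
```

Note that LOCAL requires `local_infile` to be enabled on both the client and the server.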
Upvotes: 0
Reputation: 249133
If your goal is to import the data from your TSV file into SQLite, you should try the native import functionality in SQLite itself. Just open the sqlite3 console program and do something like this:
sqlite> .separator "\t"
sqlite> .import C:/Users/xxx/Desktop/jjj.tsv screen-user
Don't forget to build appropriate indexes before doing any queries.
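For example, an index on `screen` makes the count-distinct query from your question fast. Here is a minimal sketch using Python's stdlib `sqlite3` module on a tiny in-memory database (the table name `data`, the index name, and the toy rows are all illustrative; point `connect()` at your real .db file instead):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # use 'screen-user.db' for the real file
cur = conn.cursor()
cur.execute('CREATE TABLE data (screen TEXT, user TEXT)')
cur.executemany('INSERT INTO data VALUES (?, ?)',
                [('s1', 'u1'), ('s1', 'u2'), ('s2', 'u1')])

# Build the index before querying, not before importing:
# loading into an unindexed table is faster.
cur.execute('CREATE INDEX idx_screen ON data(screen)')

# Count distinct users for a given set of screens.
screens = ('s1', 's2')
placeholders = ','.join('?' * len(screens))
n = cur.execute(f'SELECT COUNT(DISTINCT user) FROM data '
                f'WHERE screen IN ({placeholders})', screens).fetchone()[0]
print(n)  # 2 distinct users in this toy data
```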
Upvotes: 1