Reputation: 685
I want to append about 700 million rows and 2 columns to a database, using the code below:
import pandas as pd
from sqlalchemy import create_engine

disk_engine = create_engine('sqlite:///screen-user.db')
chunksize = 1000000
j = 0
for df in pd.read_csv('C:/Users/xxx/Desktop/jjj.tsv', chunksize=chunksize,
                      header=None, names=['screen', 'user'],
                      sep='\t', encoding='utf-8'):
    df.to_sql('data', disk_engine, if_exists='append')
    j += 1
    print(j * chunksize)  # rows written so far
It is taking a really long time (I estimate it would take days). Is there a more efficient way to do this? In R, I have been using the data.table package to load large data sets and it takes only about a minute. Is there a similar package in Python? As a tangential point, I also want to physically store this file on my Desktop. Right now, I am assuming 'data' is being stored as a temporary file. How would I do this?
Also, assuming I load the data into a database, I want the queries to execute in a minute or less. Here is some pseudocode of what I want to do using Python + SQL:
# load data (600 million rows * 2 columns) into a database
# def count(screens):
#     return the number of distinct users for a given set of screens
Essentially, I am returning the number of screens for a given set of users. Is the data too big for this task? I also want to merge this table with another table. Is there a reason why the fread function in R is much faster?
Upvotes: 1
Views: 886
Reputation: 210832
As @John Zwinck has already said, you should use the RDBMS's native tools for loading such an amount of data.
First of all, I think SQLite is not the right tool/DB for 700 million rows, especially if you want to join/merge this data afterwards.
Depending on what kind of processing you want to do with your data after loading, I would either use MySQL (it's free) or, if you can afford a cluster, Apache Spark SQL, and parallelize the processing of your data across multiple cluster nodes.
For loading your data into a MySQL DB, you can and should use the native LOAD DATA tool.
Here is a great article showing how to optimize the data load process for MySQL (covering different MySQL versions, MySQL options, and MySQL storage engines: MyISAM and InnoDB, etc.).
Conclusion: use native DB tools for loading large amounts of CSV/TSV data efficiently instead of pandas, especially if your data doesn't fit into memory and you want to process (join/merge/filter/etc.) your data after loading.
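A minimal sketch of such a load, assuming a table `data(screen, user)` has already been created and reusing the TSV path from your question (table and column names are illustrative):

```sql
-- Assumes the table already exists; names and path are illustrative.
LOAD DATA LOCAL INFILE 'C:/Users/xxx/Desktop/jjj.tsv'
INTO TABLE data
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(screen, user);
```

Note that LOCAL requires `local_infile` to be enabled on both the client and the server.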
Upvotes: 0
Reputation: 249133
If your goal is to import the data from your TSV file into SQLite, you should try the native import functionality in SQLite itself. Just open the sqlite3 console program and do something like this:
sqlite> .separator "\t"
sqlite> .import C:/Users/xxx/Desktop/jjj.tsv screen-user
Don't forget to build appropriate indexes before doing any queries.
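For example, an index on `screen` makes the count-distinct query from your question fast. Here is a minimal sketch using Python's stdlib `sqlite3` module on a tiny in-memory database (the table name `data`, the index name, and the toy rows are all illustrative; point `connect()` at your real .db file instead):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # use 'screen-user.db' for the real file
cur = conn.cursor()
cur.execute('CREATE TABLE data (screen TEXT, user TEXT)')
cur.executemany('INSERT INTO data VALUES (?, ?)',
                [('s1', 'u1'), ('s1', 'u2'), ('s2', 'u1')])

# Build the index before querying, not before importing:
# loading into an unindexed table is faster.
cur.execute('CREATE INDEX idx_screen ON data(screen)')

# Count distinct users for a given set of screens.
screens = ('s1', 's2')
placeholders = ','.join('?' * len(screens))
n = cur.execute(f'SELECT COUNT(DISTINCT user) FROM data '
                f'WHERE screen IN ({placeholders})', screens).fetchone()[0]
print(n)  # 2 distinct users in this toy data
```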
Upvotes: 1