Reputation: 35
I have a CSV file with 50 million records that I would like to process with pandas. When I load it into a DataFrame, my system hangs. Any thoughts would be a great help.
Upvotes: 1
Views: 424
Reputation: 2129
Read the CSV in chunks and store it in an on-disk SQLite database, then query only what you need.
import pandas as pd
from sqlalchemy import create_engine  # database connection
import datetime as dt

# Initialize a file-based SQLite database named 311_8M.db in the current directory
disk_engine = create_engine('sqlite:///311_8M.db')

start = dt.datetime.now()
chunksize = 20000
j = 0
index_start = 1

# The columns worth keeping; everything else gets dropped
columns = ['Agency', 'CreatedDate', 'ClosedDate', 'ComplaintType',
           'Descriptor', 'TimeToCompletion', 'City']

for df in pd.read_csv('big.csv', chunksize=chunksize, encoding='utf-8'):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})  # Remove spaces from column names
    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])  # Convert to datetimes
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])
    df.index += index_start

    # Remove the uninteresting columns
    df = df.drop([c for c in df.columns if c not in columns], axis=1)

    j += 1
    print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j * chunksize))

    df.to_sql('data', disk_engine, if_exists='append')  # Append this chunk to the database
    index_start = df.index[-1] + 1

df = pd.read_sql_query('SELECT * FROM data LIMIT 3', disk_engine)
You can then run whatever queries you like against the database without ever loading all 50 million rows into memory; see the example below.
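For instance, here is a minimal sketch of an aggregate query that runs entirely inside SQLite, so only the small result set comes back into pandas. It assumes the data table created by the code above and that your CSV actually has a ComplaintType column:

df = pd.read_sql_query(
    'SELECT ComplaintType, COUNT(*) AS num_complaints '
    'FROM data '
    'GROUP BY ComplaintType '
    'ORDER BY num_complaints DESC',
    disk_engine
)  # Only the aggregated rows are loaded into the DataFrame
print(df.head())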
Upvotes: 2