Reputation: 2253
I have a task: I need to find some data in a big file and append that data to another file. The file I search in has 22 million rows, and I read it in chunks using chunksize. The other file has a column with 600 user IDs, and I look up information about each of those users in the big file.
First I split the data into intervals, and then I search for information about every user in each of these chunks. I use a timer to measure how long this takes: finding the information for one user in a DataFrame of 1 million rows and writing it to a file takes about 1.7 seconds. Estimating the total running time of the program this way, I get about 6 hours (1.5 sec * 600 IDs * 22 intervals).
I want to make it faster, but I don't know any approach other than chunksize.
Here is my code:
import time
import dateutil.relativedelta
import pandas as pd
from pandas import DataFrame

el = pd.read_csv('df2.csv', iterator=True, chunksize=1000000)
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
dates1 = buys['date']
ids1 = buys['id']
for i in el:
    i['used_at'] = pd.to_datetime(i['used_at'])
    df = i.sort_values(['ID', 'used_at'])
    dates = df['used_at']
    ids = df['ID']
    urls = df['url']
    for i, (id, date, url, id1, date1) in enumerate(zip(ids, dates, urls, ids1, dates1)):
        start = time.time()
        df1 = df[(df['ID'] == ids1[i])
                 & (df['used_at'] < (dates1[i] + dateutil.relativedelta.relativedelta(days=5)).replace(hour=0, minute=0, second=0))
                 & (df['used_at'] > (dates1[i] - dateutil.relativedelta.relativedelta(months=1)).replace(day=1, hour=0, minute=0, second=0))]
        df1 = DataFrame(df1)
        if df1.empty:
            continue
        else:
            with open('3.csv', 'a') as f:
                df1.to_csv(f, header=False)
        end = time.time()
        print(end - start)
Upvotes: 0
Views: 155
Reputation: 5212
There are some issues in your code:
1) zip accepts arguments of different lengths, but it silently truncates to the shortest one, so zipping the chunk's million-row columns with the 600-row columns from buys drops almost everything (see the example right after this list).
2) dateutil.relativedelta may not be compatible with pandas Timestamp.
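For instance, in plain Python, just to show the silent truncation:

list(zip([1, 2, 3, 4], ['a', 'b']))
# [(1, 'a'), (2, 'b')]  -- the trailing elements of the longer argument are dropped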
On the second point, with pandas 0.18.1 and Python 3.5, I'm getting this:
now = pd.Timestamp.now()
now
Out[46]: Timestamp('2016-07-06 15:32:44.266720')
now + dateutil.relativedelta.relativedelta(day=5)
Out[47]: Timestamp('2016-07-05 15:32:44.266720')
(Note that relativedelta(day=5), with the singular day, sets the day of the month rather than adding days, so mistakes like this are easy to make and hard to spot.)
So it's better to use pd.Timedelta:
now + pd.Timedelta(5, 'D')
Out[48]: Timestamp('2016-07-11 15:32:44.266720')
But it's somewhat inaccurate for months:
now - pd.Timedelta(1, 'M')
Out[49]: Timestamp('2016-06-06 05:03:38.266720')
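If you need calendar-accurate month arithmetic, pd.DateOffset is an alternative (this line is my suggestion, not part of the code above, so verify it on your version):
now - pd.DateOffset(months=1)
Out[50]: Timestamp('2016-06-06 15:32:44.266720')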
This is a sketch of the code. I didn't test it, and I may be wrong about what you want. The crucial part is to merge the two data frames instead of iterating row by row.
# 1) convert to datetime here
# 2) optionally, you can select only the relevant cols with e.g. usecols=['ID', 'used_at', 'url']
# 3) iterator=True is probably superfluous
el = pd.read_csv('df2.csv', chunksize=1000000, parse_dates=['used_at'])
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
# consider loading only relevant columns into buys

# compute the time intervals here (not in a loop!)
buys['date_min'] = buys['date'] - pd.Timedelta(1, unit='M')
buys['date_max'] = buys['date'] + pd.Timedelta(5, unit='D')
# now replace (probably it needs to be done row by row)
buys['date_min'] = buys['date_min'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
buys['date_max'] = buys['date_max'].apply(lambda x: x.replace(hour=0, minute=0, second=0))

# not necessary
# dates1 = buys['date']
# ids1 = buys['id']

for chunk in el:
    # already converted to datetime by parse_dates above
    # chunk['used_at'] = pd.to_datetime(chunk['used_at'])
    # defer sorting until later
    # df = chunk.sort_values(['ID', 'used_at'])

    # merge!
    # (how='inner' keeps only rows whose id appears in both data frames; it's the default)
    merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')
    bool_idx = (merged['used_at'] < merged['date_max']) & (merged['used_at'] > merged['date_min'])
    selected = merged.loc[bool_idx]
    # you probably don't need the additional columns from buys,
    # so either drop them or select the ones from chunk (beware of possible duplicate names)
    selected = selected[chunk.columns]
    # sort now (a possibly much smaller frame)
    selected = selected.sort_values(['ID', 'used_at'])
    if selected.empty:
        continue
    with open('3.csv', 'a') as f:
        selected.to_csv(f, header=False)
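As an aside, the two row-by-row .apply(... .replace(...)) calls can be vectorized. A small sketch (my own variant, not tested against pandas 0.18, so double-check it on your version):

# truncate date_max to midnight
buys['date_max'] = buys['date_max'].dt.normalize()
# snap date_min to the first day of its month via numpy's month resolution
buys['date_min'] = pd.to_datetime(buys['date_min'].values.astype('datetime64[M]'))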
Hope this helps. Please double-check the code and adjust it to your needs. Also, take a look at the docs to understand the options of merge.
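To get a quick feel for what the merge-then-filter does, here is a toy, self-contained example (made-up values, column names as in the sketch above):

import pandas as pd

chunk = pd.DataFrame({'ID': [1, 1, 2],
                      'used_at': pd.to_datetime(['2016-06-10', '2016-07-20', '2016-06-15']),
                      'url': ['a.com', 'b.com', 'c.com']})
buys = pd.DataFrame({'id': [1],
                     'date_min': pd.to_datetime(['2016-06-01']),
                     'date_max': pd.to_datetime(['2016-07-11'])})

# the inner merge keeps only ID == 1; the mask then keeps the June visit only
merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')
mask = (merged['used_at'] < merged['date_max']) & (merged['used_at'] > merged['date_min'])
print(merged.loc[mask, chunk.columns])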
Upvotes: 1