NineWasps

Reputation: 2253

Pandas: how to make algorithm faster

I have the following task: I need to find some data in a big file and write it out to another file. The file I search in has 22 million rows, so I read it with chunksize. In the other file I have a column with 600 user ids, and I look up the information for every one of those users in the big file. First I split the data into intervals, and then I search for the information about every user in each of those chunks. I use a timer to see how long the writes take: the average time to find the information in a 1-million-row chunk and write it to a file is 1.7 seconds. Adding everything up, the whole program takes about 6 hours (1.5 sec * 600 ids * 22 intervals). I want to make it faster, but I don't know any other way besides chunksize. Here is my code:

import time

import dateutil.relativedelta
import pandas as pd
from pandas import DataFrame

el = pd.read_csv('df2.csv', iterator=True, chunksize=1000000)
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
dates1 = buys['date']
ids1 = buys['id']
for i in el:
    i['used_at'] = pd.to_datetime(i['used_at'])
    df = i.sort_values(['ID', 'used_at'])
    dates = df['used_at']
    ids = df['ID']
    urls = df['url']
    for i, (id, date, url, id1, date1) in enumerate(zip(ids, dates, urls, ids1, dates1)):
        start = time.time()
        df1 = df[(df['ID'] == ids1[i]) & (df['used_at'] < (dates1[i] + dateutil.relativedelta.relativedelta(days=5)).replace(hour=0, minute=0, second=0)) & (df['used_at'] > (dates1[i] - dateutil.relativedelta.relativedelta(months=1)).replace(day=1, hour=0, minute=0, second=0))]
        df1 = DataFrame(df1)
        if df1.empty:
            continue
        else:
            with open('3.csv', 'a') as f:
                df1.to_csv(f, header=False)
                end = time.time()
                print(end - start)

Upvotes: 0

Views: 155

Answers (1)

ptrj

Reputation: 5212

There are some issues in your code:

  1. zip stops at its shortest argument; here the chunk's columns (up to 1,000,000 rows) are paired with the buys columns (600 rows), so everything past the 600th row of the chunk is silently ignored, which is probably not what you intend (see the short demo after this list).

  2. dateutil.relativedelta may not be compatible with pandas Timestamp. With pandas 0.18.1 and python 3.5, I'm getting this:

    now = pd.Timestamp.now()
    now
    Out[46]: Timestamp('2016-07-06 15:32:44.266720')
    now + dateutil.relativedelta.relativedelta(day=5)
    Out[47]: Timestamp('2016-07-05 15:32:44.266720')
    

    So it's better to use pd.Timedelta

    now + pd.Timedelta(5, 'D')
    Out[48]: Timestamp('2016-07-11 15:32:44.266720')
    

    But it's somewhat inaccurate for months:

    now - pd.Timedelta(1, 'M')
    Out[49]: Timestamp('2016-06-06 05:03:38.266720')
    
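To illustrate the first point with made-up lists (nothing to do with your real data): zip pairs elements only up to the length of its shortest argument and silently drops the rest.

chunk_ids = [1, 2, 3, 4, 5]             # stand-in for ~1,000,000 values from a chunk
buyer_ids = [10, 20]                    # stand-in for the 600 ids from buys
print(list(zip(chunk_ids, buyer_ids)))  # [(1, 10), (2, 20)] -- the remaining chunk values are never seen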

Here is a sketch of the code. I didn't test it, and I may be wrong about what you want. The crucial part is to merge the two data frames instead of iterating row by row.

# 1) convert to datetime here 
# 2) optionally, you can select only relevant cols with e.g. usecols=['ID', 'used_at', 'url']
# 3) iterator is prob. superfluous
el = pd.read_csv('df2.csv', chunksize=1000000, parse_dates=['used_at'])

buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
# consider loading only relevant columns to buys

# compute time intervals here (not in a loop!)
buys['date_min'] = buys['date'] - pd.Timedelta(1, unit='M')
buys['date_max'] = buys['date'] + pd.Timedelta(5, unit='D')

# now replace (probably it needs to be done row by row)
buys['date_min'] = buys['date_min'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
buys['date_max'] = buys['date_max'].apply(lambda x: x.replace(hour=0, minute=0, second=0))

# not necessary
# dates1 = buys['date']
# ids1 = buys['id']

for chunk in el:
    # already converted to datetime
    # i['used_at'] = pd.to_datetime(i['used_at'])

    # defer sorting until later
    # df = i.sort_values(['ID', 'used_at'])

    # merge!
    # (option how='inner' selects only rows that have the same id in both data frames; it's the default)
    merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')
    bool_idx = (merged['used_at'] < merged['date_max']) & (merged['used_at'] > merged['date_min'])
    selected = merged.loc[bool_idx]

    # probably don't need additional columns from buys, 
    # so either drop them or select the ones from chunk (beware of possible duplicates in names)
    selected = selected[chunk.columns]

    # sort now (possibly a smaller frame)
    selected = selected.sort_values(['ID', 'used_at'])

    if selected.empty:
        continue
    with open('3.csv', 'a') as f:
        selected.to_csv(f, header=False)

Hope this helps. Please double-check the code and adjust it to your needs.

Please take a look at the merge docs to understand its options.
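As a quick, made-up illustration of the how= option (the frames below are invented and have nothing to do with your data):

import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3], 'url': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'date': pd.to_datetime(['2016-07-01'] * 3)})

# how='inner' (the default) keeps only ids present in both frames
print(pd.merge(left, right, left_on='ID', right_on='id', how='inner'))

# how='left' keeps every row of left and fills the missing right-hand columns with NaN/NaT
print(pd.merge(left, right, left_on='ID', right_on='id', how='left'))

how='inner' being the default is why the sketch above keeps only the rows whose ID appears in buys.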

Upvotes: 1
