NineWasps

Reputation: 2253

Pandas: how to make algorithm faster

I have the following task: I need to find some data in a big file and write it out to another file. The file I search in has 22 million rows, so I read it with chunksize. In the other file I have a column with 600 user ids, and I look up the information for every one of those users in the big file. First I split the data into intervals, and then I search for the information about every user in each of those chunks. I use a timer to see how long the writes take: the average time to find the information in a 1-million-row chunk and write it to a file is 1.7 seconds. Adding everything up, the whole program takes about 6 hours (1.5 sec * 600 ids * 22 intervals). I want to make it faster, but I don't know any other way besides chunksize. Here is my code:

import time

import dateutil.relativedelta
import pandas as pd
from pandas import DataFrame

el = pd.read_csv('df2.csv', iterator=True, chunksize=1000000)
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
dates1 = buys['date']
ids1 = buys['id']
for i in el:
    i['used_at'] = pd.to_datetime(i['used_at'])
    df = i.sort_values(['ID', 'used_at'])
    dates = df['used_at']
    ids = df['ID']
    urls = df['url']
    for i, (id, date, url, id1, date1) in enumerate(zip(ids, dates, urls, ids1, dates1)):
        start = time.time()
        df1 = df[(df['ID'] == ids1[i]) & (df['used_at'] < (dates1[i] + dateutil.relativedelta.relativedelta(days=5)).replace(hour=0, minute=0, second=0)) & (df['used_at'] > (dates1[i] - dateutil.relativedelta.relativedelta(months=1)).replace(day=1, hour=0, minute=0, second=0))]
        df1 = DataFrame(df1)
        if df1.empty:
            continue
        else:
            with open('3.csv', 'a') as f:
                df1.to_csv(f, header=False)
                end = time.time()
                print(end - start)

Upvotes: 0

Views: 155

Answers (1)

ptrj

Reputation: 5212

There are some issues in your code:

  1. zip stops at its shortest argument; here the chunk's columns (up to 1,000,000 rows) are paired with the buys columns (600 rows), so everything past the 600th row of the chunk is silently ignored, which is probably not what you intend (see the short demo after this list).

  2. dateutil.relativedelta may not be compatible with pandas Timestamp. With pandas 0.18.1 and python 3.5, I'm getting this:

    now = pd.Timestamp.now()
    now
    Out[46]: Timestamp('2016-07-06 15:32:44.266720')
    now + dateutil.relativedelta.relativedelta(day=5)
    Out[47]: Timestamp('2016-07-05 15:32:44.266720')
    

    So it's better to use pd.Timedelta

    now + pd.Timedelta(5, 'D')
    Out[48]: Timestamp('2016-07-11 15:32:44.266720')
    

    But it's somewhat inaccurate for months:

    now - pd.Timedelta(1, 'M')
    Out[49]: Timestamp('2016-06-06 05:03:38.266720')
    
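To illustrate the first point with made-up lists (nothing to do with your real data): zip pairs elements only up to the length of its shortest argument and silently drops the rest.

chunk_ids = [1, 2, 3, 4, 5]             # stand-in for ~1,000,000 values from a chunk
buyer_ids = [10, 20]                    # stand-in for the 600 ids from buys
print(list(zip(chunk_ids, buyer_ids)))  # [(1, 10), (2, 20)] -- the remaining chunk values are never seen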

Here is a sketch of the code. I didn't test it, and I may be wrong about what you want. The crucial part is to merge the two data frames instead of iterating row by row.

# 1) convert to datetime here 
# 2) optionally, you can select only relevant cols with e.g. usecols=['ID', 'used_at', 'url']
# 3) iterator is prob. superfluous
el = pd.read_csv('df2.csv', chunksize=1000000, parse_dates=['used_at'])

buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
# consider loading only relevant columns to buys

# compute time intervals here (not in a loop!)
buys['date_min'] = buys['date'] - pd.Timedelta(1, unit='M')
buys['date_max'] = buys['date'] + pd.Timedelta(5, unit='D')

# now replace (probably it needs to be done row by row)
buys['date_min'] = buys['date_min'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
buys['date_max'] = buys['date_max'].apply(lambda x: x.replace(hour=0, minute=0, second=0))

# not necessary
# dates1 = buys['date']
# ids1 = buys['id']

for chunk in el:
    # already converted to datetime
    # i['used_at'] = pd.to_datetime(i['used_at'])

    # defer sorting until later
    # df = i.sort_values(['ID', 'used_at'])

    # merge!
    # (option how='inner' selects only rows that have the same id in both data frames; it's the default)
    merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')
    bool_idx = (merged['used_at'] < merged['date_max']) & (merged['used_at'] > merged['date_min'])
    selected = merged.loc[bool_idx]

    # probably don't need additional columns from buys, 
    # so either drop them or select the ones from chunk (beware of possible duplicates in names)
    selected = selected[chunk.columns]

    # sort now (possibly a smaller frame)
    selected = selected.sort_values(['ID', 'used_at'])

    if selected.empty:
        continue
    with open('3.csv', 'a') as f:
        selected.to_csv(f, header=False)

Hope this helps. Please double-check the code and adjust it to your needs.

Please take a look at the merge docs to understand its options.
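As a quick, made-up illustration of the how= option (the frames below are invented and have nothing to do with your data):

import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3], 'url': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'date': pd.to_datetime(['2016-07-01'] * 3)})

# how='inner' (the default) keeps only ids present in both frames
print(pd.merge(left, right, left_on='ID', right_on='id', how='inner'))

# how='left' keeps every row of left and fills the missing right-hand columns with NaN/NaT
print(pd.merge(left, right, left_on='ID', right_on='id', how='left'))

how='inner' being the default is why the sketch above keeps only the rows whose ID appears in buys.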

Upvotes: 1
