Javier Lopez Tomas
Javier Lopez Tomas

Reputation: 2352

How to use pandas apply instead a list loop?

I have a dataframe like this:

df = pd.DataFrame({'market':['ES','UK','DE'],
                   'provider':['A','B','C'],
                   'item':['X','Y','Z']})

Then I have a list with the providers and the following loop:

providers_list = ['A','B','C']
for provider in providers_list:
  a = df.loc[df['provider']==provider]

That loop creates a dataframe for each provider, which later on I put into an excel. I would like to use the function apply for speed purposes. I have transformed the code like this:

providers_list = pd.DataFrame({'provider':['A','B','C']})
def report(provider):
 a = df.loc[df['provider']==provider]
providers_list.apply(report)  

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1190, in wrapper raise ValueError("Can only compare identically-labeled "

ValueError: ('Can only compare identically-labeled Series objects', 'occurred at index provider')

Thanks

Upvotes: 2

Views: 2287

Answers (2)

jpp
jpp

Reputation: 164663

The apply method is generally inefficient. It's nothing more than a glorified loop with some extra functionality. Instead, you can use GroupBy to cycle through each provider:

for prov, df_prov in df.groupby('provider'):
    df_prov.to_excel(f'{prov}.xlsx', index=False)

If you only want to include a pre-defined list of providers in your output, you can define a GroupBy object and iterate your list instead:

providers_list = ['A', 'B', 'C']
grouper = df.groupby('provider')

for prov in providers_list:
    grouper.get_group(prov).to_excel(f'{prov}.xlsx', index=False)

If you're interested in speed for your process as a whole, I strongly advise you avoid Excel: exporting to csv, csv.gz or pkl will all be much more efficient. For large datasets, it's unlikely filtering your dataframe is your bottleneck when exporting to Excel.

Upvotes: 2

Lukas Humpe
Lukas Humpe

Reputation: 418

This worked for me with a milllion entries of each provider in under a second:

import pandas as pd
from tqdm import tqdm

tqdm.pandas(desc="Progress:")


df = pd.DataFrame({'market':['ES','UK','DE']*1000000,
                   'provider':['A','B','C']*1000000,
                   'item':['X','Y','Z']*1000000})

grouped = df.groupby("provider")
providers_list = ['A','B','C']

for prov in tqdm(providers_list):
    frame_name = prov
    globals()[frame_name] = pd.DataFrame(grouped.get_group(prov))

print(A)
print(B)
print(C)

100%|██████████| 3/3 [00:00<00:00,  9.59it/s]

Upvotes: 1

Related Questions