How to use pandas apply instead of a list loop?

Question

I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({'market':['ES','UK','DE'],
                   'provider':['A','B','C'],
                   'item':['X','Y','Z']})

Then I have a list with the providers and the following loop:

providers_list = ['A','B','C']
for provider in providers_list:
  a = df.loc[df['provider']==provider]

That loop creates a dataframe for each provider, which I later write to an Excel file. I would like to use apply to speed this up, so I rewrote the code like this:

providers_list = pd.DataFrame({'provider':['A','B','C']})
def report(provider):
 a = df.loc[df['provider']==provider]
providers_list.apply(report)

But this raises an error:

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1190, in wrapper
    raise ValueError("Can only compare identically-labeled "

ValueError: ('Can only compare identically-labeled Series objects', 'occurred at index provider')

Thanks

jpp · Accepted Answer

The apply method is generally inefficient. It's nothing more than a glorified loop with some extra functionality. Instead, you can use GroupBy to cycle through each provider:

for prov, df_prov in df.groupby('provider'):
    df_prov.to_excel(f'{prov}.xlsx', index=False)
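
To see why this can replace the original loop: iterating over a GroupBy yields (key, sub-dataframe) pairs, and each sub-dataframe should hold the same rows as the boolean filter in the question. A minimal sketch using the question's sample data; the `pieces` dict is only an illustrative name and the Excel export is left out:

```python
import pandas as pd

df = pd.DataFrame({'market': ['ES', 'UK', 'DE'],
                   'provider': ['A', 'B', 'C'],
                   'item': ['X', 'Y', 'Z']})

# Iterating a GroupBy yields (group key, sub-dataframe) pairs.
pieces = {prov: df_prov for prov, df_prov in df.groupby('provider')}

# Each piece holds the same rows as the filter used in the original loop.
for prov, df_prov in pieces.items():
    filtered = df.loc[df['provider'] == prov]
    assert df_prov.equals(filtered)
```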

If you only want to include a pre-defined list of providers in your output, you can define a GroupBy object and iterate your list instead:

providers_list = ['A', 'B', 'C']
grouper = df.groupby('provider')

for prov in providers_list:
    grouper.get_group(prov).to_excel(f'{prov}.xlsx', index=False)
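
One thing to watch with this variant: get_group raises a KeyError when a requested provider has no rows in the dataframe. Below is a sketch of one way to guard against that, assuming a hypothetical provider 'D' that is absent from the data. It collects the frames in a dict (`frames` and `missing` are illustrative names) instead of writing files, so it runs without an Excel engine installed:

```python
import pandas as pd

df = pd.DataFrame({'market': ['ES', 'UK', 'DE'],
                   'provider': ['A', 'B', 'C'],
                   'item': ['X', 'Y', 'Z']})

providers_list = ['A', 'C', 'D']  # 'D' is hypothetical and not present in df
grouper = df.groupby('provider')

frames = {}    # provider -> sub-dataframe
missing = []   # providers requested but not found in the data
for prov in providers_list:
    # grouper.groups maps each group key to the row labels of that group
    if prov in grouper.groups:
        frames[prov] = grouper.get_group(prov)
    else:
        missing.append(prov)
```

Each entry in `frames` can then be exported the same way as in the loop above.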

If you're interested in speed for your process as a whole, I strongly advise you to avoid Excel: exporting to CSV, gzipped CSV (csv.gz) or pickle (pkl) will all be much more efficient. For large datasets, filtering your dataframe is unlikely to be the bottleneck when exporting to Excel.
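
As a rough sketch of those alternatives, the same groupby loop can write CSV, gzip-compressed CSV, or pickle files. The temporary output directory and the file names here are assumptions for illustration only:

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({'market': ['ES', 'UK', 'DE'],
                   'provider': ['A', 'B', 'C'],
                   'item': ['X', 'Y', 'Z']})

# Illustrative location; point this at your own output folder.
out_dir = Path(tempfile.mkdtemp())

for prov, df_prov in df.groupby('provider'):
    df_prov.to_csv(out_dir / f'{prov}.csv', index=False)
    # Compression is inferred from the .gz suffix by default.
    df_prov.to_csv(out_dir / f'{prov}.csv.gz', index=False)
    df_prov.to_pickle(out_dir / f'{prov}.pkl')

written = sorted(p.name for p in out_dir.iterdir())
```

The files can be read back with pd.read_csv and pd.read_pickle; pickle keeps the index and dtypes, while CSV is the more portable choice.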
