Reputation: 2352
I have a dataframe like this:
df = pd.DataFrame({'market':['ES','UK','DE'],
'provider':['A','B','C'],
'item':['X','Y','Z']})
Then I have a list with the providers and the following loop:
providers_list = ['A','B','C']
for provider in providers_list:
a = df.loc[df['provider']==provider]
That loop creates a dataframe for each provider, which later on I put into an excel. I would like to use the function apply for speed purposes. I have transformed the code like this:
providers_list = pd.DataFrame({'provider':['A','B','C']})
def report(provider):
a = df.loc[df['provider']==provider]
providers_list.apply(report)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1190, in wrapper raise ValueError("Can only compare identically-labeled "
ValueError: ('Can only compare identically-labeled Series objects', 'occurred at index provider')
Thanks
Upvotes: 2
Views: 2287
Reputation: 164663
The apply
method is generally inefficient. It's nothing more than a glorified loop with some extra functionality. Instead, you can use GroupBy
to cycle through each provider
:
for prov, df_prov in df.groupby('provider'):
df_prov.to_excel(f'{prov}.xlsx', index=False)
If you only want to include a pre-defined list
of providers in your output, you can define a GroupBy
object and iterate your list instead:
providers_list = ['A', 'B', 'C']
grouper = df.groupby('provider')
for prov in providers_list:
grouper.get_group(prov).to_excel(f'{prov}.xlsx', index=False)
If you're interested in speed for your process as a whole, I strongly advise you avoid Excel: exporting to csv
, csv.gz
or pkl
will all be much more efficient. For large datasets, it's unlikely filtering your dataframe is your bottleneck when exporting to Excel.
Upvotes: 2
Reputation: 418
This worked for me with a milllion entries of each provider in under a second:
import pandas as pd
from tqdm import tqdm
tqdm.pandas(desc="Progress:")
df = pd.DataFrame({'market':['ES','UK','DE']*1000000,
'provider':['A','B','C']*1000000,
'item':['X','Y','Z']*1000000})
grouped = df.groupby("provider")
providers_list = ['A','B','C']
for prov in tqdm(providers_list):
frame_name = prov
globals()[frame_name] = pd.DataFrame(grouped.get_group(prov))
print(A)
print(B)
print(C)
100%|██████████| 3/3 [00:00<00:00, 9.59it/s]
Upvotes: 1