Reputation: 775
I have a Pandas dataframe with thousands of rows. Its names column ('customer name') holds the customer names, and the other columns hold their records. I want to create an individual dataframe for each customer based on their unique names. I got the unique names into a list:
customerNames = DataFrame['customer name'].unique().tolist()
This gives the following list:
['Name1', 'Name2', 'Name3', 'Name4']
I tried a loop that takes each unique name from the list above, creates a dataframe for that name, and assigns the dataframe to the customer name. For example, when I write Name3, it should give Name3's data as a separate dataframe:
for x in customerNames:
    x = DataFrame.loc[DataFrame['customer name'] == x]
x
The lines above returned only the dataframe for Name4 and skipped the rest.
How can I solve this problem?
Upvotes: 4
Views: 36172
Reputation: 62393
To create a dataframe for each unique value in a column, create a dict of dataframes, as follows.

- This creates a dict, where each key is a unique value from the column of choice and each value is a dataframe.
- Access each dataframe as you would any value in a standard dict (e.g. df_names['Name1']).
- Iterating over .groupby() yields (k, v) pairs, which can be unpacked: k is a unique value in the column and v is the data associated with that k.

With a for-loop and .groupby:

df_names = dict()
for k, v in df.groupby('customer name'):
    df_names[k] = v

With a dict comprehension and .groupby:

df_names = {k: v for (k, v) in df.groupby('customer name')}

- .groupby is faster than .unique.
  - In one timing, .groupby is faster, at 104 ms compared to 392 ms.
  - In the other, .groupby is faster, at 147 ms compared to 1.53 s.
- A for-loop is slightly faster than a comprehension, particularly for more unique column values or lots of rows (e.g. 10M).

With a dict comprehension and .unique:

df_names = {name: df[df['customer name'] == name] for name in df['customer name'].unique()}
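As a minimal sketch of the dict-of-dataframes approach, here it is on a toy frame. Only the 'customer name' column comes from the question; the rows and the 'amount' column are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; only the 'customer name' column name is from the question.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name1', 'Name3', 'Name2'],
    'amount': [10, 20, 30, 40, 50],
})

# One dataframe per unique customer, keyed by the customer name.
df_names = {k: v for k, v in df.groupby('customer name')}

print(sorted(df_names))   # the unique customer names
print(df_names['Name1'])  # only the rows that belong to Name1
```

Each value keeps the original row index, so the sub-frames can still be matched back to the source dataframe.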
# Sample data; the second `data` assignment overwrites the first, so keep only the one you want.
import pandas as pd
import string
import random

random.seed(365)

# 6 unique values
data = {'class': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}

# 26 unique values
data = {'class': [random.choice(list(string.ascii_lowercase)) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}

df = pd.DataFrame(data)
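One way to reproduce the comparison yourself is sketched below. It is not the original benchmark: the row count is reduced to 100,000 so it runs quickly, the helper names `with_groupby` and `with_unique` are made up here, and the timings will vary by machine and pandas version.

```python
import random
import string
import timeit

import pandas as pd

random.seed(365)
n = 100_000  # assumption: fewer rows than the 1,000,000 above, to keep the run short
df = pd.DataFrame({
    'class': [random.choice(string.ascii_lowercase) for _ in range(n)],
    'treatment': [random.choice(['Yes', 'No']) for _ in range(n)],
})

def with_groupby():
    # dict of dataframes built from a single groupby pass
    return {k: v for k, v in df.groupby('class')}

def with_unique():
    # dict of dataframes built with one boolean filter per unique value
    return {name: df[df['class'] == name] for name in df['class'].unique()}

# Both approaches should build the same dict of dataframes.
a, b = with_groupby(), with_unique()
same = set(a) == set(b) and all(a[k].equals(b[k]) for k in a)
print('same result:', same)

# Total seconds for 5 runs of each; no specific numbers are claimed here.
print('groupby:', timeit.timeit(with_groupby, number=5))
print('unique: ', timeit.timeit(with_unique, number=5))
```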
Upvotes: 9
Reputation: 1
Maybe I misunderstood you, but if

for x in customerNames:
    x = DataFrame.loc[DataFrame['customer name'] == x]
x

gives you the right output only for the last list entry, it is because your output statement sits outside the indented body of the loop:
import pandas as pd

# DataFrame.from_items was removed in pandas 1.0; from_dict with orient='index' builds the same frame
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA']},
                                     orient='index', columns=['customer', 'country'])
customer_list = ['James', 'Jean']
for x in customer_list:
    x = customer_df.loc[customer_df['customer'] == x]
    print(x)
    print('now I could append the data to something new')
You get the output:

  customer country
B    James     USA
now I could append the data to something new
  customer country
A     Jean  France
now I could append the data to something new
Or, if you don't like loops, you could go with:

import pandas as pd

# DataFrame.from_items was removed in pandas 1.0; from_dict with orient='index' builds the same frame
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA'], 'C': ['Hans', 'Germany']},
                                     orient='index', columns=['customer', 'country'])
customer_list = ['James', 'Jean']
print(customer_df[customer_df['customer'].isin(customer_list)])
Output:
  customer country
A     Jean  France
B    James     USA
df.isin is better explained under: How to implement 'in' and 'not in' for Pandas dataframe
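For reference, a short sketch of both the 'in' and 'not in' forms, reusing the three-row frame from above. The variable names `mask`, `in_list` and `not_in_list` are only illustrative.

```python
import pandas as pd

customer_df = pd.DataFrame({
    'customer': ['Jean', 'James', 'Hans'],
    'country': ['France', 'USA', 'Germany'],
}, index=['A', 'B', 'C'])
customer_list = ['James', 'Jean']

# Boolean Series: True where the customer appears in customer_list
mask = customer_df['customer'].isin(customer_list)

in_list = customer_df[mask]       # 'in': rows whose customer is in the list
not_in_list = customer_df[~mask]  # 'not in': ~ inverts the boolean mask

print(in_list)
print(not_in_list)
```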
Upvotes: 0
Reputation: 1522
Your current iteration overwrites x twice every time it runs: the for loop assigns a customer name to x, and then you assign a dataframe to it.
To be able to call each dataframe later by name, try storing them in a dictionary:
df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}
df_dict['Name3']
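Here is a small, self-contained sketch of both behaviours. The data is hypothetical (the 'order' column and the rows are made up); it shows that after the question's loop x only holds the last customer's rows, while the dictionary keeps every customer.

```python
import pandas as pd

# Hypothetical data; only the 'customer name' column name is from the question.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name3', 'Name4', 'Name3'],
    'order': [1, 2, 3, 4, 5],
})
customerNames = df['customer name'].unique().tolist()

# The loop from the question: x is rebound on every pass ...
for x in customerNames:
    x = df.loc[df['customer name'] == x]
# ... so once the loop has finished, x is just the last customer's dataframe.
last_only = x
print(last_only['customer name'].unique().tolist())

# The dictionary keeps one dataframe per customer, retrievable by name.
df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}
print(df_dict['Name3'])
```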
Upvotes: 10