kdemi001
kdemi001

Reputation: 35

Python: Counting values for columns with multiple values per entry in dataframe

I have a dataframe of restaurants and one column has corresponding cuisines.
The problem is that there are restaurants with multiple cuisines in the same column [up to 8].

Let's say it's something like this:

RestaurantName City   Restaurant ID Cuisines
Restaurant A    Milan    31333         French, Spanish, Italian
Restaurant B    Shanghai 63551         Pizza, Burgers
Restaurant C    Dubai    7991          Burgers, Ice Cream

Here's a copy-able code as a sample:

rst= pd.DataFrame({'RestaurantName': ['Rest A', 'Rest B', 'Rest C'], 
                   'City': ['Milan', 'Shanghai', 'Dubai'],
                    'RestaurantID': [31333,63551,7991],
                    'Cuisines':['French, Spanish, Italian','Pizza, Burgers','Burgers, Ice Cream']})

I used string split to expand them into 8 different columns and added it to the dataframe.

csnsplit=rst.Cuisines.str.split(", ",expand=True)
rst["Cuisine1"]=csnsplit.loc[:,0]
rst["Cuisine2"]=csnsplit.loc[:,1]
rst["Cuisine3"]=csnsplit.loc[:,2]
rst["Cuisine4"]=csnsplit.loc[:,3]
rst["Cuisine5"]=csnsplit.loc[:,4]
rst["Cuisine6"]=csnsplit.loc[:,5]
rst["Cuisine7"]=csnsplit.loc[:,6]
rst["Cuisine8"]=csnsplit.loc[:,7]

Which leaves me with this: https://i.sstatic.net/AUSDY.png

Now I have no idea how to count individual cuisines since they're across up to 8 different columns, let's say if I want to see top cuisine by city.

I also tried getting dummy columns for all of them, Cuisine 1 to Cuisine 8. This is causing me to have duplicates like Cuisine1_Bakery, Cusine2_Bakery, and so on. I could hypothetically merge like ones and keeping only the one that has a count of "1," but no idea how to do that.

dummies=pd.get_dummies(data=rst,columns=["Cuisine1","Cuisine2","Cuisine3","Cuisine4","Cuisine5","Cuisine6","Cuisine7","Cuisine8"])
print(dummies.columns.tolist())

Which leaves me with all of these columns: https://i.sstatic.net/84spI.png

A third thing I tried was to get unique values from all 8 columns, and I have a deduped list of each type of cuisine. I can probably add all these columns to the dataframe, but wouldn't know how to fill the rows with a count for each one based on the column name.

AllCsn=np.concatenate((rst.Cuisine1.unique(), 
                rst.Cuisine2.unique(),
                rst.Cuisine3.unique(),
                rst.Cuisine4.unique(),
                rst.Cuisine5.unique(),
                rst.Cuisine6.unique(),
                rst.Cuisine7.unique(),
                rst.Cuisine8.unique()
               ))
AllCsn=np.unique(AllCsn.astype(str))
AllCsn

Which leaves me with this:

https://i.sstatic.net/O9OpW.png

I do want to create a model later on where I maybe have a column for each cuisine, and use the "unique" code above to get all the columns, but then I would need to figure out how to do a count based on the column header.

I am new to this, so please bear with me and let me know if I need to provide any more info.

Upvotes: 1

Views: 305

Answers (1)

Henry Ecker
Henry Ecker

Reputation: 35646

It sounds like you're looking for str.split without expanding, then explode:

rst['Cuisines'] = rst['Cuisines'].str.split(', ')
rst = rst.explode('Cuisines')

Creates a frame like:

  RestaurantName      City  RestaurantID   Cuisines
0         Rest A     Milan         31333     French
0         Rest A     Milan         31333    Spanish
0         Rest A     Milan         31333    Italian
1         Rest B  Shanghai         63551      Pizza
1         Rest B  Shanghai         63551    Burgers
2         Rest C     Dubai          7991    Burgers
2         Rest C     Dubai          7991  Ice Cream

Then it sounds like either crosstab:

pd.crosstab(rst['City'], rst['Cuisines'])
Cuisines  Burgers  French  Ice Cream  Italian  Pizza  Spanish
City                                                         
Dubai           1       0          1        0      0        0
Milan           0       1          0        1      0        1
Shanghai        1       0          0        0      1        0

Or value_counts

rst[['City', 'Cuisines']].value_counts().reset_index(name='counts')
       City   Cuisines  counts
0     Dubai    Burgers       1
1     Dubai  Ice Cream       1
2     Milan     French       1
3     Milan    Italian       1
4     Milan    Spanish       1
5  Shanghai    Burgers       1
6  Shanghai      Pizza       1

Max value_count per City via groupby head:

max_counts = (
    rst[['City', 'Cuisines']].value_counts()
        .groupby(level=0).head(1)
        .reset_index(name='counts')
)

max_counts:

       City Cuisines  counts
0     Dubai  Burgers       1
1     Milan   French       1
2  Shanghai  Burgers       1

Upvotes: 2

Related Questions