Taycir Yahmed
Taycir Yahmed

Reputation: 449

Filter DataFrame rows that are lists

Data structure:

What I am trying to do:

Filter businesses in the business_df based on their category column (which has a list structure) by categorizing a business as a restaurant if at least one of its listed categories matches at least one of the restaurant categories.

I checked these 2 questions but they did not provide an answer for my problem:

Filter dataframe rows if value in column is in a set list of values

use a list of values to select rows from a pandas dataframe

I am running this code now:

restaurant_categories_list = ['Soup','Sandwiches','Salad', 'Restaurants','Burgers', 'Breakfast & Brunch']
 print(business_df.loc[business_df['categories'].isin(restaurant_categories_list)])

And here is the column I am interested in:

0                          ['Fast Food', 'Restaurants']
1                                         ['Nightlife']
2                         ['Auto Repair', 'Automotive']
3                  ['Active Life', 'Mini Golf', 'Golf']
4     ['Shopping', 'Home Services', 'Internet Servic...
5     ['Bars', 'American (New)', 'Nightlife', 'Loung...
6     ['Active Life', 'Trainers', 'Fitness & Instruc...
7     ['Bars', 'American (Traditional)', 'Nightlife'...
8                ['Auto Repair', 'Automotive', 'Tires']
9                          ['Active Life', 'Mini Golf']
10                     ['Home Services', 'Contractors']
11                            ['Veterinarians', 'Pets']
12        ['Libraries', 'Public Services & Government']
13              ['Automotive', 'Auto Parts & Supplies']
14    ['Burgers', 'Breakfast & Brunch', 'American (T...

So if I am only working with these rows, my expected dataframe should only contain rows 0 and 14.

Upvotes: 2

Views: 936

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

UPDATE:

This version uses ast.literal_eval() in order to deserialize lists from strings and it seems to be working properly:

import ast
import pandas as pd

restaurant_categories_list=['Soup','Sandwiches','Salad', 'Restaurants','Burgers', 'Breakfast & Brunch']

df_orig = pd.read_csv('yelp_academic_dataset_business.csv', low_memory=False)

df = df_orig[(pd.notnull(df_orig['categories']))]

mask = df['categories'].apply(ast.literal_eval).apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0

print(df.ix[mask, ['categories']])
df[mask].to_csv('result.csv', index=False)

But as @CorleyBrigman already said it's very difficult and very inefficient to work with such a data structures using Pandas...

OLD ANSWER BASED ON SAMPLE DATA:

you can convert lists to columns/Series then use pd.isin() function which will produce a matrix of True/False values, which can be summed up (because in Python: False==0 and True==1):

mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(df[(mask)])

Explanation:

print(df['business'].apply(pd.Series))

                0                             1                  2      3
0       Fast Food                   Restaurants                NaN    NaN
1       Nightlife                           NaN                NaN    NaN
2     Auto Repair                    Automotive                NaN    NaN
3     Active Life                     Mini Golf               Golf    NaN
4        Shopping                 Home Services    Internet Servic    NaN
5            Bars                American (New)          Nightlife  Loung
6     Active Life                      Trainers  Fitness & Instruc    NaN
7            Bars        American (Traditional)          Nightlife    NaN
8     Auto Repair                    Automotive              Tires    NaN
9     Active Life                     Mini Golf                NaN    NaN
10  Home Services                   Contractors                NaN    NaN
11  Veterinarians                          Pets                NaN    NaN
12      Libraries  Public Services & Government                NaN    NaN
13     Automotive         Auto Parts & Supplies                NaN    NaN
14        Burgers            Breakfast & Brunch           American    NaN

then

print(df['business'].apply(pd.Series).isin(restaurant_categories_list))

output:

        0      1      2      3
0   False   True  False  False
1   False  False  False  False
2   False  False  False  False
3   False  False  False  False
4   False  False  False  False
5   False  False  False  False
6   False  False  False  False
7   False  False  False  False
8   False  False  False  False
9   False  False  False  False
10  False  False  False  False
11  False  False  False  False
12  False  False  False  False
13  False  False  False  False
14   True   True  False  False

then

mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(mask)

output:

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14     True
dtype: bool

and finally:

print(df[(mask)])

Output:

                                   business
0                  [Fast Food, Restaurants]
14  [Burgers, Breakfast & Brunch, American]

Upvotes: 1

Related Questions