Reputation: 449
Data structure:
A pandas DataFrame (business_df
) with a column (category
) of lists that I am interested in
A list containing restaurant categories (restaurant_categories_list
)
What I am trying to do:
Filter businesses in the business_df
based on their category
column (which has a list structure) by categorizing a business as a restaurant if at least one of its listed categories matches at least one of the restaurant categories.
I checked these 2 questions but they did not provide an answer for my problem:
Filter dataframe rows if value in column is in a set list of values
use a list of values to select rows from a pandas dataframe
I am running this code now:
restaurant_categories_list = ['Soup','Sandwiches','Salad', 'Restaurants','Burgers', 'Breakfast & Brunch']
print(business_df.loc[business_df['categories'].isin(restaurant_categories_list)])
And here is the column I am interested in:
0 ['Fast Food', 'Restaurants']
1 ['Nightlife']
2 ['Auto Repair', 'Automotive']
3 ['Active Life', 'Mini Golf', 'Golf']
4 ['Shopping', 'Home Services', 'Internet Servic...
5 ['Bars', 'American (New)', 'Nightlife', 'Loung...
6 ['Active Life', 'Trainers', 'Fitness & Instruc...
7 ['Bars', 'American (Traditional)', 'Nightlife'...
8 ['Auto Repair', 'Automotive', 'Tires']
9 ['Active Life', 'Mini Golf']
10 ['Home Services', 'Contractors']
11 ['Veterinarians', 'Pets']
12 ['Libraries', 'Public Services & Government']
13 ['Automotive', 'Auto Parts & Supplies']
14 ['Burgers', 'Breakfast & Brunch', 'American (T...
So if I am only working with these rows, my expected dataframe should only contain rows 0 and 14.
Upvotes: 2
Views: 936
Reputation: 210832
UPDATE:
This version uses ast.literal_eval()
in order to deserialize lists from strings and it seems to be working properly:
import ast
import pandas as pd
restaurant_categories_list=['Soup','Sandwiches','Salad', 'Restaurants','Burgers', 'Breakfast & Brunch']
df_orig = pd.read_csv('yelp_academic_dataset_business.csv', low_memory=False)
df = df_orig[(pd.notnull(df_orig['categories']))]
mask = df['categories'].apply(ast.literal_eval).apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(df.ix[mask, ['categories']])
df[mask].to_csv('result.csv', index=False)
But as @CorleyBrigman already said it's very difficult and very inefficient to work with such a data structures using Pandas...
OLD ANSWER BASED ON SAMPLE DATA:
you can convert lists to columns/Series then use pd.isin()
function which will produce a matrix of True/False values, which can be summed up (because in Python: False==0 and True==1):
mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(df[(mask)])
Explanation:
print(df['business'].apply(pd.Series))
0 1 2 3
0 Fast Food Restaurants NaN NaN
1 Nightlife NaN NaN NaN
2 Auto Repair Automotive NaN NaN
3 Active Life Mini Golf Golf NaN
4 Shopping Home Services Internet Servic NaN
5 Bars American (New) Nightlife Loung
6 Active Life Trainers Fitness & Instruc NaN
7 Bars American (Traditional) Nightlife NaN
8 Auto Repair Automotive Tires NaN
9 Active Life Mini Golf NaN NaN
10 Home Services Contractors NaN NaN
11 Veterinarians Pets NaN NaN
12 Libraries Public Services & Government NaN NaN
13 Automotive Auto Parts & Supplies NaN NaN
14 Burgers Breakfast & Brunch American NaN
then
print(df['business'].apply(pd.Series).isin(restaurant_categories_list))
output:
0 1 2 3
0 False True False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 False False False False
6 False False False False
7 False False False False
8 False False False False
9 False False False False
10 False False False False
11 False False False False
12 False False False False
13 False False False False
14 True True False False
then
mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(mask)
output:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 True
dtype: bool
and finally:
print(df[(mask)])
Output:
business
0 [Fast Food, Restaurants]
14 [Burgers, Breakfast & Brunch, American]
Upvotes: 1