Reputation: 159
I am trying to combine OR |
with df.loc
to extract data. The code I have written extracts everything in the csv file. Here is the original csv file: https://drive.google.com/open?id=16eo29mF0pn_qNw-BGpZyVM9PBxv2aN1G
import pandas as pd
df = pd.read_csv("yelp_business.csv")
df = df.loc[(df['categories'].str.contains('chinese', case = False)) | (df['name'].str.contains('subway', case = False)) | (df['categories'].str.contains('', case = False)) | (df['address'].str.contains('', case = False))]
print df
It looks like the blank quotes ''
are not working in str.contains
or the OR |
doesn't work in df.loc
. Instead of just returning rows with chinese
restaurants (which are 4171
in number) and the row with the restaurant name subway
, it returns all the 174,568
rows.
EDITED
The output I want should be all the rows of category chinese
and all the rows of name subway
while taking into consideration that the address might not have any assigned value or is null.
import pandas as pd
df = pd.read_csv("yelp_business.csv")
cusine = 'chinese'
name = 'subway'
address #address has no assigned value or is NULL
df = df.loc[(df['categories'].str.contains(cusine, case = False)) |
(df['name'].str.contains(name, case = False)) |
(df['address'].str.contains(address, case = False))]
print df
This code gives me an error NameError: name 'address' is not defined
.
Upvotes: 0
Views: 57
Reputation: 862511
I think here is possible chain conditions by |
for categories
column, for find empty string use ^""$
- it match start and end of string with quotes:
df = pd.read_csv("yelp_business.csv")
df1 = df.loc[(df['categories'].str.contains('chinese|^""$', case = False)) |
(df['name'].str.contains('subway', case = False)) |
(df['address'].str.contains('^""$', case = False))]
print (len(df1))
11320
print (df1.head())
business_id name neighborhood \
9 TGWhGNusxyMaA4kQVBNeew "Detailing Gone Mobile" NaN
53 4srfPk1s8nlm1YusyDUbjg ***"Subway" Southeast
57 spDZkD6cp0JUUm6ghIWHzA "Kitchen M" Unionville
63 r6Jw8oRCeumxu7Y1WRxT7A "D&D Cleaning" NaN
88 YhV93k9uiMdr3FlV4FHjwA "Caviness Studio" NaN
address city state postal_code latitude \
9 ***"" Henderson NV 89014 36.055825
53 "6889 S Eastern Ave, Ste 101" Las Vegas NV 89119 36.064652
57 "8515 McCowan Road" Markham ON L3P 5E5 43.867918
63 ***"" Urbana IL 61802 40.110588
88 ***"" Phoenix AZ 85001 33.449967
longitude stars review_count is_open \
9 -115.046350 5.0 7 1
53 -115.118954 2.5 6 1
57 -79.283687 3.0 80 1
63 -88.207270 5.0 4 0
88 -112.070223 5.0 4 1
categories
9 Automotive;Auto Detailing
53 Fast Food;Restaurants;Sandwiches
57 ***Restaurants;Chinese
63 Home Cleaning;Home Services;Window Washing
88 Marketing;Men's Clothing;Restaurants;Graphic D...
EDIT: If need filter out empty and NaNs values:
df2 = df.loc[(df['categories'].str.contains('chinese', case = False)) |
(df['name'].str.contains('subway', case = False)) &
~((df['address'] == '""') | (df['categories'] == '""'))]
print (df2.head())
business_id name neighborhood \
53 4srfPk1s8nlm1YusyDUbjg "Subway" Southeast
57 spDZkD6cp0JUUm6ghIWHzA "Kitchen M" Unionville
96 dTWfATVrBfKj7Vdn0qWVWg "Flavor Cuisine" Scarborough
126 WUiDaFQRZ8wKYGLvmjFjAw "China Buffet" University City
145 vzx1WdVivFsaN4QYrez2rw "Subway" NaN
address city state postal_code \
53 "6889 S Eastern Ave, Ste 101" Las Vegas NV 89119
57 "8515 McCowan Road" Markham ON L3P 5E5
96 "8 Glen Watford Drive" Toronto ON M1S 2C1
126 "8630 University Executive Park Dr" Charlotte NC 28262
145 "5111 Boulder Hwy" Las Vegas NV 89122
latitude longitude stars review_count is_open \
53 36.064652 -115.118954 2.5 6 1
57 43.867918 -79.283687 3.0 80 1
96 43.787061 -79.276166 3.0 6 1
126 35.306173 -80.752672 3.5 76 1
145 36.112895 -115.062353 3.0 3 1
categories
53 Fast Food;Restaurants;Sandwiches
57 Restaurants;Chinese
96 Restaurants;Chinese;Food Court
126 Buffets;Restaurants;Sushi Bars;Chinese
145 Sandwiches;Restaurants;Fast Food
Upvotes: 2
Reputation: 1738
Find detail information about contains at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html
Upvotes: 0