Reputation: 768
I have a dataframe that has a column of boroughs visited (among many other columns):
Index User Boroughs_visited
0 Eminem Manhattan, Bronx
1 BrSpears NaN
2 Elvis Brooklyn
3 Adele Queens, Brooklyn
I want to create a third column that shows which User visited Brooklyn, so I wrote the slowest code possible in python:
df['Brooklyn']= 0
def borough():
for index,x in enumerate(df['Boroughs_visited']):
if pd.isnull(x):
continue
elif re.search(r'\bBrooklyn\b',x):
df_vols['Brooklyn'][index]= 1
borough()
Resulting in:
Index User Boroughs_visited Brooklyn
0 Eminem Manhattan, Bronx 0
1 BrSpears NaN 0
2 Elvis Brooklyn 1
3 Adele Queens, Brooklyn 1
It took my computer 15 seconds to run this for 2000 rows. Is there a faster way of doing this?
Upvotes: 2
Views: 837
Reputation: 294258
You can get all Boroughs for the price of one
df.join(df.Boroughs_visited.str.get_dummies(sep=', '))
Index User Boroughs_visited Bronx Brooklyn Manhattan Queens
0 0 Eminem Manhattan, Bronx 1 0 1 0
1 1 BrSpears NaN 0 0 0 0
2 2 Elvis Brooklyn 0 1 0 0
3 3 Adele Queens, Brooklyn 0 1 0 1
But if you really, really just wanted Brooklyn
df.join(df.Boroughs_visited.str.get_dummies(sep=', ').Brooklyn)
Index User Boroughs_visited Brooklyn
0 0 Eminem Manhattan, Bronx 0
1 1 BrSpears NaN 0
2 2 Elvis Brooklyn 1
3 3 Adele Queens, Brooklyn 1
Upvotes: 1
Reputation: 153460
Let use .str
accessor with contains
and fillna
:
df['Brooklyn'] = (df.Boroughs_visited.str.contains('Brooklyn') * 1).fillna(0)
Or another format of the same statement:
df['Brooklyn'] = df.Boroughs_visited.str.contains('Brooklyn').mul(1, fill_value=0)
Output:
Index User Boroughs_visited Brooklyn
0 0 Eminem Manhattan, Bronx 0
1 1 BrSpears NaN None 0
2 2 Elvis Brooklyn 1
3 3 Adele Queens, Brooklyn 1
Upvotes: 2