Reputation: 139
I have a dataframe that contains a series like this:
0 CollgCr
1 Veenker
2 CollgCr
3 Crawfor
4 NoRidge
5 Mitchel
6 Somerst
7 NWAmes
8 OldTown
9 BrkSide
Based on that series, I wanted to create a new column (feature) by grouping them.
For example, if the data is CollgCr or Veenker, then the value in the new column would be "Middle"
I tried using the following code
df_full['NeighborGrp'] = "Upper"
df_full['NeighborGrp'].loc[df_full["Neighborhood"] == "CollgCr"] = "Middle"
df_full['NeighborGrp'].loc[df_full["Neighborhood"] == ["Mitchel", "OldTown", "BrkSide", "Sawyer", "NAmes", "IDOTRR",
"MeadowV", "Edwards", "NPkVill", "BrDale", "SWISU", "Blueste"]] = "Lower"
The first and second line works well, but the third row returns an error "ValueError: Arrays were different lengths"
Is there any special syntax in padas that allow me to create a new column based on multiple value condition like this?
Thanks
Upvotes: 1
Views: 480
Reputation: 863481
Use Series.map
by dictionary
with fillna
for unmatched values:
d = {'CollgCr': 'Middle',
'Veenker': 'Middle',
"Mitchel": 'Lower',
"OldTown": 'Lower',
"BrkSide": 'Lower',
"Sawyer": 'Lower',
"NAmes": 'Lower',
"IDOTRR": 'Lower',
"MeadowV": 'Lower',
"Edwards": 'Lower',
"NPkVill": 'Lower',
"BrDale": 'Lower',
"SWISU": 'Lower',
"Blueste": 'Lower'}
Or create dictionary dynamic:
Mi = ['CollgCr', 'Veenker']
Lo = ["Mitchel", "OldTown", "BrkSide", "Sawyer", "NAmes", "IDOTRR",
"MeadowV", "Edwards", "NPkVill", "BrDale", "SWISU", "Blueste"]
d = {**dict.fromkeys(Lo, 'Lower'), **dict.fromkeys(Mi, 'Middle')}
df_full['new'] = df_full['city'].map(d).fillna('Upper')
print (df_full)
city new
0 CollgCr Middle
1 Veenker Middle
2 CollgCr Middle
3 Crawfor Upper
4 NoRidge Upper
5 Mitchel Lower
6 Somerst Upper
7 NWAmes Upper
8 OldTown Lower
9 BrkSide Lower
It depends of data, but map
should be fastest:
In [25]: %timeit (jez(df_full.copy()))
15 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit (raf(df_full.copy()))
20.3 ms ± 347 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [27]: %timeit (ct(df_full.copy()))
26.9 ms ± 286 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Code for timings:
df_full = pd.DataFrame({'city': ['CollgCr', 'Veenker', 'CollgCr', 'Crawfor',
'NoRidge', 'Mitchel', 'Somerst', 'NWAmes',
'OldTown', 'BrkSide']})
#[100000 rows x 1 columns]
df_full = pd.concat([df_full] * 10000, ignore_index=True)
def jez(df_full):
d = {'CollgCr': 'Middle',
'Veenker': 'Middle',
"Mitchel": 'Lower',
"OldTown": 'Lower',
"BrkSide": 'Lower',
"Sawyer": 'Lower',
"NAmes": 'Lower',
"IDOTRR": 'Lower',
"MeadowV": 'Lower',
"Edwards": 'Lower',
"NPkVill": 'Lower',
"BrDale": 'Lower',
"SWISU": 'Lower',
"Blueste": 'Lower'}
df_full['new'] = df_full['city'].map(d).fillna('Upper')
return df_full
def raf(df):
m = ['CollgCr', 'Veenker']
l = ["Mitchel", "OldTown", "BrkSide", "Sawyer", "NAmes",
"IDOTRR","MeadowV", "Edwards", "NPkVill", "BrDale", "SWISU", "Blueste"]
df['new_col'] = np.select([df.city.isin(l), df.city.isin(m)],
['lower', 'middle'], default='upper')
return df
def ct(df):
df_types = pd.DataFrame({'CollgCr': 'Middle',
'Veenker': 'Middle',
"Mitchel": 'Lower',
"OldTown": 'Lower',
"BrkSide": 'Lower',
"Sawyer": 'Lower',
"NAmes": 'Lower',
"IDOTRR": 'Lower',
"MeadowV": 'Lower',
"Edwards": 'Lower',
"NPkVill": 'Lower',
"BrDale": 'Lower',
"SWISU": 'Lower',
"Blueste": 'Lower'}, index=['Type']).T
return df.merge(df_types, left_on='city', right_index=True, how='left').fillna('Upper')
print (jez(df_full.copy()))
print (raf(df_full.copy()))
print (ct(df_full.copy()))
Upvotes: 1
Reputation: 54380
If you have a table that can represent the relationship between City and Type, merge
would be a much more straight forward way (without having to hardcode each city in your script):
In [52]: # In reality you probably should prepare the table elsewhere and read it in as a pandas dataframe
df_types = pd.DataFrame({'CollgCr': 'Middle',
'Veenker': 'Middle',
"Mitchel": 'Lower',
"OldTown": 'Lower',
"BrkSide": 'Lower',
"Sawyer": 'Lower',
"NAmes": 'Lower',
"IDOTRR": 'Lower',
"MeadowV": 'Lower',
"Edwards": 'Lower',
"NPkVill": 'Lower',
"BrDale": 'Lower',
"SWISU": 'Lower',
"Blueste": 'Lower'}, index=['Type']).T
df = pd.DataFrame({'city': ['CollgCr', 'Veenker', 'CollgCr', 'Crawfor',
'NoRidge', 'Mitchel', 'Somerst', 'NWAmes',
'OldTown', 'BrkSide']})
df.merge(df_types, left_on='city', right_index=True, how='left').fillna('Upper')
Out[52]:
city Type
0 CollgCr Middle
1 Veenker Middle
2 CollgCr Middle
3 Crawfor Upper
4 NoRidge Upper
5 Mitchel Lower
6 Somerst Upper
7 NWAmes Upper
8 OldTown Lower
9 BrkSide Lower
Upvotes: 1
Reputation: 59284
Use np.select
and pd.Series.isin
m = ['CollgCr', 'Veenker']
l = ["Mitchel", "OldTown", "BrkSide", "Sawyer", "NAmes", "IDOTRR","MeadowV", "Edwards", "NPkVill", "BrDale", "SWISU", "Blueste"]
df['new_col'] = np.select([df.city.isin(l), df.city.isin(m)], ['lower', 'middle'], default='upper')
city new_col
0 CollgCr middle
1 Veenker middle
2 CollgCr middle
3 Crawfor upper
4 NoRidge upper
5 Mitchel lower
6 Somerst upper
7 NWAmes upper
8 OldTown lower
9 BrkSide lower
Upvotes: 3