Reputation: 397
I am trying to assign values to new columns in a DataFrame based on a condition: whether the first column contains a certain letter. If the first column contains only single letters, I can use the dummy-variable function. But what if the first column contains numbers, strings, and NaN?
Here is an example:
# Before
c1
0 a
1 2
2 b
3 c
4 ab
5 bc
6 NaN
#After
c1 a b c
0 a 1 0 0
1 2 0 0 0
2 b 0 1 0
3 c 0 0 1
4 ab 1 1 0
5 bc 0 1 1
6 NaN 0 0 0
I tried str.contains() to assign the values, but I get an error:
x['a'] = 1 if x.c1.str.contains('a') else 0
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
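The error comes from using a whole boolean Series inside a plain Python `if ... else`, which expects a single True or False. A minimal sketch of an element-wise version, assuming the example data above and using `np.where` together with the `na=False` option of `str.contains` (neither is part of the original attempt):

```python
import numpy as np
import pandas as pd

# Example data from the question (mixed strings, a number, and NaN).
x = pd.DataFrame({"c1": ["a", 2, "b", "c", "ab", "bc", np.nan]})

# na=False: entries that cannot be matched (the number and NaN) count as "no match".
mask = x["c1"].str.contains("a", na=False)

# np.where makes the 1/0 choice for every row at once.
x["a"] = np.where(mask, 1, 0)
```

The same two lines can be repeated for 'b' and 'c'.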
Upvotes: 1
Views: 5892
Reputation: 1804
df
c1
0 a
1 2
2 b
3 c
4 ab
5 bc
6 NaN
First, you can replace NaNs with a dummy character (say #), since strings are easier to handle. Then you can apply list to the whole column so that each character is separated out. After that, use explode to put each character of each row on its own row. Convert the result to a DataFrame and add a column of ones so that a pivot table can be created.
temp = df['c1'].fillna('#').apply(list).explode().to_frame().reset_index()
temp['vals'] = 1
temp
index c1 vals
0 0 a 1
1 1 2 1
2 2 b 1
3 3 c 1
4 4 a 1
5 4 b 1
6 5 b 1
7 5 c 1
8 6 # 1
Then you can create the pivot_table with c1 as columns and the column of 1s as the values. After that, retain only the columns whose names are letters. Finally, concat the temp table with the original df.
import re

temp = pd.pivot_table(temp, columns='c1', index="index", values='vals')
# Keep only the columns whose names are letters (drops '2' and '#')
cols_retain = [c for c in temp.columns if re.search(r'[A-Za-z]', c)]
pd.concat([df, temp[cols_retain].fillna(0)], axis=1)
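The steps above can be put together in one runnable sketch. Two details are assumptions added here and are not in the answer as posted: `astype(str)` is applied so that a true integer such as 2 can be split with `list`, and the flags are cast to int so they print as 0/1 rather than 0.0/1.0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": ["a", 2, "b", "c", "ab", "bc", np.nan]})

# One row per (original row, character); NaN becomes '#', the number 2 becomes '2'.
temp = (
    df["c1"].fillna("#").astype(str).apply(list).explode().to_frame().reset_index()
)
temp["vals"] = 1

# Wide table: one column per character, indexed by the original row label.
wide = pd.pivot_table(temp, columns="c1", index="index", values="vals")

# Keep only alphabetic columns and turn missing combinations into 0.
letters = [c for c in wide.columns if c.isalpha()]
flags = wide[letters].fillna(0).astype(int)

result = pd.concat([df, flags], axis=1)
```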
c1 a b c
0 a 1.0 0.0 0.0
1 2 0.0 0.0 0.0
2 b 0.0 1.0 0.0
3 c 0.0 0.0 1.0
4 ab 1.0 1.0 0.0
5 bc 0.0 1.0 1.0
6 NaN 0.0 0.0 0.0
Upvotes: 0
Reputation: 891
You can do it in several ways. One of your main problems is that your column is not a string column, so you can convert it first:
df = pd.DataFrame([{"c1": "a"}, {"c1":2}])
df["new_column"] = 0
df.loc[df["c1"].astype(str).str.contains('a'), "new_column"] = 1
or
def custom_funct(row):
    print(row)
    if "a" in str(row["c1"]):
        row["new_column"] = 1
    else:
        row["new_column"] = 0
    return row
df = pd.DataFrame([{"c1": "a"}, {"c1":2}])
df["new_column"] = None
df = df.apply(custom_funct,axis=1)
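One caveat when applying the row-wise version to data with missing values, as in the question: `str(np.nan)` is the text 'nan', which itself contains the letter 'a'. A hedged sketch of a guard follows; `flag_letter` is a hypothetical helper name, not part of the answer above.

```python
import numpy as np
import pandas as pd

def flag_letter(value, letter="a"):
    # Exclude missing values first: str(np.nan) == 'nan', which contains 'a'.
    if pd.isna(value):
        return 0
    return int(letter in str(value))

df = pd.DataFrame({"c1": ["a", 2, np.nan]})
df["new_column"] = df["c1"].map(flag_letter)
```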
Upvotes: 0
Reputation: 7519
You could do something like this:
df['a'] = df['c1'].str.contains('a').astype(int)
... but this raises a ValueError if you have any NaN values in df['c1'] (as you do in your example).
Here's an alternative using Series.apply:
df['a'] = df['c1'].apply(lambda x: int('a' in x) if isinstance(x, str) else 0)
This approach also deals with columns that are composed of multiple types: it returns 1 only when a given row is a string, in addition to having the appropriate character inside.
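To get all three columns from the question, the same check can be repeated per letter. This is a small sketch assuming the letters of interest are known in advance as 'a', 'b', and 'c':

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": ["a", 2, "b", "c", "ab", "bc", np.nan]})

for letter in ["a", "b", "c"]:
    # Bind the current letter as a default argument so each lambda uses its own value.
    df[letter] = df["c1"].apply(
        lambda x, letter=letter: int(isinstance(x, str) and letter in x)
    )
```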
Upvotes: 2
Reputation: 505
For your problem you can use the pandas.get_dummies() function, which converts a categorical variable into indicator columns:
lst = ['a', 2, 'b', 'c', 'ab', np.nan]
pd.get_dummies(lst).T
Upvotes: 1