Carlos

Reputation: 397

How to assign a value to a new column with a string condition in pandas dataframe

I am trying to assign values to new columns in a DataFrame based on whether the first column contains a certain letter. If the first column only contains single letters, I can use the dummy-variable function. But what if the first column contains numbers, multi-letter strings, and NaN?

Here is an example:

# Before
   c1
0   a
1   2
2   b
3   c
4   ab
5   bc
6   NaN

# After
    c1  a   b   c
0   a   1   0   0
1   2   0   0   0
2   b   0   1   0
3   c   0   0   1
4   ab  1   1   0
5   bc  0   1   1
6   NaN 0   0   0

I tried str.contains() to assign the values, but I get an error:

x['a'] = 1 if x.c1.str.contains('a') else 0

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Upvotes: 1

Views: 5892

Answers (4)

ggaurav

Reputation: 1804

df
    c1
0   a
1   2
2   b
3   c
4   ab
5   bc
6   NaN

First, replace the NaNs with a dummy character (say #), since that makes the strings easier to handle. Then apply list to the whole column so that each value is split into its individual characters. Next, use explode so that each character of a row ends up in its own row. Finally, convert the result to a DataFrame and add a column of ones so that a pivot table can be built from it.

import re
import pandas as pd
temp = df['c1'].fillna('#').apply(list).explode().to_frame().reset_index()
temp['vals'] = 1
temp
  index c1  vals
0   0   a   1
1   1   2   1
2   2   b   1
3   3   c   1
4   4   a   1
5   4   b   1
6   5   b   1
7   5   c   1
8   6   #   1

Then create the pivot table with c1 as the columns and the column of ones as the values. After that, keep only the columns whose labels are letters. Finally, concatenate the pivoted table with the original df.

temp = pd.pivot_table(temp, columns='c1', index="index", values='vals')
cols_retain = [c for c in temp.columns if re.search(r'[A-Za-z]', c)]
pd.concat([df, temp[cols_retain].fillna(0)], axis=1)
    c1  a   b   c
0   a   1.0 0.0 0.0
1   2   0.0 0.0 0.0
2   b   0.0 1.0 0.0
3   c   0.0 0.0 1.0
4   ab  1.0 1.0 0.0
5   bc  0.0 1.0 1.0
6   NaN 0.0 0.0 0.0
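For reference, the two steps above can be combined into one self-contained sketch. It assumes the column may hold real integers and NaN (so it casts to str before splitting, which the snippets above do not do), and it uses aggfunc="max" plus a final astype(int) to get integer 0/1 flags instead of floats. Those choices are illustrative assumptions, not part of the original approach.

```python
import re

import numpy as np
import pandas as pd

# Sample data assumed from the question; 2 is a real integer here.
df = pd.DataFrame({"c1": ["a", 2, "b", "c", "ab", "bc", np.nan]})

# Replace NaN with a placeholder, cast everything to str, then split each
# value into characters and give every character its own row.
chars = df["c1"].fillna("#").astype(str).apply(list).explode()

temp = chars.to_frame().reset_index()
temp["vals"] = 1

# One column per distinct character; absent combinations are filled with 0.
wide = pd.pivot_table(
    temp, index="index", columns="c1", values="vals", aggfunc="max", fill_value=0
)

# Keep only the columns whose label is a single ASCII letter.
letters = [c for c in wide.columns if re.fullmatch(r"[A-Za-z]", c)]
result = pd.concat([df, wide[letters].astype(int)], axis=1)
print(result)
```

A value that is an empty string would produce no characters and therefore no row in the pivot table, so such rows would need separate handling.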

Upvotes: 0

Ángel Igualada

Reputation: 891

You can do it in multiple ways. One of your main problems is that your column is not entirely made of strings, so convert it first. You can do it like this:

df = pd.DataFrame([{"c1": "a"}, {"c1":2}])
df["new_column"] = 0
# Use .loc instead of chained indexing so the assignment hits df itself
df.loc[df["c1"].astype(str).str.contains('a'), "new_column"] = 1

or

def custom_funct(row):
    # Flag the row when the string form of c1 contains "a"
    if "a" in str(row["c1"]):
        row["new_column"] = 1
    else:
        row["new_column"] = 0
    return row


df = pd.DataFrame([{"c1": "a"}, {"c1":2}])
df["new_column"] = None
df = df.apply(custom_funct,axis=1)
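One caveat with converting to str: on many pandas versions a missing value is stringified to the text "nan", which itself contains the letter "a", so a NaN row could be flagged by mistake. A minimal sketch of a guard, assuming a sample frame that includes NaN (the data and the mask variable are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data with a missing value in the last row.
df = pd.DataFrame({"c1": ["a", 2, "ab", np.nan]})

# Require the value to be present before trusting the string match, so a
# stringified NaN ("nan") can never count as containing "a".
mask = df["c1"].notna() & df["c1"].astype(str).str.contains("a", na=False)

df["new_column"] = 0
df.loc[mask, "new_column"] = 1
print(df)
```

The same check applies to the row-wise function: testing pd.notna(row["c1"]) before the "a" in str(...) test avoids the same false positive.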

Upvotes: 0

jfaccioni

Reputation: 7519

You could do something like this:

df['a'] = df['c1'].str.contains('a').astype(int)

... but this raises a ValueError if you have any NaN values in df['c1'] (as you do in your example).
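If you want to keep the str.contains form, one option is its na parameter, which sets the result for entries that cannot be matched. A minimal sketch, assuming the sample data from the question:

```python
import numpy as np
import pandas as pd

# Sample data assumed from the question: strings, an integer and NaN.
df = pd.DataFrame({"c1": ["a", 2, "b", "c", "ab", "bc", np.nan]})

# na=False makes missing (and non-string) entries count as "no match",
# so the result is a clean boolean Series that converts to int.
df["a"] = df["c1"].str.contains("a", na=False).astype(int)
print(df)
```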

Here's an alternative using Series.apply:

df['a'] = df['c1'].apply(lambda x: int('a' in x) if isinstance(x, str) else 0)

This approach also deals with columns that are composed of multiple types: it returns 1 only when a given row is a string, in addition to having the appropriate character inside.
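To build all three columns from the question, the same check can be run once per letter. This is only a sketch: the helper name has_letter and the hard-coded list of letters are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Sample data assumed from the question.
df = pd.DataFrame({"c1": ["a", 2, "b", "c", "ab", "bc", np.nan]})


def has_letter(value, letter):
    """Return 1 if value is a string containing letter, else 0."""
    return int(isinstance(value, str) and letter in value)


# One indicator column per letter of interest.
for letter in ["a", "b", "c"]:
    df[letter] = df["c1"].apply(has_letter, args=(letter,))

print(df)
```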

Upvotes: 2

T1Berger

Reputation: 505

For your problem you can use the pandas.get_dummies() function, which converts a categorical variable into indicator variables.

  1. Convert your dataframe column to a list (optional).
  2. Create the classification dummy variables with this code:
    import numpy as np
    import pandas as pd
    lst = ['a', 2, 'b', 'c', 'ab', np.nan]

    pd.get_dummies(lst).T
  3. Compare and merge the dummy indicators to get your desired result.
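The last step is left open above. One possible reading, offered as an assumption rather than as the exact method intended, is to combine every dummy column whose label contains a given letter:

```python
import numpy as np
import pandas as pd

# Sample data assumed from the question.
df = pd.DataFrame({"c1": ["a", 2, "b", "c", "ab", "bc", np.nan]})

# One indicator column per distinct non-missing value (2, 'a', 'ab', ...).
dummies = pd.get_dummies(df["c1"])

for letter in ["a", "b", "c"]:
    # Dummy columns whose label is a string containing this letter.
    matching = [
        col for col in dummies.columns if isinstance(col, str) and letter in col
    ]
    # A row gets 1 if any of the matching dummy columns is set.
    df[letter] = dummies[matching].any(axis=1).astype(int)

print(df)
```

The NaN row has no dummy column set, so it ends up with 0 in every letter column.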

Upvotes: 1
