lbh57

Reputation: 31

Create a True/False (1/0) column in a pandas DataFrame based on the values of two columns

I'm having trouble creating a new column based on the columns 'language_1' and 'language_2' in a pandas DataFrame. I want to create a 'bilingual' column where 1 represents a user who speaks both English and Spanish (bilingual) and 0 represents a non-bilingual speaker. Ultimately I want to compare the average ratings of the two groups, but I want to categorize them first. I tried using if statements, but I'm not sure how to write one that combines multiple conditions into a single value. Thank you for any help.


===============================================================================================

name      language_1    language_2    rating    bilingual
Kevin     English       Null          4.25
Miguel    English       Spanish       4.56
Carlos    English       Spanish       4.61
Aaron     Null          Spanish       4.33


===============================================================================================

Here is the code I've tried to use to append the new column to my dataframe.

def label_bilingual(row):
    if row('language_english') == row['English'] and row('language_spanish') == 'Spanish':
        val = 1
    else:
        val = 0

df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)

Here is the error I'm getting.

----> 1 df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)
'Series' object is not callable

Upvotes: 0

Views: 663

Answers (2)

PacketLoss

Reputation: 5746

Your function has a few issues: one is causing your error, and the others will cause further problems once that is fixed.


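1 - You have tried to access the column with row('name'). Parentheses call the object, and a Series is not callable; use square brackets instead, as in row['name']. A DataFrame behaves the same way: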

df('row')
Traceback (most recent call last):
  File "<pyshell#30>", line 1, in <module>
    df('row')
TypeError: 'DataFrame' object is not callable

2 - You have tried to compare row['column'] to row['English'], which will not work because no column named English exists. Compare against the string 'English' instead.

KeyError: 'English'

3 - You assign val but never return it, so the function returns None for every row.

    val = 1

    val = 0

You need to modify your function as below to resolve these errors.


def label_bilingual(row):
    if row['language_1'] == 'English' and row['language_2'] == 'Spanish':
        return 1
    else:
        return 0

Output

>>> df['bilingual'] = df.apply(label_bilingual, axis=1)
>>> df
     name language_1 language_2  rating  bilingual
0   Kevin    English       Null    4.25          0
1  Miguel    English    Spanish    4.56          1
2  Carlos    English    Spanish    4.61          1
3   Aaron       Null    Spanish    4.33          0
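For larger frames, the same condition can also be written without apply by comparing whole columns at once. This is only a minimal sketch, assuming the sample data from the question; the frame is rebuilt here so the snippet runs on its own, and the names is_bilingual and by_group are just illustrative. It also shows one way to get the average rating per group, which the question mentions as the end goal.

```python
import pandas as pd

# Sample data copied from the question ('Null' is a plain string here).
df = pd.DataFrame({
    'name': ['Kevin', 'Miguel', 'Carlos', 'Aaron'],
    'language_1': ['English', 'English', 'English', 'Null'],
    'language_2': ['Null', 'Spanish', 'Spanish', 'Spanish'],
    'rating': [4.25, 4.56, 4.61, 4.33],
})

# Element-wise comparison on whole columns; & combines the two boolean Series.
is_bilingual = (df['language_1'] == 'English') & (df['language_2'] == 'Spanish')
df['bilingual'] = is_bilingual.astype(int)

# Average rating per group (0 = not bilingual, 1 = bilingual).
by_group = df.groupby('bilingual')['rating'].mean()
print(by_group)
```

Note that & is used instead of and, because and cannot combine two Series element by element, and each comparison needs its own parentheses.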

Upvotes: 1

NotAName

Reputation: 4322

To make it simpler, I'd suggest recording missing values in either column as numpy.nan. For example, if missing values were recorded as np.nan:

import numpy as np

bilingual = np.where(df[['language_1', 'language_2']].isna().any(axis=1), 0, 1)
df['bilingual'] = bilingual

Here np.where evaluates the condition inside it, which checks row by row whether the value in either language column is missing. If it is, the person is not bilingual and gets a 0, otherwise a 1. Note that this only tests for missing values, so it treats any two recorded languages as bilingual rather than English and Spanish specifically.
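In the question's sample the missing entries appear as the text 'Null' rather than as real missing values, so they would need converting first. A rough sketch of the whole approach, assuming 'Null' is stored as a plain string (the cols list is just a helper name for illustration):

```python
import numpy as np
import pandas as pd

# Sample data copied from the question, with 'Null' as a plain string.
df = pd.DataFrame({
    'name': ['Kevin', 'Miguel', 'Carlos', 'Aaron'],
    'language_1': ['English', 'English', 'English', 'Null'],
    'language_2': ['Null', 'Spanish', 'Spanish', 'Spanish'],
    'rating': [4.25, 4.56, 4.61, 4.33],
})

cols = ['language_1', 'language_2']

# Replace the literal string 'Null' with a real missing value (NaN).
df[cols] = df[cols].mask(df[cols] == 'Null')

# 0 if either language is missing in a row, otherwise 1.
df['bilingual'] = np.where(df[cols].isna().any(axis=1), 0, 1)
print(df)
```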

Upvotes: 0
