Reputation: 2135
I am trying to find if a string exists across multiple columns. I would like to return a 1 if the string exists and 0 if it doesn't as a new series within the dataframe.
After searching the forums, I understand that str.contains could be used, but i'm searching over 100+ columns therefore it isn't efficient for me to work with individual series at a time.
There are some NAs within the columns if this is relevant.
Example simplified dataframe:
d = {'strings_1': ['AA', 'AB', 'AV'], 'strings_2': ['BB', 'BA', 'AG'],
'strings_1': ['AE', 'AC', 'AI'], 'strings_3': ['AA', 'DD', 'PP'],
'strings_4': ['AV', 'AB', 'BV']}
simple_df = pd.DataFrame(data=d)
If I am interested in finding 'AA' for example, I would like to return the following dataframe.
Example target dataframe:
d = {'strings_1': ['AA', 'AB', 'AV'], 'strings_2': ['BB', 'BA', 'AG'],
'strings_1': ['AE', 'AC', 'AI'], 'strings_3': ['AA', 'DD', 'PP'],
'strings_4': ['AV', 'AB', 'BV'], 'AA_TRUE': [1, 0, 0]}
target_df = pd.DataFrame(data=d)
Many thanks for help.
Upvotes: 9
Views: 8816
Reputation: 862601
If need check mixed values - numeric with strings compare numpy array created by values
, use DataFrame.any
for check at least one True
per row and last cast to int
:
simple_df['new'] = (simple_df.values == 'AA').any(1).astype(int)
#or cast all values to string before comparing
#simple_df['new'] = (simple_df.astype(str)== 'AA').any(1).astype(int)
print (simple_df)
strings_1 strings_2 strings_3 strings_4 new
0 AE BB AA AV 1
1 AC BA DD AB 0
2 AI AG PP BV 0
Detail:
print ((simple_df.values == 'AA'))
[[False False True False False]
[False False False False False]
[False False False False False]]
print ((simple_df.values == 'AA').any(1))
[ True False False]
If need check substring:
simple_df['new'] = simple_df.applymap(lambda x: 'G' in x).any(1).astype(int)
print (simple_df)
strings_1 strings_2 strings_3 strings_4 new
0 AE BB AA AV 0
1 AC BA DD AB 0
2 AI AG PP BV 1
Upvotes: 13