Reputation: 307
I have a pandas dataframe like below. I want to search a text in each row of the dataframe and highlight if that text appears in the row.
For example, I want to search each row for "jones". I want to ignore the case of my search word. In the below case, I would like to add a new column to data called "jones" and it would have values 1,1,0 as that word was found in 1st and 2nd row
I found this post which shows how to find a text in a column, but how could I find a text when I have many columns - say 50+? I thought about concatenating all the columns and creating a new column, but didn't see any function that would concatenate all columns of a dataframe (without asking to type each column name)
I would like to do this for multiple keywords that I have. For example I have list of keyword LLC, Co, Blue, alpha
and many more (30+)
sales = [{'account': 'Jones LLC', 'Jan': '150', 'Feb': '200', 'Mar': '140'},
{'account': 'Alpha Co', 'Jan': 'Jones', 'Feb': '210', 'Mar': '215'},
{'account': 'Blue Inc', 'Jan': '50', 'Feb': '90', 'Mar': '95' }]
df = pd.DataFrame(sales)
Source DF:
Feb Jan Mar account
0 200 150 140 Jones LLC
1 210 Jones 215 Alpha Co
2 90 50 95 Blue Inc
Desired DF:
Feb Jan Mar account jones llc co blue alpha
0 200 150 140 Jones LLC 1 1 0 0 0
1 210 Jones 215 Alpha Co 1 0 1 0 1
2 90 50 95 Blue Inc 0 0 0 1 0
Upvotes: 4
Views: 4349
Reputation: 210832
UPDATE: you seem to want OneHotEncode some specific words - you can use sklearn.feature_extraction.text.CountVectorizer for that:
In [131]: from sklearn.feature_extraction.text import CountVectorizer
In [132]: vocab = ['jones', 'llc', 'co', 'blue', 'alpha']
In [133]: cv = CountVectorizer(vocabulary=vocab)
In [134]: r = pd.SparseDataFrame((cv.fit_transform(df.select_dtypes('object').add(' ').sum(1)) != 0) * 1,
df.index,
cv.get_feature_names(),
default_fill_value=0)
In [135]: r
Out[135]:
jones llc co blue alpha
0 1 1 0 0 0
1 1 0 1 0 1
2 0 0 0 1 0
you can also merge it with your original DF:
In [137]: df = df.join(r)
In [138]: df
Out[138]:
Feb Jan Mar account jones llc co blue alpha
0 200 150 140 Jones LLC 1 1 0 0 0
1 210 Jones 215 Alpha Co 1 0 1 0 1
2 90 50 95 Blue Inc 0 0 0 1 0
Explanation:
concatenate all string columns into a single one, using space as a separator:
In [165]: df.select_dtypes('object').add(' ').sum(1)
Out[165]:
0 200 150 140 Jones LLC LLC
1 210 Jones 215 Alpha Co
2 90 50 95 Blue Inc
dtype: object
generate a One Hot Encode sparse matrix with selected features:
In [176]: A = (cv.fit_transform(df.select_dtypes('object').add(' ').sum(1)) != 0) * 1
In [177]: A
Out[177]:
<3x5 sparse matrix of type '<class 'numpy.int32'>'
with 6 stored elements in Compressed Sparse Row format>
In [178]: A.A
Out[178]:
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 1],
[0, 0, 0, 1, 0]])
In [179]: cv.get_feature_names()
Out[179]: ['jones', 'llc', 'co', 'blue', 'alpha']
generate a SparseDataFrame out of it:
In [174]: r = pd.SparseDataFrame((cv.fit_transform(df.select_dtypes('object').add(' ').sum(1)) != 0) * 1,
...: df.index,
...: cv.get_feature_names(),
...: default_fill_value=0)
...:
...:
In [175]: r
Out[175]:
jones llc co blue alpha
0 1 1 0 0 0
1 1 0 1 0 1
2 0 0 0 1 0
Upvotes: 2
Reputation: 4744
Here we use pandas built-in str
function contains
, along with apply
and then bring it all together with any
as follows,
search_string = 'Jones'
df[search_string] = (df.apply(lambda x: x.str.contains(search_string))
.any(axis=1).astype(int))
df
Out[2]:
Feb Jan Mar account Jones
0 200 150 140 Jones LLC 1
1 210 Jones 215 Alpha Co 1
2 90 50 95 Blue Inc 0
This can be easily extended as contains
uses regular expressions to do the matching. It also has a case arg so that you can make it case-insensitive and search for both Jones
and jones
.
In order to loop over a list of search words we need to make the following changes. By storing each search result (a Series
) in a list, we use the list to join the series together in to DataFrame
. We do this because we don't want to search new columns for the new search_string,
df_list = []
for search_string in ['Jones', 'Co', 'Alpha']:
#use above method but rename the series instead of setting to
# a columns. The append to a list.
df_list.append(df.apply(lambda x: x.str.contains(search_string))
.any(axis=1)
.astype(int)
.rename(search_string))
#concatenate the list of series into a DataFrame with the original df
df = pd.concat([df] + df_list, axis=1)
df
Out[5]:
Feb Jan Mar account Jones Co Alpha
0 200 150 140 Jones LLC 1 0 0
1 210 Jones 215 Alpha Co 1 1 1
2 90 50 95 Blue Inc 0 0 0
Upvotes: 3