Reputation: 115
I need help filtering some data. I have a data set made up of text, and I also have a list of words. I would like to filter each row of my data so that the remaining text in each row consists only of words in the list.
words
(cell, CDKs, lung, mutations, monomeric, Casitas, Background, acquired, evidence, kinases, small, evidence, Oncogenic)
data
ID Text
0 Cyclin-dependent kinases CDKs regulate a
1 Abstract Background Non-small cell lung
2 Abstract Background Non-small cell lung
3 Recent evidence has demonstrated that acquired
4 Oncogenic mutations in the monomeric Casitas
After filtering, I would like the DataFrame to look like this:
data
ID Text
0 kinases CDKs
1 Background cell lung
2 Background small cell lung
3 evidence acquired
4 Oncogenic mutations monomeric Casitas
I tried using iloc and similar functions, but I can't seem to get it to work. Any help with that?
Upvotes: 0
Views: 48
Reputation: 51165
You can simply use apply() along with a simple list comprehension:
>>> df['Text'].apply(lambda x: ' '.join([i for i in x.split() if i in words]))
0 kinases CDKs
1 Background cell lung
2 Background cell lung
3 evidence acquired
4 Oncogenic mutations monomeric Casitas
Also, I made words a set to improve performance (O(1) average lookup time). I recommend you do the same.
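For anyone who wants to try the filtering step on its own, here is a minimal sketch in plain Python. The helper name keep_listed_words and the variable rows are hypothetical, and the sample rows are copied from the question. It assumes tokens are separated by whitespace, so a hyphenated token such as "Non-small" is treated as one word and does not match "small".

```python
# Allowed words stored as a set for O(1) average membership checks.
words = {"cell", "CDKs", "lung", "mutations", "monomeric", "Casitas",
         "Background", "acquired", "evidence", "kinases", "small", "Oncogenic"}

# Sample rows copied from the question.
rows = [
    "Cyclin-dependent kinases CDKs regulate a",
    "Abstract Background Non-small cell lung",
    "Abstract Background Non-small cell lung",
    "Recent evidence has demonstrated that acquired",
    "Oncogenic mutations in the monomeric Casitas",
]

def keep_listed_words(text, allowed):
    """Keep only whitespace-separated tokens that appear in `allowed`."""
    return " ".join(token for token in text.split() if token in allowed)

filtered = [keep_listed_words(row, words) for row in rows]
for line in filtered:
    print(line)
```

With a DataFrame, the same helper should slot into the call above as df['Text'].apply(lambda x: keep_listed_words(x, words)).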
Upvotes: 4
Reputation: 4536
I'm not certain this is the most elegant of solutions, but you could do:
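import pandas as pd

to_remove = ['foo', 'bar']
df = pd.DataFrame({'Text': [
    'spam foo& eggs',
    'foo bar eggs bacon and lettuce',
    'spam and foo eggs'
]})
# Join the words into one alternation pattern; regex=True is needed because
# newer pandas versions treat the pattern as a literal string by default.
df['Text'].str.replace('|'.join(to_remove), '', regex=True)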
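One caveat with the joined pattern above: it matches substrings, so 'foo' would also be stripped out of a word like 'food'. A possible refinement, sketched here with the standard re module, wraps the alternation in word boundaries and escapes each word. The helper name strip_words and the extra test string 'food barn' are made up for illustration.

```python
import re

to_remove = ["foo", "bar"]

# \b anchors each alternative at word boundaries; re.escape guards against
# words that contain regex metacharacters.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, to_remove)) + r")\b")

def strip_words(text):
    """Remove whole-word matches, then collapse leftover whitespace."""
    cleaned = pattern.sub("", text)
    return " ".join(cleaned.split())

texts = [
    "spam foo& eggs",
    "foo bar eggs bacon and lettuce",
    "spam and foo eggs",
    "food barn",  # hypothetical row: neither word should be touched
]
result = [strip_words(t) for t in texts]
print(result)
```

The same pattern string (pattern.pattern) could be passed to df['Text'].str.replace(..., regex=True) if you prefer to stay within pandas.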
Upvotes: 1