Dela

Reputation: 115

Data frame text filtering with text

I need help filtering some data. I have a data set made up of text, and I also have a list of words. I would like to filter each row so that the remaining text consists only of words that appear in the list.

words

(cell, CDKs, lung, mutations, monomeric, Casitas, Background, acquired, evidence, kinases, small, evidence, Oncogenic)


data

ID  Text

0   Cyclin-dependent kinases CDKs regulate a 

1   Abstract Background Non-small cell lung  

2   Abstract Background Non-small cell lung 

3   Recent evidence has demonstrated that acquired

4   Oncogenic mutations in the monomeric Casitas  

After filtering, I would like the data frame to look like this:

data

ID  Text

0    kinases CDKs  

1   Background cell lung  

2   Background small cell lung 

3   evidence acquired

4   Oncogenic mutations monomeric Casitas  

I tried using iloc and similar functions, but I can't get it to work. Any help with that?

Upvotes: 0

Views: 48

Answers (2)

user3483203

Reputation: 51165

You can simply use apply() along with a simple list comprehension:

>>> df['Text'].apply(lambda x: ' '.join([i for i in x.split() if i in words]))
0                             kinases CDKs
1                     Background cell lung
2                     Background cell lung
3                        evidence acquired
4    Oncogenic mutations monomeric Casitas

Also, I made words a set to improve performance (O(1) average lookup time); I recommend you do the same.
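For reference, here is a self-contained sketch of the approach above. It assumes the frame is called df with a Text column as in the question; the sample rows are copied from the question, and the helper name keep_listed and the Filtered column are only illustrative.

```python
import pandas as pd

# Words to keep, stored as a set for O(1) average membership tests.
words = {"cell", "CDKs", "lung", "mutations", "monomeric", "Casitas",
         "Background", "acquired", "evidence", "kinases", "small", "Oncogenic"}

# Sample rows copied from the question.
df = pd.DataFrame({"Text": [
    "Cyclin-dependent kinases CDKs regulate a",
    "Abstract Background Non-small cell lung",
    "Abstract Background Non-small cell lung",
    "Recent evidence has demonstrated that acquired",
    "Oncogenic mutations in the monomeric Casitas",
]})

def keep_listed(text):
    """Keep only the whitespace-separated tokens that appear in `words`."""
    return " ".join(token for token in text.split() if token in words)

# Hypothetical output column; assign back to "Text" to overwrite instead.
df["Filtered"] = df["Text"].apply(keep_listed)
print(df["Filtered"].tolist())
```

Note that splitting on whitespace treats "Non-small" as a single token, so "small" is not kept for those rows.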

Upvotes: 4

Aaron Brock

Reputation: 4536

I'm not certain this is the most elegant of solutions, but you could do:

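import pandas as pd

to_remove = ['foo', 'bar']
df = pd.DataFrame({'Text': [
    'spam foo& eggs',
    'foo bar eggs bacon and lettuce',
    'spam and foo eggs'
]})

# regex=True is required for the '|' alternation; pandas 2.0+ matches literally by default.
df['Text'].str.replace('|'.join(to_remove), '', regex=True)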
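Note that this removes the listed words, whereas the question asks to keep only the listed words. A minimal sketch of the inverse, using the same alternation idea with str.findall and word boundaries, might look like the following; the names keep, pattern, and kept are illustrative, and the two sample rows are taken from the question.

```python
import re
import pandas as pd

# Subset of the question's word list, enough to cover the two sample rows.
keep = ["kinases", "CDKs", "small", "cell", "lung", "Background"]

# Alternation of escaped words, anchored on word boundaries so that only
# whole words match (case-sensitive).
pattern = r"\b(?:" + "|".join(map(re.escape, keep)) + r")\b"

df = pd.DataFrame({"Text": [
    "Cyclin-dependent kinases CDKs regulate a",
    "Abstract Background Non-small cell lung",
]})

# findall returns a list of matches per row; join them back into one string.
kept = df["Text"].str.findall(pattern).str.join(" ")
print(kept.tolist())
```

Because a hyphen counts as a word boundary, this version keeps "small" out of "Non-small", which matches one of the expected rows in the question.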

Upvotes: 1
