Data frame text filtering with text

Question

I need some help filtering some data. I have a data set made up of text, and I also have a list of words. I would like to filter each row of my data so that the text remaining in each row is made up only of words that appear in the list.

words

(cell, CDKs, lung, mutations, monomeric, Casitas, Background, acquired, evidence, kinases, small, evidence, Oncogenic)


data

ID  Text

0   Cyclin-dependent kinases CDKs regulate a 

1   Abstract Background Non-small cell lung  

2   Abstract Background Non-small cell lung 

3   Recent evidence has demonstrated that acquired

4   Oncogenic mutations in the monomeric Casitas

After filtering, I would like the data frame to look like this:

data

ID  Text

0    kinases CDKs  

1   Background cell lung  

2   Background small cell lung 

3   evidence acquired

4   Oncogenic mutations monomeric Casitas

I tried using iloc and similar functions, but I can't seem to get it to work. Any help with that?

user3483203 · Accepted Answer

You can simply use apply() along with a simple list comprehension:

>>> df['Text'].apply(lambda x: ' '.join([i for i in x.split() if i in words]))
0                             kinases CDKs
1                     Background cell lung
2                     Background cell lung
3                        evidence acquired
4    Oncogenic mutations monomeric Casitas

Note that I made words a set, which gives O(1) average lookup time per word instead of a linear scan through a list. I recommend you do the same.
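For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same filter. The helper names keep_listed and keep_listed_split_hyphens and the output column names are my own, chosen for illustration. The second helper is an optional variant, not part of the approach above: it assumes that hyphenated tokens such as "Non-small" should be split so that "small" can match, which the expected output for row 2 in the question seems to suggest.

```python
import re

import pandas as pd

# Word list from the question, stored as a set for O(1) average membership tests.
words = {
    "cell", "CDKs", "lung", "mutations", "monomeric", "Casitas",
    "Background", "acquired", "evidence", "kinases", "small", "Oncogenic",
}

# Sample rows copied from the question.
df = pd.DataFrame({
    "Text": [
        "Cyclin-dependent kinases CDKs regulate a",
        "Abstract Background Non-small cell lung",
        "Abstract Background Non-small cell lung",
        "Recent evidence has demonstrated that acquired",
        "Oncogenic mutations in the monomeric Casitas",
    ]
})


def keep_listed(text, allowed):
    """Keep whitespace-separated tokens found in `allowed`, preserving order."""
    return " ".join(tok for tok in text.split() if tok in allowed)


def keep_listed_split_hyphens(text, allowed):
    """Hypothetical variant: also split on hyphens, so 'Non-small' yields 'small'."""
    tokens = re.split(r"[\s-]+", text)
    return " ".join(tok for tok in tokens if tok in allowed)


# Same logic as the one-liner above, written out and assigned to new columns.
df["Filtered"] = df["Text"].apply(lambda s: keep_listed(s, words))
df["FilteredHyphen"] = df["Text"].apply(lambda s: keep_listed_split_hyphens(s, words))

print(df[["Filtered", "FilteredHyphen"]])
```

With plain whitespace splitting, "Non-small" stays a single token and is dropped, so rows 1 and 2 come out as "Background cell lung". With the hyphen-splitting variant they come out as "Background small cell lung". Matching is case-sensitive in both helpers.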
