Reputation: 29
I have the following csv file:
start_date,end_date,pollster,sponsor,sample_size,population,party,subject,tracking,text,approve,disapprove,url
2020-02-02,2020-02-04,YouGov,Economist,1500,a,all,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,42,29,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,376,a,R,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,75,6,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,523,a,D,Trump,TRUE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,21,51,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,599,a,I,Trump,,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,39,25,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-07,2020-02-09,Morning Consult,"",2200,a,all,Trump,TURE,Do you approve or disapprove of the job each of the following is doing in handling the spread of coronavirus in the United States? President Donald Trump,57,22,https://morningconsult.com/wp-content/uploads/2020/02/200214_crosstabs_CORONAVIRUS_Adults_v4_JB.pdf
I want to find all the rows where the "text" column contains both "Trump" and "coronavirus".
I am using str.contains():
approval_polls[approval_polls.text.str.contains("Trump", "coronavirus")]
It seems that I am getting the correct output, but I am not sure whether str.contains() can take two words as parameters.
Can anyone help me with that?
Output:
start_date end_date pollster sponsor sample_size population party subject tracking text approve disapprove url
0 2020-02-02 2020-02-04 YouGov Economist 1500.0 a all Trump FALSE Do you approve or disapprove of Donald Trump’s... 42.0 29.0 https://d25d2506sfb94s.cloudfront.net/cumulus_...
1 2020-02-02 2020-02-04 YouGov Economist 376.0 a R Trump FALSE Do you approve or disapprove of Donald Trump’s... 75.0 6.0 https://d25d2506sfb94s.cloudfront.net/cumulus_...
2 2020-02-02 2020-02-04 YouGov Economist 523.0 a D Trump FALSE Do you approve or disapprove of Donald Trump’s... 21.0 51.0 https://d25d2506sfb94s.cloudfront.net/cumulus_...
Upvotes: 0
Views: 1499
Reputation: 41
You can do this with a regex, but it takes a small trick because regex has no direct "AND" operator: chain one lookahead per word, so that each word must appear somewhere in the string.
import re
approval_polls[approval_polls.text.str.contains('(?=.*trump)(?=.*coronavirus)', regex=True, flags=re.IGNORECASE)]
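To see why chained lookaheads behave like an "AND", here is a minimal sketch using only Python's re module. The sample strings are made up for illustration and are not taken from the CSV in the question.

```python
import re

# Each lookahead (?=.*word) asserts that "word" occurs somewhere ahead
# without consuming any text, so chaining two of them acts like a logical AND.
both = re.compile(r"(?=.*trump)(?=.*coronavirus)", flags=re.IGNORECASE)

# Hypothetical sample strings, not rows from the question's CSV.
samples = [
    "Trump's handling of the coronavirus outbreak",
    "coronavirus response by President Trump",
    "Trump's handling of the economy",
    "the spread of coronavirus",
]

matches = [bool(both.search(s)) for s in samples]
print(matches)  # [True, True, False, False]
```

The order of the words in the text does not matter, because every lookahead starts scanning from the same position.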
Upvotes: 2
Reputation: 151
In your example case all rows contain both keywords, so you should get all five rows returned.
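The function call contains('Trump', 'coronavirus')
does not search for both words. The second positional parameter of str.contains() is case, not a second pattern, so 'coronavirus' is treated as a truthy case flag and only 'Trump' is matched (case-sensitively). To get only the rows that contain 'Trump' AND 'coronavirus', you can combine two boolean masks: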
df[df['text'].str.contains('Trump') & df['text'].str.contains('coronavirus')]
Or you could use a regular expression, e.g.,
df[df['text'].str.contains(r'^(?=.*Trump)(?=.*coronavirus)')]
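As a rough check that the two approaches agree, the sketch below runs both on a small made-up DataFrame. The rows are hypothetical stand-ins for the "text" column, and the na=False and regex=False arguments are additions of mine to handle a missing value and to force plain substring matching.

```python
import pandas as pd

# Hypothetical stand-in for the "text" column; one entry is missing on purpose.
df = pd.DataFrame({
    "text": [
        "Donald Trump's handling of the coronavirus outbreak",
        "President Trump's handling of the economy",
        "the spread of coronavirus in the United States",
        None,
    ]
})

# regex=False does plain substring matching; na=False treats missing text as "no match".
has_trump = df["text"].str.contains("Trump", regex=False, na=False)
has_virus = df["text"].str.contains("coronavirus", regex=False, na=False)

# & combines the two boolean masks element-wise (logical AND).
mask_and = has_trump & has_virus

# The lookahead pattern should select the same rows.
mask_regex = df["text"].str.contains(r"^(?=.*Trump)(?=.*coronavirus)", na=False)

print(df[mask_and])
```

Without na=False, a missing value in the column would yield a missing entry in the mask, which typically makes the boolean indexing fail, so it is worth setting whenever the column may have gaps.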
Upvotes: 2