Reputation: 3
I have a data frame A with a column called text that holds long strings. I want to keep only the rows of A whose text contains any of the values in a list called author_id.
A data frame:
Dialogue Index author_id text
10190 0 573660 How is that even possible?
10190 1 23442 @573660 I do apologize.
10190 2 573661 @AAA do you still have the program for free checked bags?
author_id list:
[573660, 573678, 5736987]
So since 573660 is in the author_id list and is in the text column of A, my expected outcome would be to keep only the second row of the data frame A:
Dialogue Index author_id text
10190 1 23442 @573660 I do apologize.
The most naive solution I can think of would be:
new_A = pd.DataFrame()
for id in author_id:
    new_A = pd.concat([new_A, A[A['text'].str.contains(str(id), na=False)]])
but this will take a long time.
So I came up with this solution:
[id in text for id in author_id for text in df['text'] ]
But this doesn't work for subsetting the data frame, because it returns a True/False value for every string in df['text'] for each author id, rather than one value per row.
So I created a new column in the data frame that combines Dialogue and Index, so that I could return it from the list comprehension, but that gave an error I don't know how to interpret:
A["DialogueIndex"]= df["Dialogue"].map(str) + df["Index"]
newA = [did for did in df["DialogueIndex"] for id in author_id if df['text'].str.contains(id) ]
error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Please help.
Upvotes: 0
Views: 1599
Reputation: 13725
You could use apply to check, for each row, whether any item in author_id_list appears in the text:
df[df.text.apply(lambda x: any(str(e) in x for e in author_id_list))]
Dialogue Index author_id text
1 10190 1 23442 @573660 I do apologize.
There may be a faster way to do this, but I believe this will get you the answer you are looking for.
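As a rough, self-contained sketch of the above (the sample frame is rebuilt from the rows shown in the question, and names such as mask_for_one_id and row_mask are only illustrative): str.contains returns a whole boolean Series, so using it directly in an `if` raises the ValueError from the question, whereas apply evaluates one row at a time.

```python
import pandas as pd

# Sample data rebuilt from the rows shown in the question
df = pd.DataFrame({
    'Dialogue': [10190, 10190, 10190],
    'Index': [0, 1, 2],
    'author_id': [573660, 23442, 573661],
    'text': ['How is that even possible?',
             '@573660 I do apologize.',
             '@AAA do you still have the program for free checked bags?'],
})
author_id_list = [573660, 573678, 5736987]

# str.contains gives one True/False per row, i.e. a Series, not a single bool
mask_for_one_id = df['text'].str.contains(str(author_id_list[0]), na=False)

# Using that Series where a single bool is expected raises ValueError
try:
    if mask_for_one_id:
        pass
    raised = False
except ValueError:
    raised = True

# apply checks each text on its own, so any() receives plain Python bools
row_mask = df['text'].apply(lambda x: any(str(e) in x for e in author_id_list))
result = df[row_mask]
print(result)
```

The ids are converted with str() because the list in the question holds integers while the text column holds strings.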
Upvotes: 0
Reputation: 59579
Simply use str.contains to see if text ever contains any of the authors in your specified list (by joining all of the authors with |):
import pandas as pd
df = pd.DataFrame({
'Dialogue': [10190, 10190, 10190],
'Index': [0,1,2],
'author_id': [573660,23442,573661],
'text': ['How is that even possible?',
'@573660 I do apologize.',
'@AAA do you still have the program for free checked bags?']
})
author_id_list = [573660, 573678, 5736987]
df.text.str.contains('|'.join(list(map(str, author_id_list))))
#0 False
#1 True
#2 False
#Name: text, dtype: bool
Then you can just mask the original DataFrame:
df[df.text.str.contains('|'.join(list(map(str, author_id_list))))]
# Dialogue Index author_id text
#1 10190 1 23442 @573660 I do apologize.
If your author_id_list is already strings, then you can get rid of the list(map(...)) and just join the original list.
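One caveat worth sketching: a plain joined pattern matches substrings, so 573660 would also match inside a longer number such as 5736601. The variant below is a possible refinement, not part of the approach above. It assumes ids should only match when not surrounded by other digits, and the third sample row is made up to show the difference. re.escape is not needed for purely numeric ids, but it keeps the pattern safe if the ids ever contain regex metacharacters.

```python
import re
import pandas as pd

# The third row is a hypothetical example of a longer id that merely starts with 573660
df = pd.DataFrame({
    'text': ['How is that even possible?',
             '@573660 I do apologize.',
             '@5736601 is a different, longer id.'],
})
author_id_list = [573660, 573678, 5736987]

# Plain alternation: matches any occurrence, including inside longer numbers
plain_pattern = '|'.join(map(str, author_id_list))
plain_mask = df['text'].str.contains(plain_pattern, na=False)

# Escaped alternation with digit boundaries: the id must not touch another digit
alternatives = '|'.join(re.escape(str(a)) for a in author_id_list)
strict_pattern = rf'(?<!\d)(?:{alternatives})(?!\d)'
strict_mask = df['text'].str.contains(strict_pattern, regex=True, na=False)

print(df[strict_mask])
```

Here plain_mask also selects the third row, while strict_mask keeps only the row that mentions @573660 exactly.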
Upvotes: 1