Reputation: 2015
I have a table with 87 million rows and 5 columns, and a separate .txt file with around 3500 words. For each row, I want to check whether any of those words appears in any of 4 of the table's columns; if it does, I want to remove that row. This would reduce the number of rows considerably. The following is the code I am using:
import pandas as pd

bad_words = pd.read_csv('badwords.txt')

bad_words.dtypes
words    object
dtype: object

bad_words
    words
0   word1
1   word3
2   word5
3  word13
4  word16
data
s.no     column1     column2     column3     column4
   1   aaaword1b   aaaword2b   aaaword3b   aaaword4b
   2   aaaword5b   aaaword6b   aaaword7b   aaaword8b
   3   aaaword9b  aaaword10b  aaaword11b  aaaword12b
   4  aaaword13b  aaaword14b  aaaword15b  aaaword16b
   5  aaaword17b  aaaword18b  aaaword19b  aaaword20b
I want to remove the rows that contain any word from the bad-words file. The output should be:
data
s.no     column1     column2     column3     column4
   3   aaaword9b  aaaword10b  aaaword11b  aaaword12b
   5  aaaword17b  aaaword18b  aaaword19b  aaaword20b
I am trying to do something like:
data[(data['column1'].str.contains("word1|word3|word5|word13|word16") == False) |
     (data['column2'].str.contains("word1|word3|word5|word13|word16") == False) |
     (data['column3'].str.contains("word1|word3|word5|word13|word16") == False)]
But I am not sure whether this can be done for all 3500 words, and I am also not sure whether it is an efficient approach for 87 million rows.
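Something like the following sketch is what I have in mind for the full list (assuming the words may need regex-escaping and that all four columns should be checked), but I am not sure it is right or fast enough:

import re

# join all ~3500 words from the file into one alternation pattern,
# escaping them in case any contain regex metacharacters
pattern = "|".join(re.escape(w) for w in bad_words['words'])

cols = ['column1', 'column2', 'column3', 'column4']
# keep only the rows where none of the four columns contains a bad word
mask = data[cols].apply(lambda col: col.str.contains(pattern, na=False)).any(axis=1)
data = data[~mask]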
Updated the question with string patterns rather than the direct words. Sorry for the bad requirement earlier.
Can anybody suggest a better way to do this?
Thanks
Upvotes: 3
Views: 376
Reputation: 294488
I altered your example because word1 is technically a substring of word11 and word12, and I don't think that's what you meant.
from io import StringIO  # use `from StringIO import StringIO` on Python 2
import pandas as pd

text_bad_words = """ words
0 _word1_
1 _word3_
2 _word5_
3 _word13_
4 _word16_"""

text_data = """s.no column1 column2 column3 column4
1 aaa_word1_b aaa_word2_b aaa_word3_b aaa_word4_b
2 aaa_word5_b aaa_word6_b aaa_word7_b aaa_word8_b
3 aaa_word9_b aaa_word10_b aaa_word11_b aaa_word12_b
4 aaa_word13_b aaa_word14_b aaa_word15_b aaa_word16_b
5 aaa_word17_b aaa_word18_b aaa_word19_b aaa_word20_b"""

# `squeeze=True` turns the single-column bad-words frame into a Series;
# `index_col=0` keeps the first column as the index
bad_words = pd.read_csv(
    StringIO(text_bad_words), squeeze=True, index_col=0, delim_whitespace=True)
data = pd.read_csv(
    StringIO(text_data), squeeze=True, index_col=0, delim_whitespace=True)
I'll use a regex with str.contains:
regex = r'|'.join(bad_words)
regex
'_word1_|_word3_|_word5_|_word13_|_word16_'
Create a boolean mask:
mask = data.stack().str.contains(regex).unstack().any(axis=1)
mask
s.no
1 True
2 True
3 False
4 True
5 False
dtype: bool
data.loc[~mask]
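If stacking 87 million rows times 4 columns uses too much memory, a rough equivalent (assuming the same data, regex and columns as above) is to build the mask one column at a time:

# same result as the stacked version, but scans one column at a time
mask = pd.Series(False, index=data.index)
for col in data.columns:
    mask |= data[col].str.contains(regex, na=False)

data.loc[~mask]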
Upvotes: 1
Reputation: 215057
You can use the apply method to check each row and build a boolean vector indicating whether the row contains anything in bad_words (using the isin method), and then subset the original data frame with that vector:
data[~data.apply(lambda row: row.isin(bad_words.words).any(), axis=1)]

#    s.no column1 column2 column3 column4
# 2     3   word9  word10  word11  word12
# 4     5  word17  word18  word19  word20
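Note that isin only matches when a cell is exactly equal to one of the bad words, which is why it handles the original example but not the updated substring patterns.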
For the updated question, here is an option that might work depending on your actual data:
data[~data.apply(lambda row: bad_words.words.apply(lambda w: row.str.contains(w + r"(?=\D)").any()).any(), axis=1)]

#    sno     column1     column2     column3     column4
# 2    3   aaaword9b  aaaword10b  aaaword11b  aaaword12b
# 4    5  aaaword17b  aaaword18b  aaaword19b  aaaword20b
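Applying this row by row over 87 million rows will be slow; a rough sketch of the same check done column by column (keeping the (?=\D) lookahead and assuming the column names from the example) might look like:

# combine all the patterns once so each column is scanned a single time
pattern = "|".join(w + r"(?=\D)" for w in bad_words.words)

mask = pd.Series(False, index=data.index)
for col in ['column1', 'column2', 'column3', 'column4']:
    mask |= data[col].str.contains(pattern, na=False)

data[~mask]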
Upvotes: 1