haimen
haimen

Reputation: 2015

Finding words from a text document and removing the corresponding rows in dataframe - python

I am having a table with 87 million rows and 5 columns. I have a separate file too, with around 3500 words. I want to check for the words in .txt file and check for that word in 4 columns of the table for each row. If that word is present in any of the columns, then I want to remove those rows. This would help me to reduce the number of rows here considerably. The following is the code I am using,

bad_words = pd.read_csv('badwords.txt')

bad_words.dtypes

words    object
dtype: object

bad_words

        words
    0   word1
    1   word3
    2   word5
    3   word13
    4   word16
data

s.no      column1        column2        column3         column4
1         aaaword1b      aaaword2b      aaaword3b       aaaword4b
2         aaaword5b      aaaword6b      aaaword7b       aaaword8b                         
3         aaaword9b      aaaword10b     aaaword11b      aaaword12b
4         aaaword13b     aaaword14b     aaaword15b      aaaword16b
5         aaaword17b     aaaword18b     aaaword19b      aaaword20b

I want to remove the rows that contain words from the bad word document. The output of this should be,

data

s.no      column1        column2        column3         column4                       
3         aaaword9b       aaaword10b    aaaword11b      aaaword12b
5         aaaword17b      aaaword18b    aaaword19b      aaaword20b

I am trying to do something like,

data[(data['column1'].str.contains("word1|word3|word5|word13|word16")==False)| 
(data['column2'].str.contains("word1|word3|word5|word13|word16")==False)|
(data['column3'].str.contains("word1|word3|word5|word13|word16")==False)]

But I am not sure whether we can do it for the entire 3500 words. Also not sure whether this is the efficient way to do for 87 million rows.

Updated the question with string patterns rather that the direct words. Sorry for the bad requirement earlier.

Can anybody suggest me a better way to do this?

Thanks

Upvotes: 3

Views: 376

Answers (2)

piRSquared
piRSquared

Reputation: 294488

I altered your example because word1 is technically in word11 and word12 and I don't think that's what you meant.

Setup

from StringIO import StringIO
import pandas as pd

text_bad_words = """        words
    0   _word1_
    1   _word3_
    2   _word5_
    3   _word13_
    4   _word16_"""

text_data = """s.no      column1        column2        column3         column4
1         aaa_word1_b      aaa_word2_b      aaa_word3_b       aaa_word4_b
2         aaa_word5_b      aaa_word6_b      aaa_word7_b       aaa_word8_b
3         aaa_word9_b      aaa_word10_b     aaa_word11_b      aaa_word12_b
4         aaa_word13_b     aaa_word14_b     aaa_word15_b      aaa_word16_b
5         aaa_word17_b     aaa_word18_b     aaa_word19_b      aaa_word20_b"""

bad_words = pd.read_csv(
    StringIO(text_bad_words), squeeze=True, index_col=0, delim_whitespace=True)
data = pd.read_csv(
    StringIO(text_data), squeeze=True, index_col=0, delim_whitespace=True)

Solution

I'll use regex and contains

regex = r'|'.join(bad_words)
regex

'_word1_|_word3_|_word5_|_word13_|_word16_'

Create boolean mask

mask = data.stack().str.contains(regex).unstack().any(1)
mask

s.no
1     True
2     True
3    False
4     True
5    False
dtype: bool

data.loc[~mask]

enter image description here

Upvotes: 1

akuiper
akuiper

Reputation: 215057

You can use apply method to check by row and create a vector indicating whether the row contains anything in the bad_words using the isin method and then subset the original data frame based on the logic vector returned:

data[~data.apply(lambda row: row.isin(bad_words.words).any(), axis = 1)]

#s.no   column1 column2 column3 column4
#2  3     word9  word10  word11  word12
#4  5    word17  word18  word19  word20

For the updated question, here is an option that might work depending on your actual data:

data[~data.apply(lambda row: bad_words.words.apply(lambda w: row.str.contains(w + "(?=\D)").any()).any(), axis = 1)]


# sno      column1     column2     column3     column4
#2  3    aaaword9b  aaaword10b  aaaword11b  aaaword12b
#4  5   aaaword17b  aaaword18b  aaaword19b  aaaword20b

Upvotes: 1

Related Questions