Avinash Clinton
Avinash Clinton

Reputation: 543

How to delete words from a dataframe column that are present in dictionary in Pandas

An extension to : Removing list of words from a string

I have following dataframe and I want to delete frequently occuring words from df.name column:

df :

name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh  
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark

I'm creating a new dataframe with words and their frequency with following code :

df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df.reset_index(level=0, inplace=True)
df.columns = ['word', 'freq']
df = df[df['freq'] >= 3]

which will result in

df2 :

word    freq
Clinton 4
Bill    3
James   3
Clark   3

Then I'm converting it into a dictionary with following code snippet :

    d = dict(zip(df['word'], df['freq']))

Now if I've to remove words from df.name that are in d(which is dictionary, with word : freq), I'm using following code snippet :

def check_thresh_word(merc,d):
    m = merc.split(' ')
    for i in range(len(m)):
            if m[i] in d.keys():
                return False
    else:
        return True

def rm_freq_occurences(merc,d):
    if check_thresh_word(merc,d) == False:
        nwords = merc.split(' ')
        rwords = [word for word in nwords if word not in d.keys()]
        m = ' '.join(rwords)
    else:
        m=merc
    return m

df['new_name'] = df['name'].apply(lambda x: rm_freq_occurences(x,d))

But in actual my dataframe(df) contains nearly 240k rows and i've to use threshold(thresh=3 in above sample) greater than 100. So above code takes lots of time to run because of complex search. Is there any effiecient way to make it faster??

Following is a desired output :

name
Hayden
Rock
Gates
Vishal
Cameroon
Micky
Michael
Tony Waugh
Tom
Tommy
Avinash
Shreyas
Ramesh
Adam

Thanks in advance!!!!!!!

Upvotes: 3

Views: 5382

Answers (1)

jezrael
jezrael

Reputation: 862591

Use replace by regex created by joined all values of column word, last strip traling whitespaces:

data.name = data.name.replace('|'.join(df['word']), '', regex=True).str.strip()

Another solution is add \s* for select zero or more whitespaces:

pat = '|'.join(['\s*{}\s*'.format(x) for x in df['word']])
print (pat)
\s*Clinton\s*|\s*James\s*|\s*Bill\s*|\s*Clark\s*

data.name = data.name.replace(pat, '', regex=True)

print (data)
          name
0       Hayden
1         Rock
2        Gates
3       Vishal
4     Cameroon
5        Micky
6      Michael
7   Tony Waugh
8          Tom
9          Tom
10     Avinash
11     Shreyas
12      Ramesh
13        Adam

Upvotes: 2

Related Questions