LeMarque
LeMarque

Reputation: 783

removing multiple character combinations from words in each rows of a pandas dataframe

I have a large dataset of something similar like in the below sample data:

import pandas as pd

raw_data = {'ID': [1,2,3,4,5,6,7,8,9,10], 
        'body': ['FITrnXS$100', '$1000rnReason', 'rnIf', 'bevlauedrnrnnext', 'obccrnrnnoncrnrnactionrn', 'rnrnnotification', 'insdrnrnnon', 'rnrnupdated', 'rnreason', 'rnrnrnLOR']}
df = pd.DataFrame(raw_data, columns = ['ID', 'body'])
df

What I wanted to do is to use the list of words that i have defined as in below code:

remove_string = ['rn', 'rnr', 'rnrn', 'rnrnrn']

and then use the above remove_string to remove these from words in the text ('body' column of dataframe) if found.

Below table will be expected result

ID  body            cleaned_txt    Removed_string
1   FITrnXS$100     FIT XS$100     rn
2   $1000rnReason   $1000 Reason      rn
3   rnIf            IF               rn
4   bevlauedrnrnnext    bevalue next    rnrn
5   obccrnrnnoncrnrnactionrn    obcc nonc actionrn  rnrn
6   rnrnnotification    notification    rnrn
7   insdrnrnnon insd non    rnrn
8   rnrnupdated updated rnrn
9   rnreason    reason  rn
10  rnrnrnLOR   LOR rnrnrn

enter image description here

Unfortuinately i was trying to convert the data into a list as follows:

text = df['body'].tolist()

and then apply the function like this:

def clnTxt(text):
    txt = [item.replace('rnrn', '\n') for item in text]
    txt = [item.replace('nrn', '\n') for item in txt]
    return txt

text = clnTxt(text)

Which is not the correct way. I should be able to apply function directly on the dataframe and hence for each row, operation of the cleaning is performed and also the other columns are created.

just looking for a better solution to my issue.

Upvotes: 1

Views: 582

Answers (3)

Sandeep_Rao
Sandeep_Rao

Reputation: 101

We need to use a bit of regex for this hacky solution:

import pandas as pd


import re    


raw_data = {'ID': [1,2,3,4,5,6,7,8,9,10], 
    'body': ['FITrnXS$100', '$1000rnReason', 'rnIf', 'bevlauedrnrnnext', 
    'obccrnrnnoncrnrnactionrn', 'rnrnnotification', 'insdrnrnnon', 
    'rnrnupdated', 
    'rnreason', 'rnrnrnLOR']}


df = pd.DataFrame(raw_data, columns = ['ID', 'body'])
remove_string = ['rn']
cleaned_text=[i.replace("rn", " ") for i in df.body]
removed_string_len=[len(re.findall('\s',i)) for i in cleaned_text]
removed_final=[i*'rn' for i in removed_string_len]
cleaned_final=[i.strip() for i in cleaned_text]
df['cleaned_txt']=cleaned_final
df['Removed']=removed_final
df

Hope this helps!

Upvotes: 2

jezrael
jezrael

Reputation: 863166

Because longer strings contains shorter strings, order is important. So loop by inverse list by [::-1] and use Series.str.extract values to new columns, then use Series.str.replace with same column.

Last use DataFrame.dot for combine all extracted values with separator to new column:

remove_string = ['rn', 'rnr', 'rnrn', 'rnrnrn']

df['cleaned_txt'] = df['body']
for i in remove_string[::-1]:
    df[i] = df['cleaned_txt'].str.extract('({})'.format(i)) 
    df['cleaned_txt'] = df['cleaned_txt'].str.replace(i, '')

df['Removed_string'] = (df[remove_string].notna()
                                         .dot(pd.Index(remove_string) + ',')
                                         .str.strip(','))
df = df.drop(remove_string, axis=1)
print (df)
   ID                      body     cleaned_txt Removed_string
0   1               FITrnXS$100       FITXS$100             rn
1   2             $1000rnReason     $1000Reason             rn
2   3                      rnIf              If             rn
3   4          bevlauedrnrnnext    bevlauednext           rnrn
4   5  obccrnrnnoncrnrnactionrn  obccnoncaction        rn,rnrn
5   6          rnrnnotification    notification           rnrn
6   7               insdrnrnnon         insdnon           rnrn
7   8               rnrnupdated         updated           rnrn
8   9                  rnreason           eason            rnr
9  10                 rnrnrnLOR             LOR         rnrnrn

If need replace by whitespace:

remove_string = ['rn', 'rnr', 'rnrn', 'rnrnrn']

df['cleaned_txt'] = df['body']
for i in remove_string[::-1]:
    df[i] = df['cleaned_txt'].str.extract('({})'.format(i)) 
    df['cleaned_txt'] = df['cleaned_txt'].str.replace(i, ' ')

df['Removed_string'] = (df[remove_string].notna()
                                         .dot(pd.Index(remove_string) + ',')
                                         .str.strip(','))
df = df.drop(remove_string, axis=1)
print (df)

   ID                      body        cleaned_txt Removed_string
0   1               FITrnXS$100         FIT XS$100             rn
1   2             $1000rnReason       $1000 Reason             rn
2   3                      rnIf                 If             rn
3   4          bevlauedrnrnnext      bevlaued next           rnrn
4   5  obccrnrnnoncrnrnactionrn  obcc nonc action         rn,rnrn
5   6          rnrnnotification       notification           rnrn
6   7               insdrnrnnon           insd non           rnrn
7   8               rnrnupdated            updated           rnrn
8   9                  rnreason              eason            rnr
9  10                 rnrnrnLOR                LOR         rnrnrn

EDIT1:

#dictioanry for replace
remove_string = {"rn":" ", "rnr":"\n", "rnrn":"\n", "rnrnrn":"\n"}

#sorting by keys for list of tuples 
rem = sorted(remove_string.items(), key=lambda s: len(s[0]), reverse=True)
print (rem)
[('rnrnrn', '\n'), ('rnrn', '\n'), ('rnr', '\n'), ('rn', ' ')]

df['cleaned_txt'] = df['body']
for i, j in rem:
    df[i] = df['cleaned_txt'].str.extract('({})'.format(i)) 
    df['cleaned_txt'] = df['cleaned_txt'].str.replace(i, j)

cols = list(remove_string.keys())
df['Removed_string'] = (df[cols].notna().dot(pd.Index(cols) + ',')
                                        .str.strip(','))
df = df.drop(remove_string, axis=1)
print (df)
   ID                      body          cleaned_txt Removed_string
0   1               FITrnXS$100           FIT XS$100             rn
1   2             $1000rnReason         $1000 Reason             rn
2   3                      rnIf                   If             rn
3   4          bevlauedrnrnnext       bevlaued\nnext           rnrn
4   5  obccrnrnnoncrnrnactionrn  obcc\nnonc\naction         rn,rnrn
5   6          rnrnnotification       \nnotification           rnrn
6   7               insdrnrnnon            insd\nnon           rnrn
7   8               rnrnupdated            \nupdated           rnrn
8   9                  rnreason              \neason            rnr
9  10                 rnrnrnLOR                \nLOR         rnrnrn

Upvotes: 3

ComplicatedPhenomenon
ComplicatedPhenomenon

Reputation: 4199

import pandas as pd
import re
raw_data = {'ID': [1,2,3,4,5,6,7,8,9,10], 
        'body': ['FITrnXS$100', '$1000rnReason', 'rnIf', 'bevlauedrnrnnext', 'obccrnrnnoncrnrnactionrn', 'rnrnnotification', 'insdrnrnnon', 'rnrnupdated', 'rnreason', 'rnrnrnLOR']}
removed_string =['rn', 'rnr', 'rnrn', 'rnrnrn']
removed_string =  removed_string[::-1]

raw_data['Removed_string'] = []
raw_data['cleaned_txt'] = []

for i in raw_data['body']:
    j = 0
    m = removed_string[j]
    while True:
        m = removed_string[j]
        pattern = re.compile(m)
        n = pattern.findall(i)
        if len(n) != 0: 
            raw_data['cleaned_txt'].append(i.replace(m, ' '))
            raw_data['Removed_string'].append(m)
            break
        j += 1
df = pd.DataFrame(raw_data, columns = ['ID', 'body', 'cleaned_txt', 'Removed_string'])
print(df)

Output is

   ID                      body         cleaned_txt Removed_string
0   1               FITrnXS$100          FIT XS$100             rn
1   2             $1000rnReason        $1000 Reason             rn
2   3                      rnIf                  If             rn
3   4          bevlauedrnrnnext       bevlaued next           rnrn
4   5  obccrnrnnoncrnrnactionrn  obcc nonc actionrn           rnrn
5   6          rnrnnotification        notification           rnrn
6   7               insdrnrnnon            insd non           rnrn
7   8               rnrnupdated             updated           rnrn
8   9                  rnreason               eason            rnr
9  10                 rnrnrnLOR                 LOR         rnrnrn

Upvotes: 1

Related Questions