Reputation: 783
I have a large dataset of something similar like in the below sample data:
import pandas as pd
raw_data = {'ID': [1,2,3,4,5,6,7,8,9,10],
'body': ['FITrnXS$100', '$1000rnReason', 'rnIf', 'bevlauedrnrnnext', 'obccrnrnnoncrnrnactionrn', 'rnrnnotification', 'insdrnrnnon', 'rnrnupdated', 'rnreason', 'rnrnrnLOR']}
df = pd.DataFrame(raw_data, columns = ['ID', 'body'])
df
What I wanted to do is to use the list of words that i have defined as in below code:
remove_string = ['rn', 'rnr', 'rnrn', 'rnrnrn']
and then use the above remove_string to remove these from words in the text ('body' column of dataframe) if found.
Below table will be expected result
ID body cleaned_txt Removed_string
1 FITrnXS$100 FIT XS$100 rn
2 $1000rnReason $1000 Reason rn
3 rnIf IF rn
4 bevlauedrnrnnext bevalue next rnrn
5 obccrnrnnoncrnrnactionrn obcc nonc actionrn rnrn
6 rnrnnotification notification rnrn
7 insdrnrnnon insd non rnrn
8 rnrnupdated updated rnrn
9 rnreason reason rn
10 rnrnrnLOR LOR rnrnrn
Unfortuinately i was trying to convert the data into a list as follows:
text = df['body'].tolist()
and then apply the function like this:
def clnTxt(text):
txt = [item.replace('rnrn', '\n') for item in text]
txt = [item.replace('nrn', '\n') for item in txt]
return txt
text = clnTxt(text)
Which is not the correct way. I should be able to apply function directly on the dataframe and hence for each row, operation of the cleaning is performed and also the other columns are created.
just looking for a better solution to my issue.
Upvotes: 1
Views: 582
Reputation: 101
We need to use a bit of regex for this hacky solution:
import pandas as pd
import re
raw_data = {'ID': [1,2,3,4,5,6,7,8,9,10],
'body': ['FITrnXS$100', '$1000rnReason', 'rnIf', 'bevlauedrnrnnext',
'obccrnrnnoncrnrnactionrn', 'rnrnnotification', 'insdrnrnnon',
'rnrnupdated',
'rnreason', 'rnrnrnLOR']}
df = pd.DataFrame(raw_data, columns = ['ID', 'body'])
remove_string = ['rn']
cleaned_text=[i.replace("rn", " ") for i in df.body]
removed_string_len=[len(re.findall('\s',i)) for i in cleaned_text]
removed_final=[i*'rn' for i in removed_string_len]
cleaned_final=[i.strip() for i in cleaned_text]
df['cleaned_txt']=cleaned_final
df['Removed']=removed_final
df
Hope this helps!
Upvotes: 2
Reputation: 863166
Because longer strings contains shorter strings, order is important. So loop by inverse list by [::-1]
and use Series.str.extract
values to new columns, then use Series.str.replace
with same column.
Last use DataFrame.dot
for combine all extracted values with separator to new column:
remove_string = ['rn', 'rnr', 'rnrn', 'rnrnrn']
df['cleaned_txt'] = df['body']
for i in remove_string[::-1]:
df[i] = df['cleaned_txt'].str.extract('({})'.format(i))
df['cleaned_txt'] = df['cleaned_txt'].str.replace(i, '')
df['Removed_string'] = (df[remove_string].notna()
.dot(pd.Index(remove_string) + ',')
.str.strip(','))
df = df.drop(remove_string, axis=1)
print (df)
ID body cleaned_txt Removed_string
0 1 FITrnXS$100 FITXS$100 rn
1 2 $1000rnReason $1000Reason rn
2 3 rnIf If rn
3 4 bevlauedrnrnnext bevlauednext rnrn
4 5 obccrnrnnoncrnrnactionrn obccnoncaction rn,rnrn
5 6 rnrnnotification notification rnrn
6 7 insdrnrnnon insdnon rnrn
7 8 rnrnupdated updated rnrn
8 9 rnreason eason rnr
9 10 rnrnrnLOR LOR rnrnrn
If need replace by whitespace:
remove_string = ['rn', 'rnr', 'rnrn', 'rnrnrn']
df['cleaned_txt'] = df['body']
for i in remove_string[::-1]:
df[i] = df['cleaned_txt'].str.extract('({})'.format(i))
df['cleaned_txt'] = df['cleaned_txt'].str.replace(i, ' ')
df['Removed_string'] = (df[remove_string].notna()
.dot(pd.Index(remove_string) + ',')
.str.strip(','))
df = df.drop(remove_string, axis=1)
print (df)
ID body cleaned_txt Removed_string
0 1 FITrnXS$100 FIT XS$100 rn
1 2 $1000rnReason $1000 Reason rn
2 3 rnIf If rn
3 4 bevlauedrnrnnext bevlaued next rnrn
4 5 obccrnrnnoncrnrnactionrn obcc nonc action rn,rnrn
5 6 rnrnnotification notification rnrn
6 7 insdrnrnnon insd non rnrn
7 8 rnrnupdated updated rnrn
8 9 rnreason eason rnr
9 10 rnrnrnLOR LOR rnrnrn
EDIT1:
#dictioanry for replace
remove_string = {"rn":" ", "rnr":"\n", "rnrn":"\n", "rnrnrn":"\n"}
#sorting by keys for list of tuples
rem = sorted(remove_string.items(), key=lambda s: len(s[0]), reverse=True)
print (rem)
[('rnrnrn', '\n'), ('rnrn', '\n'), ('rnr', '\n'), ('rn', ' ')]
df['cleaned_txt'] = df['body']
for i, j in rem:
df[i] = df['cleaned_txt'].str.extract('({})'.format(i))
df['cleaned_txt'] = df['cleaned_txt'].str.replace(i, j)
cols = list(remove_string.keys())
df['Removed_string'] = (df[cols].notna().dot(pd.Index(cols) + ',')
.str.strip(','))
df = df.drop(remove_string, axis=1)
print (df)
ID body cleaned_txt Removed_string
0 1 FITrnXS$100 FIT XS$100 rn
1 2 $1000rnReason $1000 Reason rn
2 3 rnIf If rn
3 4 bevlauedrnrnnext bevlaued\nnext rnrn
4 5 obccrnrnnoncrnrnactionrn obcc\nnonc\naction rn,rnrn
5 6 rnrnnotification \nnotification rnrn
6 7 insdrnrnnon insd\nnon rnrn
7 8 rnrnupdated \nupdated rnrn
8 9 rnreason \neason rnr
9 10 rnrnrnLOR \nLOR rnrnrn
Upvotes: 3
Reputation: 4199
import pandas as pd
import re
raw_data = {'ID': [1,2,3,4,5,6,7,8,9,10],
'body': ['FITrnXS$100', '$1000rnReason', 'rnIf', 'bevlauedrnrnnext', 'obccrnrnnoncrnrnactionrn', 'rnrnnotification', 'insdrnrnnon', 'rnrnupdated', 'rnreason', 'rnrnrnLOR']}
removed_string =['rn', 'rnr', 'rnrn', 'rnrnrn']
removed_string = removed_string[::-1]
raw_data['Removed_string'] = []
raw_data['cleaned_txt'] = []
for i in raw_data['body']:
j = 0
m = removed_string[j]
while True:
m = removed_string[j]
pattern = re.compile(m)
n = pattern.findall(i)
if len(n) != 0:
raw_data['cleaned_txt'].append(i.replace(m, ' '))
raw_data['Removed_string'].append(m)
break
j += 1
df = pd.DataFrame(raw_data, columns = ['ID', 'body', 'cleaned_txt', 'Removed_string'])
print(df)
Output is
ID body cleaned_txt Removed_string
0 1 FITrnXS$100 FIT XS$100 rn
1 2 $1000rnReason $1000 Reason rn
2 3 rnIf If rn
3 4 bevlauedrnrnnext bevlaued next rnrn
4 5 obccrnrnnoncrnrnactionrn obcc nonc actionrn rnrn
5 6 rnrnnotification notification rnrn
6 7 insdrnrnnon insd non rnrn
7 8 rnrnupdated updated rnrn
8 9 rnreason eason rnr
9 10 rnrnrnLOR LOR rnrnrn
Upvotes: 1