AlSub

Reputation: 1155

Remove typos from a dictionary of dataframes

I am trying to remove specific typos from a dictionary of dataframes, which looks like this:

import pandas as pd

data = {
    'dataframe_1': pd.DataFrame({'col1': ['John', 'Ashley'], 'col2': ['+10', '-1']}),
    'dataframe_2': pd.DataFrame({'col3': ['Italy', 'Brazil', 'Japan'], 'col4': ['Milan', 'Rio do Jaineiro', 'Tokio'], 'percentage': ['+95%', '≤0%', '80%+']}),
}

The function remove_typos() is meant to strip those characters, but when I apply it I get back a single concatenated dataframe with the typos still in place.

def remove_typos(string):
    # remove '+' and '≤'
    string = string.replace('+', '')
    string = string.replace('≤', '')
    return string

# store remove_typos() output in a dictionary of dataframes 
cleaned_df = pd.concat(data.values()).pipe(remove_typos)

Console Output:

#     col1 col2    col3             col4 percentage
#0    John  +10     NaN              NaN        NaN
#1  Ashley   -1     NaN              NaN        NaN
#0     NaN  NaN   Italy            Milan       +95%
#1     NaN  NaN  Brazil  Rio do Jaineiro        ≤0%
#2     NaN  NaN   Japan            Tokio       80%+

The goal is a cleaned dictionary in which each key still maps to its own dataframe:

data['dataframe_1']

#     col1 col2
#0    John   10
#1  Ashley   -1

Is there another way to apply this function to a dict of dataframes?

Upvotes: 1

Views: 230

Answers (2)

anky

Reputation: 75080

There is no harm in looping over a dictionary (as opposed to looping over a dataframe):

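data1 = {}
for k, v in data.items():
    # only the object (string) columns can contain the typos
    v1 = v.select_dtypes("O")
    # overwrite those columns with their cleaned versions
    # (applymap is deprecated since pandas 2.1; DataFrame.map is its replacement)
    v = v.assign(**v1.applymap(remove_typos))
    data1[k] = v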

print(data1)

{'dataframe_1':      col1 col2
0    John   10
1  Ashley   -1, 'dataframe_2':      col3             col4 percentage
0   Italy            Milan        95%
1  Brazil  Rio do Jaineiro         0%
2   Japan            Tokio        80%}
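
As a hedged variant of the loop above: `applymap` was deprecated in pandas 2.1, so a sketch that avoids it can map the function over each column with `Series.map` instead. The helper name `clean_frames` and the `isinstance` guard in `remove_typos` are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

def remove_typos(value):
    # Assumption: pass non-string cells (numbers, NaN) through unchanged
    if not isinstance(value, str):
        return value
    return value.replace('+', '').replace('≤', '')

def clean_frames(frames):
    # Hypothetical helper: builds a new dict and leaves the input frames untouched.
    # Series.map applies remove_typos to every element of each column.
    return {
        key: df.apply(lambda col: col.map(remove_typos))
        for key, df in frames.items()
    }

data = {
    'dataframe_1': pd.DataFrame({'col1': ['John', 'Ashley'], 'col2': ['+10', '-1']}),
    'dataframe_2': pd.DataFrame({'col3': ['Italy', 'Brazil', 'Japan'],
                                 'col4': ['Milan', 'Rio do Jaineiro', 'Tokio'],
                                 'percentage': ['+95%', '≤0%', '80%+']}),
}

cleaned = clean_frames(data)
print(cleaned['dataframe_1'])
```

Because the helper returns a new dictionary, the original `data` stays available for comparison.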

Upvotes: 2

Shubham Sharma

Reputation: 71689

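We can replace the values inside a dict comprehension, using DataFrame.replace with regex=True so that the characters are removed wherever they occur within a cell: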

data = {k: v.replace([r'\+', '≤'], '', regex=True) for k, v in data.items()}

>>> data['dataframe_1']

     col1 col2
0    John   10
1  Ashley   -1

>>> data['dataframe_2']

     col3             col4 percentage
0   Italy            Milan        95%
1  Brazil  Rio do Jaineiro         0%
2   Japan            Tokio        80%
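
One detail worth spelling out, as a small sketch rather than a definitive diagnosis: without `regex=True`, `DataFrame.replace` only replaces cells whose entire value equals the target. That is consistent with the unchanged output in the question, where `pipe` handed the whole concatenated dataframe to `remove_typos`. The one-column sample frame and the character-class pattern `[+≤]` below are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'percentage': ['+95%', '≤0%', '80%+']})

# Whole-cell matching: no cell is exactly '+', so nothing changes
whole_cell = df.replace('+', '')

# Regex matching: the pattern is substituted wherever it occurs inside a cell
substring = df.replace(r'[+≤]', '', regex=True)

print(whole_cell['percentage'].tolist())  # ['+95%', '≤0%', '80%+']
print(substring['percentage'].tolist())   # ['95%', '0%', '80%']
```

Inside a character class the `+` needs no escaping, which is why the pattern can be written as `[+≤]` instead of listing `r'\+'` and `'≤'` separately.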

Upvotes: 3
