the_reaper
the_reaper

Reputation: 83

Replacing values in a pandas dataframe with values from other column with a certain condition

Hello I have the following table

value description
2 Aron
2 Aaron
2 Aronn
3 John
3 Jhon
4 Liam

I wanna get the value column as it is but for the same values replace all the values in the description field with the first description for that value. Something like this

value description
2 Aron
2 Aron
2 Aron
3 John
3 John
4 Liam

Upvotes: 0

Views: 42

Answers (3)

MDR
MDR

Reputation: 2670

I may be way off, but it seems like you are trying to 'fix' first names. If so maybe the code below is food for thought.

It takes a frame like...

   value description
0      2        Aron
1      2       Aaron
2      2       Aronn
3      3        John
4      3        Jhon
5      4        Liam

...and outputs...

   value description
0      2       Aaron
1      2       Aaron
2      2       Aaron
3      3        John
4      3        John
5      4        Liam

This is based on getting a good list of known first names (OK, first list I found from a quick Google :o) and using fuzzywuzzy to try and match what is already in the column description. Maybe we'll fix some typos!

Code:

from fuzzywuzzy import fuzz, process
import pandas as pd
import requests #not always required but needed for some sites so doing it this way

# create a dataframe with a column of names in 'description'
df = pd.DataFrame({'value': {0: 2, 1: 2, 2: 2, 3: 3, 4: 3, 5: 4},
 'description': {0: 'Aron',
  1: 'Aaron',
  2: 'Aronn',
  3: 'John',
  4: 'Jhon',
  5: 'Liam'}})

# find a list somewhere on the t'internet of first names (Liam, John, etc. looks like English)
url = r'https://www.britishbabynames.com/blog/data-analysis-and-statistics/'

# this takes about 20 seconds!
page = requests.get(url)
table = pd.read_html(page.text)

# tidy up the table we downloaded
df_bnames = table[0][2:]
df_bnames = df_bnames[1]
df_bnames = df_bnames.dropna()
df_bnames = df_bnames.reset_index(drop=True)

# put the names from the web into a list
source_names = list(df_bnames)

# fuzzywuzzy function - magic kung fu
def match_names(name, list_names, min_score=0):
    max_score = -1
    max_name = ''
    for x in list_names:
        score = fuzz.ratio(name, x)
        if (score > min_score) & (score > max_score):
            max_name = x
            max_score = score
    return (max_name, max_score)

# from the original frame get the unique names as a list
uniquenames = list(df.description.unique())

# score limit - may need to adjust up or down.  Depends what's in the dict below
score = 50

#using fuzzywuzzy compare one name list to another
names = []
for x in uniquenames:
    match = match_names(x, source_names, score)
    if match[1] >= score:
        name = (str(x), str(match[0]))
        names.append(name)

# we have a dictionary of {...'Jhon': 'John'...} now.  
name_dict = dict(names)

# output the original frame
print(df,'\n')

# swap the values in 'description' with the 'better' names in the dictionary
df = df.replace({'description': name_dict})

# see the changes
print(df)

So the result is...

   value description
0      2       Aaron
1      2       Aaron
2      2       Aaron
3      3        John
4      3        John
5      4        Liam

Jhon is fixed and the Aron and Aronn are replaced.

Upvotes: 0

not_speshal
not_speshal

Reputation: 23146

Use duplicated() and ffill():

>>> df.where(~df["value"].duplicated()).ffill()
   value description
0    2.0        Aron
1    2.0        Aron
2    2.0        Aron
3    3.0        John
4    3.0        John
5    4.0        Liam

Upvotes: 2

ThePyGuy
ThePyGuy

Reputation: 18406

Just group the dataframe by value, then call transform, and pass first as the function to take the first value in Description column within the group.

>>> df['description']=df.groupby('value')['description'].transform('first')

   value description
0      2        Aron
1      2        Aron
2      2        Aron
3      3        John
4      3        John
5      4        Liam

Upvotes: 1

Related Questions