Reputation: 83
Hello I have the following table
value | description |
---|---|
2 | Aron |
2 | Aaron |
2 | Aronn |
3 | John |
3 | Jhon |
4 | Liam |
I wanna get the value column as it is but for the same values replace all the values in the description field with the first description for that value. Something like this
value | description |
---|---|
2 | Aron |
2 | Aron |
2 | Aron |
3 | John |
3 | John |
4 | Liam |
Upvotes: 0
Views: 42
Reputation: 2670
I may be way off, but it seems like you are trying to 'fix' first names. If so maybe the code below is food for thought.
It takes a frame like...
value description
0 2 Aron
1 2 Aaron
2 2 Aronn
3 3 John
4 3 Jhon
5 4 Liam
...and outputs...
value description
0 2 Aaron
1 2 Aaron
2 2 Aaron
3 3 John
4 3 John
5 4 Liam
This is based on getting a good list of known first names (OK, first list I found from a quick Google :o) and using fuzzywuzzy to try and match what is already in the column description
. Maybe we'll fix some typos!
Code:
from fuzzywuzzy import fuzz, process
import pandas as pd
import requests #not always required but needed for some sites so doing it this way
# create a dataframe with a column of names in 'description'
df = pd.DataFrame({'value': {0: 2, 1: 2, 2: 2, 3: 3, 4: 3, 5: 4},
'description': {0: 'Aron',
1: 'Aaron',
2: 'Aronn',
3: 'John',
4: 'Jhon',
5: 'Liam'}})
# find a list somewhere on the t'internet of first names (Liam, John, etc. looks like English)
url = r'https://www.britishbabynames.com/blog/data-analysis-and-statistics/'
# this takes about 20 seconds!
page = requests.get(url)
table = pd.read_html(page.text)
# tidy up the table we downloaded
df_bnames = table[0][2:]
df_bnames = df_bnames[1]
df_bnames = df_bnames.dropna()
df_bnames = df_bnames.reset_index(drop=True)
# put the names from the web into a list
source_names = list(df_bnames)
# fuzzywuzzy function - magic kung fu
def match_names(name, list_names, min_score=0):
max_score = -1
max_name = ''
for x in list_names:
score = fuzz.ratio(name, x)
if (score > min_score) & (score > max_score):
max_name = x
max_score = score
return (max_name, max_score)
# from the original frame get the unique names as a list
uniquenames = list(df.description.unique())
# score limit - may need to adjust up or down. Depends what's in the dict below
score = 50
#using fuzzywuzzy compare one name list to another
names = []
for x in uniquenames:
match = match_names(x, source_names, score)
if match[1] >= score:
name = (str(x), str(match[0]))
names.append(name)
# we have a dictionary of {...'Jhon': 'John'...} now.
name_dict = dict(names)
# output the original frame
print(df,'\n')
# swap the values in 'description' with the 'better' names in the dictionary
df = df.replace({'description': name_dict})
# see the changes
print(df)
So the result is...
value description
0 2 Aaron
1 2 Aaron
2 2 Aaron
3 3 John
4 3 John
5 4 Liam
Jhon
is fixed and the Aron
and Aronn
are replaced.
Upvotes: 0
Reputation: 23146
Use duplicated()
and ffill()
:
>>> df.where(~df["value"].duplicated()).ffill()
value description
0 2.0 Aron
1 2.0 Aron
2 2.0 Aron
3 3.0 John
4 3.0 John
5 4.0 Liam
Upvotes: 2
Reputation: 18406
Just group the dataframe by value, then call transform, and pass first
as the function to take the first value in Description
column within the group.
>>> df['description']=df.groupby('value')['description'].transform('first')
value description
0 2 Aron
1 2 Aron
2 2 Aron
3 3 John
4 3 John
5 4 Liam
Upvotes: 1