louis delamarre

Reputation: 9

How to rename badly typed student names in a dataframe column based on a reference list

We have students answer MCQs after each lesson on Socrative. They enter their name first, then answer. For each lesson, we collect the data from the Socrative platform, but we have issues normalizing the names: entries such as 'John Doe', 'johndoe' or 'John,Doe' should all be transformed into 'doe', as it is written in our main file.

Our main file for tracking students (loaded as a pandas dataframe in Python) initially has just 1 column, the name (as a string, e.g. 'doe' for Mr. John Doe).

I'd like to write a function that goes through the 'name' column of my lesson1 dataframe and, for each value, replaces the badly typed name with the reference name.

To lower the case, trim extra spaces and remove punctuation, I've used the following code:

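import re

# Lower-case the names and trim surrounding whitespace
lesson1["name"] = lesson1["name"].str.lower()
lesson1["name"] = lesson1["name"].str.strip()
# Remove every character that is not a letter or a digit (spaces, commas, ...)
lesson1["name"] = lesson1["name"].apply(lambda x: re.sub('[^A-Za-z0-9]+', '', x))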

Then I want to replace the 'name' values with the reference name where necessary. I've tried the following code on the 2 columns:

bad=lesson1['name'] 
good=reference['name']


def changenames(lesson_list, reference_list):
    for i,name in enumerate(lesson_list):
        for j,ref in enumerate(reference_list):
            if ref in name:
                lesson_list[i]=ref

changenames(bad,good)

but 1/ it does not work because of a SettingWithCopyWarning, and 2/ I can't manage to apply it to a column of the dataframe.

Could you help me? Thanks, L.

Upvotes: 0

Views: 59

Answers (1)

louis delamarre

Reputation: 9

I've found a way.

I've 2 dataframes:
- the reference_list dataframe, with the names of the students. It has a column 'name'.
- the lesson dataframe, with the names as the students type them when they answer the MCQs (not standardized) and the answers to the MCQs.

To transform the names of the students in the lesson dataframe, based on the well-typed names in reference_list['name'], I have used:

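for i in lesson['name']:
    for ref in reference_list['name']:
        # substring test: does the typed name contain the reference name?
        if ref in i:
            # overwrite every row holding this typed name with the reference name
            lesson.loc[lesson['name'] == i, 'name'] = ref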

and it works fine. After that, you can apply functions to handle duplicates, merge data...
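For what it's worth, the same substring matching can probably be written without the nested .loc assignments, by building the whole cleaned column and assigning it in one go. This is only a minimal sketch: the function name standardize_names, the sample data and the "longest reference name first" rule are my own assumptions, not part of the original code.

```python
import pandas as pd


def standardize_names(names: pd.Series, reference: pd.Series) -> pd.Series:
    """Return a new Series where each cleaned entry that contains a
    reference name is replaced by that reference name (hypothetical helper)."""
    # Same cleaning as in the question: lower-case, keep only letters and digits.
    cleaned = names.str.lower().str.replace(r"[^a-z0-9]+", "", regex=True)
    # Assumption: try longer reference names first, so 'doering' wins over 'doe'.
    # Empty reference strings are skipped because they would match everything.
    refs = sorted((r for r in reference.dropna().unique() if r), key=len, reverse=True)

    def first_match(name):
        for ref in refs:
            if ref in name:
                return ref
        return name  # no reference name found: keep the cleaned value

    # na_action="ignore" leaves missing names as NaN instead of raising.
    return cleaned.map(first_match, na_action="ignore")


# Made-up sample data, only for illustration
reference_list = pd.DataFrame({"name": ["doe", "smith"]})
lesson = pd.DataFrame({
    "name": ["John Doe", " johndoe", "John,Doe", "A. Smith", "unknown"],
    "q1": ["a", "b", "a", "c", "d"],
})

# Assign the whole column at once instead of element by element
lesson["name"] = standardize_names(lesson["name"], reference_list["name"])
print(lesson["name"].tolist())  # ['doe', 'doe', 'doe', 'smith', 'unknown']
```

If lesson was itself sliced from another dataframe, taking lesson = lesson.copy() first should avoid the SettingWithCopyWarning. Note that the regex, like the one in the question, also strips accented letters, so names with accents would need a different pattern.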

I've found help in this thread: "Replace single value in a pandas dataframe, when index is not known and values in column are unique".

Hope it'll help some of you. Louis

Upvotes: 0
