Pud
Pud

Reputation: 1

How to deal with long names in data cleaning?

I have a users database. I want to separate them into two columns to have user1 and user2. The way I was solving this was to split the names into multiple columns then merge the names to have the two columns of users.

The issue I run into is some names are long and after the split. Those names take some spot on the data frame which makes it harder to merge properly.

Users
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi

After I split the user columns

0 1 2 3 4 5 6 7 8
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi

The expected result is the following

User1 User2
Maria Melinda Del valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi

Upvotes: 0

Views: 76

Answers (1)

Corralien
Corralien

Reputation: 120499

You can use:

def f(sr):
   m = sr.isna().cumsum().loc[lambda x: x < 2]
   return sr.dropna().groupby(m).apply(' '.join)

out = df.apply(f, axis=1).rename(columns=lambda x: f'User{x+1}')

Output:

>>> out
                     User1               User2
0  Maria Melinda Del Valle       Justin Howard
1          Devin Craig Jr.  Michael Carter III
2       Jeanne De Bordeaux             Alhamdi

As suggested by @Barmar, If you know where to put the blank columns in the first split, you should know how to create both columns.

Upvotes: 1

Related Questions