pandas split function gives duplicated series?

Question

I have the Titanic dataset, and I want to extract the title from each person's name using the pandas str.split function.

>>> data.Title = data.Name.str.split('[,.]').str.get(1)
>>> data.Title

which results in the following, and it looks fine:

0           Mr
1          Mrs
2         Miss
3          Mrs
4           Mr
5           Mr
6           Mr
7       Master
8          Mrs
...
Name: Name, Length: 1309, dtype: object

It seems like each row has only one string, which is Mr or Mrs or something else. But if I index a single row, it shows this:

>>> data.Name.str.split('[,.]').str.get(1)[0]
0     Mr
0     Mr
Name: Name, dtype: object

I have no idea why this is happening, and I can't filter the dataframe either:

>>> data.Title == 'Mr'
0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
...

jezrael · Accepted Answer

data.Name.str.split('[,.]').str.get(1)[0]

means: select all rows whose index label is 0. If the index contains duplicate labels, more than one row is returned.

So it is necessary to create a unique index:

data = data.reset_index(drop=True)
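
To illustrate the duplicated-index behaviour, here is a minimal sketch. The two sample names and the repeated label 0 are made up for the example (an assumption about how the index in the question came to contain duplicates, e.g. concatenating two frames without resetting the index):

```python
import pandas as pd

# Hypothetical sample: two rows that share the index label 0.
names = pd.Series(
    ["Braund, Mr. Owen Harris", "Kelly, Mr. James"],
    index=[0, 0],
    name="Name",
)

titles = names.str.split("[,.]").str.get(1)

# Label 0 occurs twice, so label-based selection returns a Series with two rows.
dup_result = titles[0]
print(dup_result)

# After resetting the index every label is unique, and [0] returns a single value.
unique_titles = titles.reset_index(drop=True)
first_title = unique_titles[0]
print(repr(first_title))
```

You can check whether your own frame is affected with data.index.is_unique.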

For the second problem, the extracted values contain extra whitespace (the space that follows the comma is kept as a leading space), so it is necessary to remove it with strip:

data['Title'] = data.Name.str.split('[,.]').str.get(1).str.strip()
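
A small sketch of why the comparison fails without strip. The sample names are made up for illustration and are not taken from the asker's data:

```python
import pandas as pd

# Hypothetical sample names in the "Surname, Title. Given names" format.
names = pd.Series(
    ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    name="Name",
)

raw = names.str.split("[,.]").str.get(1)
print(raw.tolist())  # [' Mr', ' Mrs'], each value keeps the space after the comma

# ' Mr' is not equal to 'Mr', so the unstripped comparison is False everywhere.
mask_raw = raw == "Mr"

# Stripping the whitespace makes the comparison behave as expected.
clean = raw.str.strip()
mask_clean = clean == "Mr"
print(mask_clean.tolist())  # [True, False]
```

The padded display of an object Series hides the extra space, which is why the output in the question looks correct at first glance.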

All together:

data = data.reset_index(drop=True)
data['Title'] = data.Name.str.split('[,.]').str.get(1).str.strip()
