YJZ

Reputation: 4204

Python pandas: upper() does not work on string columns

Hi, I'm working with the Kaggle Titanic data. I use apply(lambda x: x.upper()) to uppercase several columns, but it doesn't work.

I put the data on my Google Drive, and you can download it here.

I tested each column. All of them have dtype object (I think that means str; please correct me if I'm wrong), but some columns raise 'float' object has no attribute 'upper'.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv('train.csv', header=0)

train.loc[:, ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']].dtypes
# Name        object
# Sex         object
# Ticket      object
# Cabin       object
# Embarked    object
# dtype: object

train.loc[:, ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']].apply(lambda x: x.upper())
# does not work

# try each column
train.loc[:, 'Name'].apply(lambda x: x.upper()) # works
train.loc[:, 'Sex'].apply(lambda x: x.upper()) # works
train.loc[:, 'Ticket'].apply(lambda x: x.upper()) # works
train.loc[:, 'Cabin'].apply(lambda x: x.upper()) # AttributeError: 'float' object has no attribute 'upper'
train.loc[:, 'Embarked'].apply(lambda x: x.upper()) # AttributeError: 'float' object has no attribute 'upper'

Any help is appreciated. Thanks!

Upvotes: 2

Views: 5394

Answers (2)

Anton Protopopov

Reputation: 31672

It's because your Cabin and Embarked columns contain NaN values, which are of type float (numpy.nan is a plain Python float). You can check this by applying type to each element:

In [355]: train.Cabin.apply(lambda x: type(x))[:10]
Out[355]:
0    <class 'float'>
1      <class 'str'>
2    <class 'float'>
3      <class 'str'>
4    <class 'float'>
5    <class 'float'>
6      <class 'str'>
7    <class 'float'>
8    <class 'float'>
9    <class 'float'>
Name: Cabin, dtype: object
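
The same check can be reproduced without the Titanic file. The following is only a minimal sketch: the small cabin Series is a made-up stand-in for the real column, and dtype=object is passed explicitly so that it behaves like the column shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'Cabin' column: strings mixed with missing values.
cabin = pd.Series(["C85", np.nan, "C123", np.nan], dtype=object, name="Cabin")

# An object column can hold any Python object, not only str.
print(cabin.dtype)

# Count the Python types actually stored in the column.
type_counts = cabin.map(lambda x: type(x).__name__).value_counts()
print(type_counts)

# isna() flags exactly the entries on which x.upper() would fail.
missing_mask = cabin.isna()
print(missing_mask.tolist())
```

So dtype object only says "arbitrary Python objects"; it does not guarantee that every element is a str.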

So you could use str.upper, which handles NaN by default. Or you could use fillna to replace the NaN values with the empty string '', which has an upper method, and then use your lambda function:

In [363]: train.Cabin.fillna('').apply(lambda x: x.upper())[:5]
Out[363]:
0
1     C85
2
3    C123
4
Name: Cabin, dtype: object

In [365]: train.Cabin.str.upper()[:5]
Out[365]:
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

Or, if you'd like to keep NaN as a string, you could fillna with the string 'NaN':

In [369]: train.Cabin.fillna('NaN').apply(lambda x: x.upper())[:5]
Out[369]:
0     NAN
1     C85
2     NAN
3    C123
4     NAN
Name: Cabin, dtype: object
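
If you would rather keep the lambda style and leave the missing values untouched, another option is to guard the call with a type check. This is only a sketch: the helper name upper_if_str and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for train.Cabin.
cabin = pd.Series(["c85", np.nan, "c123"], dtype=object, name="Cabin")

def upper_if_str(value):
    """Uppercase strings; return anything else (such as NaN) unchanged."""
    return value.upper() if isinstance(value, str) else value

result = cabin.apply(upper_if_str)
print(result)
```

Unlike fillna('NaN'), this keeps the missing entries as real NaN values, so isna() and dropna() still work on the result.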

Upvotes: 5

paljenczy

Reputation: 4899

Missing values are present in those columns. They are represented by numpy.nan, which is a float. If you use .str.upper() instead of .apply(lambda x: x.upper()), pandas skips the missing values, leaves them as NaN, and does not raise an error.
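
For several columns at once, as in the question, one possible approach is to combine DataFrame.apply with the .str accessor. This is a rough sketch on made-up data: the small train frame and the text_cols list are assumptions, not the real Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic frame, with missing values.
train = pd.DataFrame(
    {
        "Sex": ["male", "female", "female"],
        "Cabin": [np.nan, "c85", np.nan],
        "Embarked": ["s", "c", np.nan],
    },
    dtype=object,
)

text_cols = ["Sex", "Cabin", "Embarked"]

# DataFrame.apply passes each column to the function as a Series,
# so the .str accessor is available inside the lambda.
upper = train[text_cols].apply(lambda col: col.str.upper())
print(upper)
```

This also shows why the multi-column call in the question fails: inside DataFrame.apply, x is a whole Series, which has no upper() method of its own.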

Upvotes: 1
