Is it necessary or beneficial to convert pandas column from object to string or int/float type?

Question

I have a pandas DataFrame with two columns:

id    name
011    Peter Parker
022    Warners Brother
101    Bruce Wayne

Currently both of them are of object type.

Say I want to create smaller dataframes by filtering on some conditions:

df_small = df.loc[df['id']=='011']
df_small2 = df.loc[df['name']=='Peter Parker']

I have considered, and seen other people, converting object-type columns into a more specific data type. My question: do I need to do that at all if I can already filter with string comparisons (as above)? What are the benefits of converting the columns to a specific string or int/float type?

sacuL · Accepted Answer

You asked about the benefits of converting away from string or object dtypes. There are at least two I can think of right off the bat. Take the following dataframe for example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'int_col': np.random.randint(0, 10, 10000), 'str_col': np.random.choice(list('1234567980'), 10000)})

>>> df.head()
   int_col str_col
0        7       0
1        0       1
2        1       8
3        6       1
4        6       0

This dataframe has 10000 rows, with one int column and one object (i.e. string) column for illustration.

Memory advantage:

The integer column takes far less memory than the object column, roughly one eighth in this example:

>>> import sys
>>> sys.getsizeof(df['int_col'])
80104
>>> sys.getsizeof(df['str_col'])
660104
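
As a cross-check, pandas can report per-column memory itself. The following is a minimal sketch, assuming the same kind of frame as above; the seed and the names `usage`, `before` and `after` are my own choices for illustration, and the exact byte counts will vary with your pandas version and platform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame({
    'int_col': rng.integers(0, 10, 10000),
    'str_col': rng.choice(list('1234567890'), 10000),
})

# deep=True also counts the string data behind the column,
# not just the fixed-size slots that refer to it
usage = df.memory_usage(deep=True, index=False)
print(usage)

# Converting the digit strings to 64-bit integers shrinks the column
converted = df['str_col'].astype('int64')
before = usage['str_col']
after = converted.memory_usage(deep=True, index=False)
print(before, after)
```

The int64 column needs 8 bytes per row, so `after` is 80000 bytes for 10000 rows, while `before` should come out noticeably larger.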

Speed advantage:

Since your example is about filtering, take a look at the speed difference when filtering on integers instead of strings:

import timeit

def filter_int(df=df):
    return df.loc[df.int_col == 1]


def filter_str(df=df):
    return df.loc[df.str_col == '1']

>>> timeit.timeit(filter_int, number=100) / 100
0.0006298311000864488
>>> timeit.timeit(filter_str, number=100) / 100
0.0016585511100129225

Here the integer filter is roughly 2.6 times faster, a difference that can add up to a significant speedup in some cases.
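
Applied to the frame from the question, the conversion might look like the sketch below. One caveat that I am adding here rather than taking from the timings above: ids such as '011' lose their leading zero when cast to integers, so the cast only makes sense if the ids are really numbers rather than labels. The column name `id_int` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['011', '022', '101'],
    'name': ['Peter Parker', 'Warners Brother', 'Bruce Wayne'],
})

# Cast the digit strings to integers; '011' becomes 11
df['id_int'] = df['id'].astype('int64')

# Filter on the integer column instead of the string column
df_small = df.loc[df['id_int'] == 11]
print(df_small)
```

If the leading zeros matter, keeping `id` as a string column and filtering with `df['id'] == '011'` remains the safer choice.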
