Reputation: 4723
I have a pandas dataframe, df, with two columns:
id name
011 Peter Parker
022 Warners Brother
101 Bruce Wayne
Currently both of them are of object type.
Say I want to create smaller dataframes by filtering on some conditions:
df_small = df.loc[df['id']=='011']
df_small2 = df.loc[df['name']=='Peter Parker']
I have seen people convert object-type columns to more specific data types, and have considered doing so myself. My question: do I need to do that at all if I can already filter with string comparisons (as above)? What are the benefits of converting them to a specific string or int/float type?
Upvotes: 2
Views: 907
Reputation: 51335
You asked about the benefits of converting from string or object dtypes. There are at least two I can think of right off the bat. Take the following dataframe for example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'int_col': np.random.randint(0, 10, 10000), 'str_col': np.random.choice(list('1234567980'), 10000)})
>>> df.head()
   int_col str_col
0        7       0
1        0       1
2        1       8
3        6       1
4        6       0
This dataframe has 10,000 rows, with one int column and one object (i.e. string) column for demonstration.
The integer column takes a lot less memory than the object column:
>>> import sys
>>> sys.getsizeof(df['int_col'])
80104
>>> sys.getsizeof(df['str_col'])
660104
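As a cross-check, the same comparison can be made with pandas' own memory_usage(deep=True), which counts the Python string objects behind an object column. This is only a minimal sketch: the seeded generator and the variable names are my own choices for illustration, and the exact byte count for the string column depends on your Python and pandas versions.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable (an illustrative choice).
rng = np.random.default_rng(0)

df = pd.DataFrame({
    "int_col": rng.integers(0, 10, 10000),
    "str_col": rng.choice(list("1234567890"), 10000),
})
# Force object dtype so the column matches the situation in the question.
df["str_col"] = df["str_col"].astype(object)

# deep=True includes the memory held by each Python string object;
# index=False leaves the index out of the totals.
mem = df.memory_usage(deep=True, index=False)
int_bytes = int(mem["int_col"])
str_bytes = int(mem["str_col"])

print("int_col bytes:", int_bytes)
print("str_col bytes:", str_bytes)
```

The integer column comes to 8 bytes per row (int64), while the object column pays for a pointer plus a full Python string object per row.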
Since your example is about filtering, take a look at the speed difference when filtering on integers instead of strings:
import timeit
def filter_int(df=df):
    return df.loc[df.int_col == 1]
def filter_str(df=df):
    return df.loc[df.str_col == '1']
>>> timeit.timeit(filter_int, number=100) / 100
0.0006298311000864488
>>> timeit.timeit(filter_str, number=100) / 100
0.0016585511100129225
A difference like this (filtering on strings took roughly 2.6 times as long here) can add up to a significant speedup in code that filters large dataframes or filters often.
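Applied to the dataframe in the question, a conversion might look like the sketch below. It assumes every id is a purely numeric string. Note that converting to int drops leading zeros, so '011' becomes 11 and the filter value has to change accordingly; if the leading zeros matter, keeping the column as a string may be the better choice. The name df_converted is just for illustration.

```python
import pandas as pd

# The dataframe from the question, with both columns stored as strings.
df = pd.DataFrame({
    "id": ["011", "022", "101"],
    "name": ["Peter Parker", "Warners Brother", "Bruce Wayne"],
})

# Work on a copy so the original string ids stay available.
df_converted = df.copy()

# Assumption: all ids are numeric strings. Leading zeros are lost here.
df_converted["id"] = df_converted["id"].astype(int)

# The filter now compares against an integer instead of a string.
df_small = df_converted.loc[df_converted["id"] == 11]

print(df_converted["id"].tolist())
print(df_small)
```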
Upvotes: 3