Reputation: 8009

pandas: to_numeric for multiple columns

I'm working with the following df:

c.sort_values('2005', ascending=False).head(3)
      GeoName ComponentName     IndustryId IndustryClassification Description                                2004 2005  2006  2007  2008  2009 2010 2011 2012 2013 2014
37926 Alabama Real GDP by state 9          213                    Support activities for mining              99   98    117   117   115   87   96   95   103  102  (NA)
37951 Alabama Real GDP by state 34         42                     Wholesale trade                            9898 10613 10952 11034 11075 9722 9765 9703 9600 9884 10199
37932 Alabama Real GDP by state 15         327                    Nonmetallic mineral products manufacturing 980  968   940   1084  861   724  714  701  589  641  (NA)

I want to force numeric on all of the years:

c['2014'] = pd.to_numeric(c['2014'], errors='coerce')

is there an easy way to do this or do I have to type them all out?

Upvotes: 120

Answers (8)

rasputin

Reputation: 322

I did some benchmarking of different solutions. The fastest solutions won't coerce errors to NaN values.

import pandas as pd
import numpy as np
import timeit
from platform import python_version
print("Python version: ", python_version())
print("pandas version: ", pd.__version__)
print("numpy version: ", np.__version__)

np.random.seed(24)
N = 100000
clean = pd.DataFrame({'A':np.random.choice([1,2,3,4,5], size=N),
                      'B':np.random.choice([1,4,9,6,5,11,22,33,44,55,66,77,88,99,3.21], size=N),
                      'C':np.random.choice([1.1,2.2,3.3,4.4], size=N),
                      'D':np.random.choice([7,0,8], size=N)})

cols = clean.columns
for col in cols:
    clean[col] =clean[col].astype(str)

dirty = clean.copy()
newdf = clean.copy()
for col in cols:
    dirty.loc[dirty.sample(frac=0.05).index, col] = ""
    dirty.loc[dirty.sample(frac=0.03).index, col] = "spam"
    dirty.loc[dirty.sample(frac=0.03).index, col] = "eggs"


d1 = """
for col in cols:
    newdf[col] = pd.to_numeric(dirty[col], errors='coerce')
"""

d2 = """
newdf[cols] = dirty[cols].apply(pd.to_numeric, errors='coerce')
"""

d3 = """
newdf[cols] = dirty[cols].apply(pd.to_numeric, raw=True, errors='coerce')
"""

d4 = """
newdf[cols] = dirty[cols].transform(pd.to_numeric, errors='coerce')
"""

d5 = """
newdf[cols] = pd.to_numeric(dirty[cols].stack(), errors='coerce').unstack()
"""

c1 = """
newdf[cols] = clean[cols].astype(np.float64)
"""

c2 = """
newdf[cols] = clean[cols].values.astype(np.float64)
"""

num = 100
for i in [d1, d2, d3, d4, d5, c1, c2]:
    print(round(timeit.timeit(i, number=num, globals=globals()), 3))

Here's the output:

Python version:  3.10.13
pandas version:  2.0.3
numpy version:  1.25.2
22.585  # d1
23.992  # d2
24.029  # d3
23.914  # d4
34.124  # d5
8.072   # c1
9.846   # c2

Upvotes: 0

Richard Kopcke

Reputation: 51

Consider applying the .astype() or the .convert_dtype() method: The first line specifies a list of the columns in the dataframe, data_df, whose dtypes match those specified in include=. The second line converts the dtype of the "slice" of the dataframe specified by this list of columns to a different dtype. In the third line, a list of column names, columns_mdy, specifies the "slice" of the df to be converted from objects (here, strings that contain only digit characters) to 'int16' types. These methods are simple, and intuitive expressions.

See the following references. The side bar on these pages contains links to useful methods for other tasks.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html

float64_cols = list(data_df.select_dtypes(include='float64').columns)

data_df[float64_cols] = data_df[float64_cols].astype('float32')

data_df[columns_mdy] = data_df[columns_mdy].astype('int16')

Upvotes: 1

Joselin Ceron

Reputation: 502

df.loc[:,'col':] = df.loc[:,'col':].apply(pd.to_numeric, errors = 'coerce')

Upvotes: 0

Ali Siddiq

Reputation: 131

df[cols] = pd.to_numeric(df[cols].stack(), errors='coerce').unstack()

Upvotes: 9

Jeremy

Reputation: 107

If you are looking for a range of columns, you can try this:

df.iloc[7:] = df.iloc[7:].astype(float)

The examples above will convert type to be float, for all the columns begin with the 7th to the end. You of course can use different type or different range.

I think this is useful when you have a big range of columns to convert and a lot of rows. It doesn't make you go over each row by yourself - I believe numpy do it more efficiently.

This is useful only if you know that all the required columns contain numbers only - it will not change "bad values" (like string) to be NaN for you.

Upvotes: 3

MaxU - stand with Ukraine

Reputation: 210832

UPDATE: you don't need to convert your values afterwards, you can do it on-the-fly when reading your CSV:

In [165]: df=pd.read_csv(url, index_col=0, na_values=['(NA)']).fillna(0)

In [166]: df.dtypes
Out[166]:
GeoName                    object
ComponentName              object
IndustryId                  int64
IndustryClassification     object
Description                object
2004                        int64
2005                        int64
2006                        int64
2007                        int64
2008                        int64
2009                        int64
2010                        int64
2011                        int64
2012                        int64
2013                        int64
2014                      float64
dtype: object

If you need to convert multiple columns to numeric dtypes - use the following technique:

Sample source DF:

In [271]: df
Out[271]:
     id    a  b  c  d  e    f
0  id_3  AAA  6  3  5  8    1
1  id_9    3  7  5  7  3  BBB
2  id_7    4  2  3  5  4    2
3  id_0    7  3  5  7  9    4
4  id_0    2  4  6  4  0    2

In [272]: df.dtypes
Out[272]:
id    object
a     object
b      int64
c      int64
d      int64
e      int64
f     object
dtype: object

Converting selected columns to numeric dtypes:

In [273]: cols = df.columns.drop('id')

In [274]: df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

In [275]: df
Out[275]:
     id    a  b  c  d  e    f
0  id_3  NaN  6  3  5  8  1.0
1  id_9  3.0  7  5  7  3  NaN
2  id_7  4.0  2  3  5  4  2.0
3  id_0  7.0  3  5  7  9  4.0
4  id_0  2.0  4  6  4  0  2.0

In [276]: df.dtypes
Out[276]:
id     object
a     float64
b       int64
c       int64
d       int64
e       int64
f     float64
dtype: object

PS if you want to select all string (object) columns use the following simple trick:

cols = df.columns[df.dtypes.eq('object')]

Upvotes: 159

muon

Reputation: 14037

another way is using apply, one liner:

cols = ['col1', 'col2', 'col3']
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)

Upvotes: 108

jezrael

Reputation: 862511

You can use:

print df.columns[5:]
Index([u'2004', u'2005', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011',
       u'2012', u'2013', u'2014'],
      dtype='object')

for col in  df.columns[5:]:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print df
       GeoName      ComponentName  IndustryId  IndustryClassification  \
37926  Alabama  Real GDP by state           9                     213   
37951  Alabama  Real GDP by state          34                      42   
37932  Alabama  Real GDP by state          15                     327   

                                      Description  2004   2005   2006   2007  \
37926               Support activities for mining    99     98    117    117   
37951                            Wholesale  trade  9898  10613  10952  11034   
37932  Nonmetallic mineral products manufacturing   980    968    940   1084   

        2008  2009  2010  2011  2012  2013     2014  
37926    115    87    96    95   103   102      NaN  
37951  11075  9722  9765  9703  9600  9884  10199.0  
37932    861   724   714   701   589   641      NaN

Another solution with filter:

print df.filter(like='20')
       2004   2005   2006   2007   2008  2009  2010  2011  2012  2013   2014
37926    99     98    117    117    115    87    96    95   103   102   (NA)
37951  9898  10613  10952  11034  11075  9722  9765  9703  9600  9884  10199
37932   980    968    940   1084    861   724   714   701   589   641   (NA)

for col in  df.filter(like='20').columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
print df
       GeoName      ComponentName  IndustryId  IndustryClassification  \
37926  Alabama  Real GDP by state           9                     213   
37951  Alabama  Real GDP by state          34                      42   
37932  Alabama  Real GDP by state          15                     327   

                                      Description  2004   2005   2006   2007  \
37926               Support activities for mining    99     98    117    117   
37951                            Wholesale  trade  9898  10613  10952  11034   
37932  Nonmetallic mineral products manufacturing   980    968    940   1084   

        2008  2009  2010  2011  2012  2013     2014  
37926    115    87    96    95   103   102      NaN  
37951  11075  9722  9765  9703  9600  9884  10199.0  
37932    861   724   714   701   589   641      NaN

Upvotes: 14

pandas: to_numeric for multiple columns

Answers (8)

Related Questions