Schoguan
Schoguan

Reputation: 41

How to check for each column in a pandas dataframe whether it is a float or can be transformed into an integer

I am loading a csv file into a pandas dataframe, edit them and export them as a csv again (and repeat that with many different csv's over time). Sometimes, the dataframes contain float values where there should be integers which leads to the issue that my export can not be used in the end.

In the example below, col_1 contains only integers (and only 0's and 1's). col_2 also only contains 0's and 1's but as floats. col_3 contains all different kind of floats (e.g. here a 5.4 and 1.0).

no          col_1       col_2       col_3
1           1           0.0         5.4
2           0           1.0         1.0
...         ...         ...         ...

What I want to do is for each column to check whether it contains floats that only end with ".0" and turn all values of these columns into "integer" columns (in this example, turn values of col_2 into integers). All other columns that contain "true" floats should stay as they are.

What would be the quickest/easiest way to do this?

Upvotes: 1

Views: 1847

Answers (2)

Joe Ferndz
Joe Ferndz

Reputation: 8508

One way to do this will be to divide the value in the column by 1. if the mod returns 0, then its an integer. If all the values of the mod is 0, then the column is an integer. With that, you can do the following to convert the column to integers.

Check if dataframe contains only float & ints

import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,5,6],
                    'col2':[1.0,2.0,3.0,4.0,5.0,6.0],
                    'col3':[1.1,2.1,3.1,4.1,5.1,6.1]})

print (df.info())

for col in df.columns:
    if all(df[col]%1==0): df[col] = df[col].astype(int)

print (df.info())

Sample df is:

   col1  col2  col3
0     1   1.0   1.1
1     2   2.0   2.1
2     3   3.0   3.1
3     4   4.0   4.1
4     5   5.0   5.1
5     6   6.0   6.1

Info gives us:

Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    6 non-null      int64  
 1   col2    6 non-null      float64
 2   col3    6 non-null      float64

After the check and conversion, the dtypes are:

 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    6 non-null      int64  
 1   col2    6 non-null      int64  
 2   col3    6 non-null      float64

Check if dataframe contains varying dtypes

Since you want to test this only for floats and ints, you can add the condition before you check for mod of 1.

import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,5,6],
                    'col2':[1.0,2.0,3.0,4.0,5.0,6.0],
                    'col3':[1.1,2.1,3.1,4.1,5.1,6.1],
                    'col4':['a','b','c','d','e','f'],
                    'col5':[True,True,True,False,False,False],
                    'col6':pd.date_range('2021.01.01', '2021.01.06').tolist()
                })

print (df)

print (df.info())

for col in df.columns:
    if df[col].dtype.kind in ('if') and all(df[col]%1==0): df[col] = df[col].astype(int)

print (df.info())

dtype.kind can be used to check i for int, f for float, b for bool, O for object and M for datetime. More details about numpy.dtype.kind available in the link here.

The output of this will be:

col1  col2  col3 col4   col5       col6
0     1   1.0   1.1    a   True 2021-01-01
1     2   2.0   2.1    b   True 2021-01-02
2     3   3.0   3.1    c   True 2021-01-03
3     4   4.0   4.1    d  False 2021-01-04
4     5   5.0   5.1    e  False 2021-01-05
5     6   6.0   6.1    f  False 2021-01-06
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   col1    6 non-null      int64         
 1   col2    6 non-null      float64       
 2   col3    6 non-null      float64       
 3   col4    6 non-null      object        
 4   col5    6 non-null      bool          
 5   col6    6 non-null      datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 374.0+ bytes
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   col1    6 non-null      int64         
 1   col2    6 non-null      int64         
 2   col3    6 non-null      float64       
 3   col4    6 non-null      object        
 4   col5    6 non-null      bool          
 5   col6    6 non-null      datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 374.0+ bytes
None

Upvotes: 1

sammywemmy
sammywemmy

Reputation: 28644

A quick way is to iterate through the dataframe, check if it is a float, compare to the integer version and finally concatenate:

 from pandas.api.types import is_float_dtype

columns = [col.astype(int) 
           if is_float_dtype(col) 
           and (col.eq(col.astype(int)).all()) 
           else col
           for _, col in df.items()]

pd.concat(columns, axis=1)


    no  col_1   col_2   col_3
0   1   1   0   5.4
1   2   0   1   1.0

There may be a better way though. So, let's wait for more answers from the community.

Dumping into numpy gives a bit more speed:

arr = df.to_numpy()
filters = df.columns[np.any(arr != arr.astype(int), axis=0)]
df.astype({col: int 
           for col in df 
           if col not in filters})

Upvotes: 1

Related Questions