Reputation: 41
I am loading a csv file into a pandas dataframe, editing it and exporting it as a csv again (and repeating that with many different csv's over time). Sometimes the dataframes contain float values where there should be integers, which means my export cannot be used in the end.
In the example below, col_1 contains only integers (and only 0's and 1's). col_2 also contains only 0's and 1's, but as floats. col_3 contains all different kinds of floats (e.g. here 5.4 and 1.0).
no   col_1  col_2  col_3
1    1      0.0    5.4
2    0      1.0    1.0
...  ...    ...    ...
What I want to do is check, for each column, whether it contains only floats that end with ".0" and, if so, turn all values of that column into integers (in this example, turn the values of col_2 into integers). All other columns that contain "true" floats should stay as they are.
What would be the quickest/easiest way to do this?
Upvotes: 1
Views: 1847
Reputation: 8508
One way to do this is to take each value in the column modulo 1. If the result is 0, the value is a whole number; if all values in the column give 0, the whole column can safely be converted to integers. With that, you can do the following:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6],
                   'col2': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   'col3': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1]})
print(df.info())

# Convert a column to int only when every value is a whole number
for col in df.columns:
    if all(df[col] % 1 == 0):
        df[col] = df[col].astype(int)

print(df.info())
Sample df is:
   col1  col2  col3
0     1   1.0   1.1
1     2   2.0   2.1
2     3   3.0   3.1
3     4   4.0   4.1
4     5   5.0   5.1
5     6   6.0   6.1
Info gives us:
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col1    6 non-null      int64
 1   col2    6 non-null      float64
 2   col3    6 non-null      float64
After the check and conversion, the dtypes are:
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col1    6 non-null      int64
 1   col2    6 non-null      int64
 2   col3    6 non-null      float64
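One caveat the snippet above does not cover: if a float column contains NaN, astype(int) raises, because NaN has no integer representation. A minimal sketch of a workaround, assuming a pandas version (0.24 or later) that has the nullable 'Int64' dtype, is to run the whole-number check on the non-null values only and cast to 'Int64':

import pandas as pd

s = pd.Series([1.0, 2.0, None])   # a whole-number column with a missing value
if (s.dropna() % 1 == 0).all():
    s = s.astype('Int64')         # nullable integer dtype keeps the missing value as <NA>
print(s.dtype)                    # Int64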
Since you want to run this check only on float and int columns, you can add a dtype condition before checking the mod of 1.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6],
                   'col2': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   'col3': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
                   'col4': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col5': [True, True, True, False, False, False],
                   'col6': pd.date_range('2021.01.01', '2021.01.06').tolist()})
print(df)
print(df.info())

# Only touch numeric (int/float) columns, then apply the whole-number check
for col in df.columns:
    if df[col].dtype.kind in 'if' and all(df[col] % 1 == 0):
        df[col] = df[col].astype(int)

print(df.info())
dtype.kind can be used to check the kind of a column's dtype: 'i' for int, 'f' for float, 'b' for bool, 'O' for object and 'M' for datetime. More details about numpy.dtype.kind are available in the numpy documentation.
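As a small illustration of those kind codes (my sketch, not part of the answer's code):

import pandas as pd

print(pd.Series([1, 2]).dtype.kind)       # i
print(pd.Series([1.0, 2.0]).dtype.kind)   # f
print(pd.Series(['a', 'b']).dtype.kind)   # O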
The output of the full script above will be:
   col1  col2  col3 col4   col5       col6
0     1   1.0   1.1    a   True 2021-01-01
1     2   2.0   2.1    b   True 2021-01-02
2     3   3.0   3.1    c   True 2021-01-03
3     4   4.0   4.1    d  False 2021-01-04
4     5   5.0   5.1    e  False 2021-01-05
5     6   6.0   6.1    f  False 2021-01-06
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col1    6 non-null      int64
 1   col2    6 non-null      float64
 2   col3    6 non-null      float64
 3   col4    6 non-null      object
 4   col5    6 non-null      bool
 5   col6    6 non-null      datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 374.0+ bytes
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col1    6 non-null      int64
 1   col2    6 non-null      int64
 2   col3    6 non-null      float64
 3   col4    6 non-null      object
 4   col5    6 non-null      bool
 5   col6    6 non-null      datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 374.0+ bytes
None
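Since you repeat this for many csv's, it may also help to wrap the check into a small function. A sketch of that idea (downcast_whole_floats and the file names are placeholders I made up, not part of the answer):

import pandas as pd

def downcast_whole_floats(df):
    # Convert float columns whose values are all whole numbers to int
    for col in df.columns:
        if df[col].dtype.kind == 'f' and (df[col] % 1 == 0).all():
            df[col] = df[col].astype(int)
    return df

df = downcast_whole_floats(pd.read_csv('input.csv'))   # 'input.csv' is a placeholder
df.to_csv('output.csv', index=False)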
Upvotes: 1
Reputation: 28644
A quick way is to iterate through the dataframe's columns, check whether each is a float column equal to its integer version, and finally concatenate:
import pandas as pd
from pandas.api.types import is_float_dtype

# Cast a column to int when it is float dtype and equal to its int version
columns = [col.astype(int)
           if is_float_dtype(col) and col.eq(col.astype(int)).all()
           else col
           for _, col in df.items()]
pd.concat(columns, axis=1)
   no  col_1  col_2  col_3
0   1      1      0    5.4
1   2      0      1    1.0
There may be a better way though. So, let's wait for more answers from the community.
Dumping into numpy gives a bit more speed:
import numpy as np

# Columns where any value is not a whole number keep their float dtype
arr = df.to_numpy()
filters = df.columns[np.any(arr != arr.astype(int), axis=0)]
df.astype({col: int for col in df if col not in filters})
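One assumption baked into the numpy version (my caveat, not the answer's): arr.astype(int) requires every column to be numeric, so a mixed frame with string or datetime columns would break it. A sketch that restricts the check to the float columns first could look like:

import numpy as np
import pandas as pd

num = df.select_dtypes('float')                      # only look at float columns
arr = num.to_numpy()
whole = num.columns[np.all(arr % 1 == 0, axis=0)]    # float columns holding only whole numbers
df = df.astype({col: int for col in whole})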
Upvotes: 1