Reputation: 39
I have just coded this:
import os
import pandas as pd
files = os.listdir(path)
#AllData = pd.DataFrame()
for f in files:
info = pd.read_excel(f, "File")
info.fillna(0)
try:
info['Country'] = info['Country'].astype('str')
except ValueError:
continue
try:
info['Name'] = info['Name'].astype('str')
except ValueError:
continue
try:
info['Age'] = info['Age'].astype('int')
except ValueError as error:
continue
writer = pd.ExcelWriter("Output.xlsx")
info.to_excel(writer, "Sheet 1")
writer.save()
It reads some excel files, selects a sheet named "File" and put all its data in a dataframe. Once it is done, it returns all the records.
What I want is to check the types of all the values of each column, and to skip the line in the reading source if the type is not the one I want for this column. Finally I want to record in the output the data that fits the types I want.
I tried to use astype
but that's not working as expected.
Thus, read source - check astype - if not astype - skip line and keep running the code.
Upvotes: 2
Views: 787
Reputation: 1923
I first have to say that type checking and type casting are 2 different things.
Pandas' astype
is used for type casting (it will "convert" a type to another type, it will not check if a value is of certain type) .
But if what you want is to not keep the rows that can't be cast as numeric type, you can do it like this:
info['Age'] = pd.to_numeric(info['Age'], errors='coerce')
info = info.dropna()
Note that you don't have to use a try-except block here.
Here, we use to_numeric
because we can pass errors='coerce'
, so that if it can't be cast, the value will be NaN
, and then we use dropna()
in order to remove rows contaiing NaN
s.
Here I'll add some informations you asked in comment about how to check types in pandas dataframes:
How to get the types infered by pandas for each column?
columns_dtypes = df.dtypes
It will output something like this:
Country object
Name object
Age int64
dtype: object
Note that i your column "Age" contains some Nan
values the dtype
could be float64
.
And when a column contains strings, the dtype
will be object
when you'll load your excel file to a dataframe like in your example.
See below for how to check if an object is a Python string (type str
).
Pandas documentation listing all dtypes: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html?highlight=basics#dtypes
Other useful information about Pandas dtypes : what are all the dtypes that pandas recognizes?
How to check the types of all values of the whole dataframe?
There are numerous ways of doing this.
Here is one way. I choose this code because it's clear and simple:
# Iterate over all the columns
for (column_name, column_data) in info.iteritems():
print("column_name: ", column_name)
print("column_data: ", column_data.values)
# Iterate over all the values of this column
for column_value in column_data.values:
# print the value and its type
print(column_value, type(column_value))
# So here you can check the type and do something with that
# For example, log the error to a log file
Some useful functions for type checkings:
How to test if object
(as returned by df.dtypes
like in the output above) is a string?
isinstance(object_to_test, str)
See: How to find out if a Python object is a string?
Now, if you have a column that contains strings (like "hello", "world", etc.) and some of these strings are int
, and you want to check if these stings represent a number, or a int
you can use these functions:
How to check if a string is an int
?
def str_is_int(s):
try:
int(s)
return True
except ValueError:
return False
How to check if a string is an number?
def str_is_number(s):
try:
float(s)
return True
except ValueError:
return False
Python's strings have a method isdigit()
, but it can't be used to check for int or number, because it will fail with one = "+1"
or minus_one = "-1"
.
And finally, here are 2 common ways to check "types" in Python:
object_to_test = 1
print( type(object_to_test) is int)
print( type(object_to_test) in (int, float) ) # Check is is one of those types
print( isinstance(object_to_test, int) )
isinstance(object_to_test, str)
will return True
if object_to_test
is of type str
OR any sublass of str
.
type(object_to_test) is str
will return True
if object_to_test
is ONLY of type str
(excluding any subclass of str
)
There is also a libray called pandas-stubs
that could be useful for type safety: https://github.com/VirtusLab/pandas-stubs.
Upvotes: 1