Rafael Fernandes
Rafael Fernandes

Reputation: 39

How to skip the lines of an excel file loaded to a Pandas dataframe if data types are wrong (checking types)

I have just coded this:

import os
import pandas as pd

files = os.listdir(path)

#AllData = pd.DataFrame() 

for f in files:
    info = pd.read_excel(f, "File")
    info.fillna(0)
    try:
        info['Country'] = info['Country'].astype('str')
    except ValueError:
        continue
    try:
        info['Name'] = info['Name'].astype('str')
    except ValueError:
        continue
    try:
        info['Age'] = info['Age'].astype('int')
    except ValueError as error:
        continue
        
    writer = pd.ExcelWriter("Output.xlsx")
    info.to_excel(writer, "Sheet 1")
    writer.save()

It reads some excel files, selects a sheet named "File" and put all its data in a dataframe. Once it is done, it returns all the records.

What I want is to check the types of all the values of each column, and to skip the line in the reading source if the type is not the one I want for this column. Finally I want to record in the output the data that fits the types I want.

I tried to use astype but that's not working as expected.

Thus, read source - check astype - if not astype - skip line and keep running the code.

Upvotes: 2

Views: 787

Answers (1)

Rivers
Rivers

Reputation: 1923

I first have to say that type checking and type casting are 2 different things.

Pandas' astype is used for type casting (it will "convert" a type to another type, it will not check if a value is of certain type) .

But if what you want is to not keep the rows that can't be cast as numeric type, you can do it like this:

info['Age'] = pd.to_numeric(info['Age'], errors='coerce')
info = info.dropna()

Note that you don't have to use a try-except block here. Here, we use to_numeric because we can pass errors='coerce', so that if it can't be cast, the value will be NaN, and then we use dropna() in order to remove rows contaiing NaNs.

Update about type checking:

Here I'll add some informations you asked in comment about how to check types in pandas dataframes:

  • How to get the types inferred by pandas for each column?
  • How to check the types of all values of the whole dataframe?
  • Some useful functions for type checkings
  • Ways to check types in Python

How to get the types infered by pandas for each column?

columns_dtypes = df.dtypes

It will output something like this:

Country     object
Name        object
Age        int64
dtype: object

Note that i your column "Age" contains some Nan values the dtype could be float64.

And when a column contains strings, the dtype will be object when you'll load your excel file to a dataframe like in your example. See below for how to check if an object is a Python string (type str).

Pandas documentation listing all dtypes: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html?highlight=basics#dtypes

Other useful information about Pandas dtypes : what are all the dtypes that pandas recognizes?

How to check the types of all values of the whole dataframe?

There are numerous ways of doing this.

Here is one way. I choose this code because it's clear and simple:

# Iterate over all the columns
for (column_name, column_data) in info.iteritems():
    print("column_name: ", column_name)
    print("column_data: ", column_data.values)

    # Iterate over all the values of this column
    for column_value in column_data.values:
        # print the value and its type
        print(column_value, type(column_value))
        # So here you can check the type and do something with that
        # For example, log the error to a log file

Some useful functions for type checkings:

How to test if object (as returned by df.dtypes like in the output above) is a string? isinstance(object_to_test, str) See: How to find out if a Python object is a string?

Now, if you have a column that contains strings (like "hello", "world", etc.) and some of these strings are int, and you want to check if these stings represent a number, or a int you can use these functions:

How to check if a string is an int?

def str_is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

How to check if a string is an number?

def str_is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

Python's strings have a method isdigit(), but it can't be used to check for int or number, because it will fail with one = "+1" or minus_one = "-1".

And finally, here are 2 common ways to check "types" in Python:

object_to_test = 1

print( type(object_to_test) is int)
print( type(object_to_test) in (int, float) ) # Check is is one of those types

print( isinstance(object_to_test, int) )

isinstance(object_to_test, str) will return True if object_to_test is of type str OR any sublass of str.

type(object_to_test) is str will return True if object_to_test is ONLY of type str (excluding any subclass of str)

There is also a libray called pandas-stubs that could be useful for type safety: https://github.com/VirtusLab/pandas-stubs.

Upvotes: 1

Related Questions