Student

Reputation: 1197

How to appropriately test Pandas dtypes within dataframes?

Objective: create a function that matches a DataFrame's dtypes to a predefined data type scenario.

Description: I want to classify given datasets into predefined scenario types based on the dtypes of their columns.

Below are two example datasets (df_a and df_b). df_a has only dtypes that are equal to 'object' while df_b has both 'object' and 'int64':

# scenario_a
data_a = [['tom', 'blue'], ['nick', 'green'], ['julia', 'red']]
df_a = pd.DataFrame(data_a, columns = ['Name','Color'])
df_a['Color'] = df_a['Color'].astype('object')

# scenario_b
data_b = [['tom', 10], ['nick', 15], ['julia', 14]]
df_b = pd.DataFrame(data_b, columns = ['Name', 'Age'])

I want to be able to determine automatically which scenario it is based on a function:

import pandas as pd
import numpy as np

def scenario(data):
    if data.dtypes.str.contains('object'):
        return scenario_a
    if data.dtypes.str.contatin('object', 'int64'):
        return scenario_b

Above is what I have so far, but it isn't giving the results I was hoping for.

When I call scenario(df_a), I expect the result to be scenario_a, and when I pass df_b, the function should correctly determine that it is scenario_b.

Any help would be appreciated.

Upvotes: 1

Views: 44

Answers (1)

Chris Adams

Reputation: 18647

Here is one approach. Create a dict, scenarios, whose keys are sorted tuples of the unique dtypes in each predefined scenario and whose values are whatever you want the function to return.

Using your example, something like:

import pandas as pd

# scenario_a
data_a = [['tom', 'blue'], ['nick', 'green'], ['julia', 'red']]  
df_a = pd.DataFrame(data_a, columns = ['Name','Color']) 
df_a['Color'] = df_a['Color'].astype('object')

# scenario_b
data_b = [['tom', 10], ['nick', 15], ['julia', 14]]  
df_b = pd.DataFrame(data_b, columns = ['Name', 'Age'])

scenario_a = tuple(sorted(df_a.dtypes.unique()))
scenario_b = tuple(sorted(df_b.dtypes.unique()))

scenarios = {
    scenario_a: 'scenario_a',
    scenario_b: 'scenario_b'
}

print(scenarios)

# scenarios:
# {(dtype('O'),): 'scenario_a', (dtype('int64'), dtype('O')): 'scenario_b'}

def scenario(data):
    dtypes = tuple(sorted(data.dtypes.unique()))
    return scenarios.get(dtypes, None)

scenario(df_a)
# 'scenario_a'

scenario(df_b)
# 'scenario_b'
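
One caveat with the tuple keys: sorted() relies on comparing dtype objects, which NumPy defines in terms of casting rather than a strict ordering, so it may not behave as expected for every mix of dtypes. Below is a minimal sketch of a variant that keys on a frozenset of dtype names instead, which needs no sorting. The names dtype_key, SCENARIOS and classify are my own for illustration, and the frames are built with explicit dtypes (an assumption of this sketch) so the result doesn't depend on how your pandas version infers string columns:

```python
import pandas as pd

def dtype_key(df):
    """Return the set of dtype names present in df (hypothetical helper)."""
    return frozenset(str(dtype) for dtype in df.dtypes)

# Explicit dtypes (an assumption for this sketch) keep the example stable
# across pandas versions that infer string columns differently.
df_a = pd.DataFrame(
    {'Name': ['tom', 'nick', 'julia'], 'Color': ['blue', 'green', 'red']},
    dtype=object,
)
df_b = pd.DataFrame({
    'Name': pd.Series(['tom', 'nick', 'julia'], dtype=object),
    'Age': pd.Series([10, 15, 14], dtype='int64'),
})

# Map each predefined set of dtype names to a scenario label.
SCENARIOS = {
    frozenset({'object'}): 'scenario_a',
    frozenset({'object', 'int64'}): 'scenario_b',
}

def classify(df):
    """Return the scenario label for df, or None if no scenario matches."""
    return SCENARIOS.get(dtype_key(df))

print(classify(df_a))
print(classify(df_b))
```

A frame whose set of dtypes isn't registered, such as one with only float64 columns, returns None, the same as the dict lookup above.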

Upvotes: 1
