kalyan
kalyan

Reputation: 57

How to find data type error in pandas dataframe?

df1:

  product   product_Id   Price
0 Mobile      G67129     4500
1 Earphone    H56438     8900
2 Heater      K12346     fgdht
3 Kitchen     566578     4500
4 4359        Gh1907     5674
5 plastic     G67129     Dfz67

df2:

  Column_Name   Expected_Dtype
0 product          String
1 product_Id       String
2 Price            int

I need to find out the data type error values from df1 and has column datatype information in df2.

Output:

   column_Name  Value  Exp_dtype index
0  product       4359  String    4
1  product_Id   566578 String    3
2  Price       fgdht    int      2
3  Price       Dfz67    int      5

Upvotes: 2

Views: 4542

Answers (2)

vishvas chauhan
vishvas chauhan

Reputation: 199

DataFrame

import pandas as pd
import re
data = {'A':['1','2.0','4b','dog'], 'B':['12','tom','lom','so'],'C':['dog','tom','jerry','55']}
df = pd.DataFrame(data=data)
print(df)

    A   B   C
0   1   12  dog
1   2.0 tom tom
2   4b  lom jerry
3   dog so  55

Now assign the list of column which are string or int/float

int_col = ['A'] #I am considering col A numerical
string_col = ['B','C'] # col B & C as string

Fun Part

num_pattern = "^\d+\.?\d*$" #this identify int and float both
int_float = []
errors_num = []
errors_string = []
for col in df[int_col].columns: #here detecting errors in numerical col
  i = 0
  for cells in df[int_col][col]:
    if re.findall(num_pattern,cells):
      int_float.append(cells)
    if cells not in int_float:
      errors_num.append({"column":col, "errors":cells, "index": i,'correct_datatype': 'float'})
    i += 1

for col in df[string_col].columns: #here detecting errors in string col
  i = 0
  for cells in df[string_col][col]:
    if re.findall(num_pattern,cells):
      errors_string.append({"column":col, "errors":cells, "index": i, 'correct_datatype': 'string'})
    i += 1

Now we have 2 errors list on column data type. We will join the list and print DataFrame which will print Col_name, error, index num, correct Data type

pd.DataFrame(data=errors_string + errors_num)

  column    errors  index   correct_datatype
0   B       12         0    string
1   C       55         3    string
2   A       4b         2    float
3   A       dog        3    float

Upvotes: 1

Sheng Zhuang
Sheng Zhuang

Reputation: 697

As those types mixed up, all being object, I can only think of using str match with regex pattern to pick out error types.

Here is my solution:

find rows with error types first

bad_product = df['product'].loc[df['product'].str.match(r'[0-9.]+')]
bad_product_ID = df.product_Id.loc[df['product_Id'].str.match(r'[0-9.]+')]
bad_price = df.Price.loc[~df['Price'].str.match(r'[0-9.]+')]

join error rows all together

df3 = pd.concat([bad_product,bad_product_ID,bad_price], axis=1).stack().reset_index()
df3.columns = ['index', 'Column_Name', 'value']

merge it with df2

df2.set_index('Column_Name')
df3.set_index('Column_Name')
result = pd.merge(df3, df2, how='left')

result:


  index Column_Name value   Expected_Dtype
0   2   Price       fgdht   int
1   3   product_Id  566578  String
2   4   product     4359    String
3   5   Price       Dfz67   int

when you have no idea to begin with, try to break it down to small task. Hope this would help.

Upvotes: 1

Related Questions