Reputation: 57
df1:
product product_Id Price
0 Mobile G67129 4500
1 Earphone H56438 8900
2 Heater K12346 fgdht
3 Kitchen 566578 4500
4 4359 Gh1907 5674
5 plastic G67129 Dfz67
df2:
Column_Name Expected_Dtype
0 product String
1 product_Id String
2 Price int
I need to find out the data type error values from df1 and has column datatype information in df2.
Output:
column_Name Value Exp_dtype index
0 product 4359 String 4
1 product_Id 566578 String 3
2 Price fgdht int 2
3 Price Dfz67 int 5
Upvotes: 2
Views: 4542
Reputation: 199
DataFrame
import pandas as pd
import re
data = {'A':['1','2.0','4b','dog'], 'B':['12','tom','lom','so'],'C':['dog','tom','jerry','55']}
df = pd.DataFrame(data=data)
print(df)
A B C
0 1 12 dog
1 2.0 tom tom
2 4b lom jerry
3 dog so 55
Now assign the list of column which are string or int/float
int_col = ['A'] #I am considering col A numerical
string_col = ['B','C'] # col B & C as string
Fun Part
num_pattern = "^\d+\.?\d*$" #this identify int and float both
int_float = []
errors_num = []
errors_string = []
for col in df[int_col].columns: #here detecting errors in numerical col
i = 0
for cells in df[int_col][col]:
if re.findall(num_pattern,cells):
int_float.append(cells)
if cells not in int_float:
errors_num.append({"column":col, "errors":cells, "index": i,'correct_datatype': 'float'})
i += 1
for col in df[string_col].columns: #here detecting errors in string col
i = 0
for cells in df[string_col][col]:
if re.findall(num_pattern,cells):
errors_string.append({"column":col, "errors":cells, "index": i, 'correct_datatype': 'string'})
i += 1
Now we have 2 errors list on column data type. We will join the list
and print DataFrame which will print Col_name
, error
, index num
, correct Data type
pd.DataFrame(data=errors_string + errors_num)
column errors index correct_datatype
0 B 12 0 string
1 C 55 3 string
2 A 4b 2 float
3 A dog 3 float
Upvotes: 1
Reputation: 697
As those types mixed up, all being object, I can only think of using str match with regex pattern to pick out error types.
Here is my solution:
find rows with error types first
bad_product = df['product'].loc[df['product'].str.match(r'[0-9.]+')]
bad_product_ID = df.product_Id.loc[df['product_Id'].str.match(r'[0-9.]+')]
bad_price = df.Price.loc[~df['Price'].str.match(r'[0-9.]+')]
join error rows all together
df3 = pd.concat([bad_product,bad_product_ID,bad_price], axis=1).stack().reset_index()
df3.columns = ['index', 'Column_Name', 'value']
merge it with df2
df2.set_index('Column_Name')
df3.set_index('Column_Name')
result = pd.merge(df3, df2, how='left')
result:
index Column_Name value Expected_Dtype
0 2 Price fgdht int
1 3 product_Id 566578 String
2 4 product 4359 String
3 5 Price Dfz67 int
when you have no idea to begin with, try to break it down to small task. Hope this would help.
Upvotes: 1