Reputation: 267
Python newbie here, I will like to flag sporadic numbers that are obviously off from the rest of the row. In simple terms, flag numbers that seem not to belong to each row. Numbers in 100s and 100000s are considered 'off the rest'
import pandas as pd
# intialise data of lists.
data = {'A':['R1', 'R2', 'R3', 'R4', 'R5'],
'B':[12005, 18190, 1021, 13301, 31119,],
'C':[11021, 19112, 19021,15, 24509 ],
'D':[10022,19910, 19113,449999, 25519],
'E':[14029, 29100, 39022, 24509, 412271],
'F':[52119,32991,52883,69359,57835],
'G':[41218, 52991,1021,69152,79355],
'H': [43211,7672991,56881,211,77342],
'J': [31211,42901,53818,62158,69325],
}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df.describe()
I am trying to do something exactly like this
# I need help with step 1
#my code/pseudocode
# step 1: identify the values in each row that are don't belong to the group
# step 2: flag the identified values and export to excel
style_df = .applymap(lambda x: "background-color: yellow" if x else "") # flags the values that meets the criteria
with pd.ExcelWriter("flagged_data.xlsx", engine="openpyxl") as writer:
df.style.apply(lambda x: style_df, axis=None).to_excel(writer,index=False)
Upvotes: 0
Views: 595
Reputation: 5836
I used two conditions here one to check less than 1000 and another one for greater than 99999. Based on this condition, the code will highlight outliers in red color.
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_conditional.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Add a format. Light red fill with dark red text.
format1 = workbook.add_format({'bg_color': '#FFC7CE',
'font_color': '#9C0006'})
first_row = 1
first_col = 2
last_row = len(df)
last_col = 9
worksheet.conditional_format(first_row, first_col, last_row, last_col,
{'type': 'cell',
'criteria': '<',
'value': 1000,
'format': format1})
worksheet.conditional_format(first_row, first_col, last_row, last_col,
{'type': 'cell',
'criteria': '>',
'value': 99999,
'format': format1})
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Upvotes: 2
Reputation: 2249
If you don't need to use machine learning outliers detection or Hampel filter and you already know the limits of your filter, you can simply do
def higlight_outliers(s):
# force to numeric and coerce string to NaN
s = pd.to_numeric(s, errors='coerce')
indexes = (s<1500)|(s>1000000)
return ['background-color: yellow' if v else '' for v in indexes]
styled = df.style.apply(higlight_outliers, axis=1)
styled.to_excel("flagged_data.xlsx", index=False)
Upvotes: 1
Reputation: 44
I guess you could define a little better what you consider "off from the rest". This is very important when working with data.
Do you want to flag the outliers of your column B distribution for example? You could simply do a calculation of quartiles for your distributions and append those to a dict of some kind, those which are either below the lowest quartile or over the highest quartile. But you obviously would need more than those 5 rows you showed.
There are whole fields dedicated to identification of outliers using machine learning as well. The assumptions you are taking to define what should be considered "off from the rest" are very important.
Read this if you'd like more info on specifics of outlier detection: https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Upvotes: 0