RayX500
RayX500

Reputation: 267

How to flag outliers/anomalies in 10 csv files and export the flagged dataframes to 1 xlsx file with 10 sheets

I want to do the following :

  1. flag anomalies in 10 .csv files in a folder (all 10 files have the same column names A-J).

  2. export the flagged data frames to 1 .xlsx with 10 sheets (append the 10 flagged df to 1 xlsx file)

NOTE: The anomaly here is defined as any value lower than 1500 and higher than 100000 in each row.

To flag one file at a time, the code looks something like this :

import pandas as pd
  
# intialise data of lists
data = {'A':['R1', 'R2', 'R3', 'R4', 'R5'],
         'B':[12005, 18190, 1021, 13301, 31119],
        'C':[11021, 19112, 19021,15, 24509 ],
        'D':[10022,19910, 19113,449999, 25519],
        'E':[14029, 29100, 39022, 24509, 412271],
        'F':[52119,32991,52883,69359,57835],
         'G':[41218, 52991,1021,69152,79355],
         'H': [43211,7672991,56881,211,77342],
          'J': [31211,42901,53818,62158,69325],
        }
  
# Create DataFrame
df = pd.DataFrame(data)
  
# Print the output.
df.describe()

## flagging anomalies in each row 
def flag_outliers(s):
    # force to numeric and coerce string to NaN
    s = pd.to_numeric(s, errors='coerce')
    indexes = (s<1500)|(s>1000000)
    return ['background-color: red' if v else '' for v in indexes]

styled = df.style.apply(flag_outliers, axis=1)

styled.to_excel("flagged_outliers.xlsx", index=False)

An example of a flagged data frame looks like this :

enter image description here

Thanks in advance

Upvotes: 1

Views: 211

Answers (1)

anky
anky

Reputation: 75080

I think you can use a loop here to get the read the csv files and apply the style then save as excel in each sheet, note that I have used the same path for reading the csv's and saving the output excel, you can change the path to output by defining a path2 and using it in the writer = .... variable:

path = "full\\path\\to_folder\\"
import glob
import os

files = glob.glob(path+"\*.csv")

writer = pd.ExcelWriter(os.path.join(path,"combined.xlsx"), engine = 'xlsxwriter')
for file in files:    
    dataframe = pd.read_csv(file)
    (dataframe.style.apply(flag_outliers, axis=1)
     .to_excel(writer, sheet_name = os.path.split(file)[-1][:-4],index=False))
writer.save()

This will save the dataframes in each sheet with the csv filenames.

Upvotes: 1

Related Questions