Maeaex1

Reputation: 755

Automatically determine header row when reading csv in pandas

I am trying to collect data from different .csv files that share the same column names. However, some csv files have their headers located in different rows.

Is there a way to determine the header row dynamically based on the first row that contains "most" values (the actual header names)?

I tried the following:

import pandas as pd

def process_file(file, path, col_source, col_target):
    global df_master
    print(file)
    df = pd.read_csv(path + file, encoding="ISO-8859-1", header=None)
    df = df.dropna(thresh=2)  # drop rows with fewer than 2 non-NaN values, e.g. metadata
    df.columns = df.iloc[0, :].values
    df = df.drop(df.index[0])

However, when using pandas.read_csv(), it seems that the number of fields in the very first line determines the expected width of the dataframe, as I receive the following error message:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 162

As you can see, in this case the header row is located in row 4. When adding error_bad_lines=False to read_csv, only the metadata is read into the dataframe.
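(For reference: the tokenizing error itself can be sidestepped by passing explicit column names, so that the parser does not infer the field count from the first line. A minimal sketch, assuming the 162 fields reported in the error message; the actual header detection problem remains:)

import pandas as pd

# Pad short metadata lines with NaN instead of raising ParserError;
# the width of 162 is taken from the error message above
df = pd.read_csv(path + file, encoding="ISO-8859-1",
                 header=None, names=range(162))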

The files can have either the structure of:

a "Normal" File:

row1    col1   col2    col3    col4   col5   
row2    val1   val1    val1    val1   val1
row3    val2   val2    val2    val2   val2   
row4

or a structure with meta data before header:

row1   metadata1    
row2   metadata2
row3   col1   col2    col3    col4   col5
row4   val1   val1    val1    val1   val1

Any help much appreciated!

Upvotes: 9

Views: 10462

Answers (4)

Rajesh Kumar

Reputation: 21

A better way is to use csv sniffing to search for the line where the delimited data starts; that line is the CSV column header, and everything above it can be skipped.

import csv
import pandas as pd

expected_delimiter = ","
count = 0

# path points to the csv file to read
with open(path, "r") as f:
    sniffer = csv.Sniffer()
    while True:
        line = f.readline()
        count = count + 1
        # Break the loop if the file reaches EOF first
        if not line:
            break
        try:
            dialect = sniffer.sniff(line)
        except csv.Error:
            # Sniffer cannot detect a dialect here: treat the line as metadata
            continue
        # Break the loop once a line with the expected delimiter is found
        if dialect.delimiter == expected_delimiter:
            break

skiprows = count - 1
csv_data = pd.read_csv(path, sep=expected_delimiter,
                       skiprows=skiprows, encoding="ISO-8859-1")

Upvotes: 1

I_Tried

Reputation: 41

This is what I did. It doesn't give you false positives like other approaches I tried. The idea is that the header row should contain no empty records, so the header is the first row with no empty values.

  1. Create your dataframe (df below) and read it with a header index of 0.
  2. Now iterate through it to find the first row with no empty values (wrapped in a function here so the return statements work):

    def find_header_index(df):
        # Return the index of the first row without empty cells,
        # or False if no such row exists
        if df is not False:
            ind = 0
            for _, row in df.iterrows():
                if row.isnull().values.any():
                    ind += 1      # row still has empty cells; keep counting
                else:
                    return ind    # first complete row: this is the header
            return False          # no complete row anywhere in the file
        else:
            return False
    
  3. Remake your dataframe, but pass it the header index returned from the above + 1. So if it returns ind = 5, you pass 6, as in the sketch below.
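A short usage sketch of the three steps (the file name is hypothetical, and find_header_index is the wrapper defined in step 2):

df = pd.read_csv("data.csv", header=0)            # step 1
ind = find_header_index(df)                       # step 2
if ind is not False:
    df = pd.read_csv("data.csv", header=ind + 1)  # step 3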

Upvotes: 0

Serge Ballesta

Reputation: 148965

IMHO the simplest way is to forget pandas for a while:

  • you open the file as a text file for reading
  • you start parsing it line by line, guessing whether the line is
    • metadata header
    • the true header line
    • data lines

A simple way is to concatenate all the lines starting from the true header line into a single string (let us call it buffer), and then use pd.read_csv(io.StringIO(buffer), ...)
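A minimal sketch of that idea (the looks_like_header heuristic is an assumption for illustration: a line is treated as the true header once it splits into at least two non-empty fields):

import io
import pandas as pd

def looks_like_header(line, sep=",", min_fields=2):
    # Assumed heuristic: a header line has several non-empty fields
    fields = [f.strip() for f in line.split(sep)]
    return sum(1 for f in fields if f) >= min_fields

def read_with_detected_header(path, sep=","):
    with open(path, encoding="ISO-8859-1") as f:
        lines = f.readlines()
    # Index of the first line that looks like the true header
    start = next((i for i, line in enumerate(lines)
                  if looks_like_header(line, sep)), 0)
    # Concatenate the header line and everything after it into one buffer
    buffer = "".join(lines[start:])
    return pd.read_csv(io.StringIO(buffer), sep=sep)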

Upvotes: 3

Jano

Reputation: 455

A bit dirty, but it works. The idea is to try reading the file while skipping 0, 1, 2, ... top rows, up to the length of the file; as soon as a read succeeds, the resulting dataframe is returned. Adapt custom_csv to your needs.

import pandas as pd

def file_len(fname):
    # Count the number of lines in the file
    with open(fname) as f:
        return sum(1 for _ in f)

def custom_csv(fname):
    # Skip 0, 1, 2, ... top rows until the file parses cleanly
    for i in range(file_len(fname)):
        try:
            return pd.read_csv(fname, skiprows=i)
        except pd.errors.ParserError:
            print(i)
    return None

print(custom_csv('pollution.csv'))

Upvotes: 2
