Reputation: 755
I am trying to collect data from different .csv files that share the same column names. However, some .csv files have their headers located in different rows.
Is there a way to determine the header row dynamically, based on the first row that contains the "most" values (the actual header names)?
I tried the following:
import pandas as pd

def process_file(file, path, col_source, col_target):
    global df_master
    print(file)
    df = pd.read_csv(path + file, encoding="ISO-8859-1", header=None)
    df = df.dropna(thresh=2)  # drop rows with fewer than 2 non-NaN values, e.g. metadata
    df.columns = df.iloc[0, :].values
    df = df.drop(df.index[0])
However, when using pandas.read_csv(), it seems that the number of fields in the very first row determines the expected width of the dataframe, as I receive the following error message:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 162
As you can see, in this case the header row is located in row 4.
When adding error_bad_lines=False to read_csv, only the metadata is read into the dataframe.
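For illustration, one way to sidestep the tokenizing error is to hand read_csv an explicit names list that is wide enough for the real data, so the expected width no longer comes from the first row (the 200-column cap below is an arbitrary assumption):

import pandas as pd

# Assumption: no file has more than 200 columns
df = pd.read_csv(path + file, encoding="ISO-8859-1",
                 header=None, names=range(200))
df = df.dropna(axis=1, how="all")  # drop the all-NaN padding columns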
The files can have either the structure of:
a "Normal" File:
row1 col1 col2 col3 col4 col5
row2 val1 val1 val1 val1 val1
row3 val2 val2 val2 val2 val2
row4
or a structure with metadata before the header:
row1 metadata1
row2 metadata2
row3 col1 col2 col3 col4 col5
row4 val1 val1 val1 val1 val1
Any help much appreciated!
Upvotes: 9
Views: 10462
Reputation: 21
A better way is to search for where the delimited data starts using CSV sniffing; the first line that contains the expected delimiter is treated as the header row.
import csv
import pandas as pd

expected_delimiter = ","
count = 0
sniffer = csv.Sniffer()

# path is the location of the CSV file
with open(path, "r") as f:
    while True:
        line = f.readline()
        count = count + 1
        # Breaking the loop if the file reaches EOF
        if not line:
            break
        try:
            dialect = sniffer.sniff(line)
        except csv.Error:
            # No delimiter detectable on this line (e.g. metadata), keep scanning
            continue
        # Breaking the loop if the expected delimiter is found
        if dialect.delimiter == expected_delimiter:
            break

skiprows = count - 1
csv_data = pd.read_csv(path, sep=expected_delimiter, skiprows=skiprows,
                       encoding="ISO-8859-1")
Upvotes: 1
Reputation: 41
This is what I did. It doesn't give you false positives like other things I tried. You basically don't want empty records in your row.
Now iterate through it, e.g. wrapped in a function that returns the index of the first complete row:
import pandas as pd

def find_header_row(df):
    # Return the index of the first row that has no empty (NaN) values
    if df is not False:
        ind = 0
        for index, row in df.iterrows():
            s = pd.Series(row)
            if s.isnull().values.any():
                ind += 1
            else:
                return ind
        return False  # every row had empty values
    else:
        return False
Remake your dataframe, but pass it the header index returned from the above + 1. So if it returns ind = 5, you pass 6.
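For context, a minimal sketch of that second pass, assuming the first read used pandas' default header (so the dataframe index is offset by one from the file rows, which is where the + 1 comes from) and reusing find_header_row from above; data.csv is a made-up file name:

import pandas as pd

fname = "data.csv"  # hypothetical file name

df = pd.read_csv(fname, encoding="ISO-8859-1")  # first pass, default header
ind = find_header_row(df)

if ind is not False:
    # df row ind sits one file row below the consumed default header,
    # so the real header is file row ind + 1
    df = pd.read_csv(fname, header=ind + 1, encoding="ISO-8859-1")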
Upvotes: 0
Reputation: 148965
IMHO the simplest way is to forget pandas for a while: concatenate all the lines, starting from the true header line, into a single string (let us call it buffer), and then use pd.read_csv(io.StringIO(buffer), ...).
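A minimal sketch of that idea, assuming comma-delimited files and reusing the question's heuristic of taking the first line with the most fields as the header (read_with_dynamic_header is a made-up helper name):

import io
import pandas as pd

def read_with_dynamic_header(path, encoding="ISO-8859-1"):
    with open(path, encoding=encoding) as f:
        lines = f.readlines()
    # max() returns the first line with the highest comma count -> the header
    header_idx = max(range(len(lines)), key=lambda i: lines[i].count(","))
    buffer = "".join(lines[header_idx:])  # header line onward, as one string
    return pd.read_csv(io.StringIO(buffer))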
Upvotes: 3
Reputation: 455
A bit dirty, but this works. Basically it tries to read the file while skipping top rows, from 0 up to the length of the whole file. As soon as pandas can parse the remainder as a CSV, it returns the dataframe. Adapt custom_csv to your needs.
import pandas as pd

def file_len(fname):
    # Count the number of lines in the file
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def custom_csv(fname):
    _file_len = file_len(fname)
    # Try skipping 0, 1, 2, ... top rows until pandas can parse the file
    for i in range(_file_len):
        try:
            df = pd.read_csv(fname, skiprows=i)
            return df
        except Exception:
            print(i)
    return None

print(custom_csv('pollution.csv'))
Upvotes: 2