Dennis
Dennis

Reputation: 81

pandas read csv where csv's col names are in second row?

Is there a way to automatically find column names in a csv if they arent the first row? the csv in question has a non-header sentence at the top of the document, then the column names and then the data. I know I could try use header = 1 and skiprow = 0 but this presupposes that I knew the top line wasnt the headers or part of the csv data.

I guess it would be hard to make one since pandas would never know what is and what isnt a header?

Upvotes: 1

Views: 7909

Answers (2)

CodingMatters
CodingMatters

Reputation: 1431

Found this in 2020. pandas.read_csv has improved.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

df = pandas.read_csv("blah.csv", skiprows=1)

It's worth looking at the other options for skipping, at least remember they exist.

skipinitialspace=False, 
skiprows=None, 
skipfooter=0, 
skip_blank_lines=True,

hardest lesson I learned and habit to maintain : always check the api documentation.

Upvotes: 5

Ryan
Ryan

Reputation: 2203

There is no automatic way built into Pandas, but writing a function to calculate the rows to skip isn't too hard. Something like the following would do the trick:

import pandas as pd

def calculate_skiprows(buffer, sep=','):
    pos = buffer.tell()
    lines = pd.DataFrame(buffer.readlines(10000))
    buffer.seek(pos)

    lines['columns'] = lines[0].apply(lambda row: row.count(sep))
    return lines[lines['columns'] < lines['columns'].median()].index.values.tolist()

There are some edge cases this could miss, so you'll want to validate your data. Also, I wrote it to work with a IO buffer, but you could easily modify it to work with a filepath... or both.

Example usage follows:

from io import StringIO

WORDY_CSV = StringIO("""This is a CSV with a verbose intro paragraph.
The paragraph spans lines.
And a blank line seperates it with the header.

Date,Steps,Drinks,Hours Sleep
2018-02-01,9988,3,7.25
2018-02-02,12111,2,7.75
2018-02-03,11321,0,8.25
2018-02-04,9111,1,8.0
2018-02-05,9911,3,7.5
2018-02-06,10911,2,7.75
2018-02-07,12987,0,8.5
2018-02-08,7865,2,7.5
""")

wordy = pd.read_csv(WORDY_CSV, skiprows= calculate_skiprows(WORDY_CSV))
print(wordy)
#----------------------------------------------------------------------
#          Date  Steps  Drinks  Hours Sleep
# 0  2018-02-01   9988       3         7.25
# 1  2018-02-02  12111       2         7.75
# 2  2018-02-03  11321       0         8.25
# 3  2018-02-04   9111       1         8.00
# 4  2018-02-05   9911       3         7.50
# 5  2018-02-06  10911       2         7.75
# 6  2018-02-07  12987       0         8.50
# 7  2018-02-08   7865       2         7.50

Upvotes: 1

Related Questions