Reputation: 81
Is there a way to automatically find column names in a csv if they arent the first row? the csv in question has a non-header sentence at the top of the document, then the column names and then the data. I know I could try use header = 1 and skiprow = 0 but this presupposes that I knew the top line wasnt the headers or part of the csv data.
I guess it would be hard to make one since pandas would never know what is and what isnt a header?
Upvotes: 1
Views: 7909
Reputation: 1431
Found this in 2020. pandas.read_csv has improved.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
df = pandas.read_csv("blah.csv", skiprows=1)
It's worth looking at the other options for skipping, at least remember they exist.
skipinitialspace=False,
skiprows=None,
skipfooter=0,
skip_blank_lines=True,
hardest lesson I learned and habit to maintain : always check the api documentation.
Upvotes: 5
Reputation: 2203
There is no automatic way built into Pandas, but writing a function to calculate the rows to skip isn't too hard. Something like the following would do the trick:
import pandas as pd
def calculate_skiprows(buffer, sep=','):
pos = buffer.tell()
lines = pd.DataFrame(buffer.readlines(10000))
buffer.seek(pos)
lines['columns'] = lines[0].apply(lambda row: row.count(sep))
return lines[lines['columns'] < lines['columns'].median()].index.values.tolist()
There are some edge cases this could miss, so you'll want to validate your data. Also, I wrote it to work with a IO buffer, but you could easily modify it to work with a filepath... or both.
Example usage follows:
from io import StringIO
WORDY_CSV = StringIO("""This is a CSV with a verbose intro paragraph.
The paragraph spans lines.
And a blank line seperates it with the header.
Date,Steps,Drinks,Hours Sleep
2018-02-01,9988,3,7.25
2018-02-02,12111,2,7.75
2018-02-03,11321,0,8.25
2018-02-04,9111,1,8.0
2018-02-05,9911,3,7.5
2018-02-06,10911,2,7.75
2018-02-07,12987,0,8.5
2018-02-08,7865,2,7.5
""")
wordy = pd.read_csv(WORDY_CSV, skiprows= calculate_skiprows(WORDY_CSV))
print(wordy)
#----------------------------------------------------------------------
# Date Steps Drinks Hours Sleep
# 0 2018-02-01 9988 3 7.25
# 1 2018-02-02 12111 2 7.75
# 2 2018-02-03 11321 0 8.25
# 3 2018-02-04 9111 1 8.00
# 4 2018-02-05 9911 3 7.50
# 5 2018-02-06 10911 2 7.75
# 6 2018-02-07 12987 0 8.50
# 7 2018-02-08 7865 2 7.50
Upvotes: 1