HM14
HM14

Reputation: 709

Python Pandas, Reading in file and skipping rows ahead of header

I am trying to loop over some files and skip the rows before the header in each file using pandas. All of the files are in the same data format except some have different number of rows to skip before the header. Is there a way to loop over the files and start at the header of each file when some have more rows to skip than others?

For example, some files require this:

f = pd.read_csv(fname,skiprows = 7,parse_dates=[0])

And some require this:

f = pd.read_csv(fname,skiprows = 15, parse_dates=[0])

Here is my chunk of code looping over my files:

for name,ID in stations:
    path = str(ID)+'/*.csv'
    for fname in glob.glob(path):
        print(fname)
        f = pd.read_csv(fname,skiprows=15,parse_dates=[0]) #could also skip 7 depending on file
        ws = f['Wind Spd (km/h)']*0.27778 #convert to m/s from km/h
        dt = f['Date/Time']

Upvotes: 2

Views: 1901

Answers (2)

HM14
HM14

Reputation: 709

The first suggestion/answer seemed like a really good way to handle it but I couldn't get it to work for me for some reason. I did find another way to fix my problem using the try and except funcitons in python:

for name,ID in stations:
    #read in each stations .csv files, concatenate together, insert station id column
    path = str(ID)+'/*.csv'
    for fname in glob.glob(path):
        print(fname)
        try:
            f = pd.read_csv(fname,skiprows=7,parse_dates=[0])
        except:
            f = pd.read_csv(fname,skiprows=15,parse_dates=[0])
        ws = f['Wind Spd (km/h)']*0.27778 #convert to m/s from km/h
        dt = f['Date/Time']

This way if the first attempt to read in the file fails (skipping 7 rows), then it tries again using the other read_csv line (skipping 15 rows). This is not 100% correct since I am still hardcoding the number of lines to skip, but works for my needs right now.

Upvotes: 0

jpp
jpp

Reputation: 164683

One way is to read your file using pure Python I/O to extract the index, then feed this into the skip_rows argument of pd.read_csv.

This is fairly efficient since the first step uses a generator expression which reads only until the desired row is reached.

from io import StringIO
import pandas as pd
from copy import copy

mystr = StringIO("""dasfaf
kgafsda


Date/Time,num1,num2
2018-01-01,0,1
2018-01-02,2,3
""")

mystr2 = copy(mystr)

# replace mystr with open('file.csv', 'r')
with mystr as fin:
    idx = next(i for i, j in enumerate(fin) if j.startswith('Date/Time'))

# replace mystr2 with 'file.csv'
df = pd.read_csv(mystr2, skiprows=idx-1, parse_dates=[0])

print(df)

   Date/Time  num1  num2
0 2018-01-01     0     1
1 2018-01-02     2     3

Wrap this in a function if you need to repeat the task:

def calc_skiprows(fname):
    with fname as fin:
        idx = next(i for i, j in enumerate(fin) if j.startswith('Date/Time')) - 1

df = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])

Upvotes: 1

Related Questions