epifanio

Reputation: 1357

read_csv in pandas - how to use a specific row as header

Trying to parse this file:

https://gist.github.com/anonymous/7714935

which looks like:

metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line 
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line

col1   col2         col3    UTCDate   UTCTime       col6       col7   
                           (m)      (MDY)     (sec)    (weeks)      (MDY)     
string1- string2-        0.000 11/06/2013 313585.10 1765.00000 11/06/2013 
string1- string2-        0.000 11/06/2013 313585.30 1765.00000 11/06/2013 
...

I can read it using code like:

import pandas as pd
import datetime as dt

names = ['col1', 'col2', 'col3', 'UTCDate', 'UTCTime', 'col6', 'col7']

def parse(UTCDate, UTCTime):
    # combine the date column with the seconds-of-day column into one timestamp
    return dt.datetime.strptime(UTCDate, '%m/%d/%Y') + dt.timedelta(seconds=float(UTCTime))

df = pd.read_csv(filename, delimiter=r'\s+', skiprows=25,
                 index_col='date',
                 parse_dates={'date': ['UTCDate', 'UTCTime']},
                 names=names, date_parser=parse)

Is there a way to avoid specifying the "names" list and instead parse the header from the file?

The header is this line:

col1   col2         col3    UTCDate   UTCTime       col6       col7

(line 24 in the gist). I was thinking of using keep_date_col=True, header=0 and passing a list of lines to skip like [arange(0,23), 25], but that didn't work.
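For reference, the attempt described above would look roughly like this. It is only a minimal sketch: the row indices (23 metadata lines, header on line 24, a units line right after it) are taken from the question and may need adjusting against the actual file, and it reuses filename and the parse function from the snippet above.

import numpy as np
import pandas as pd

# skip the metadata block (rows 0-22) and the units line (row 24);
# header=0 then refers to the first remaining line, i.e. the column-name line
skip = list(np.arange(0, 23)) + [24]
df = pd.read_csv(filename, delimiter=r'\s+', skiprows=skip, header=0,
                 parse_dates={'date': ['UTCDate', 'UTCTime']},
                 keep_date_col=True, index_col='date', date_parser=parse)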

Upvotes: 0

Views: 7669

Answers (3)

srhoades10

Reputation: 103

I wouldn't call it overly elegant, but you could generalize this to handle a variable number of rows to skip, with some help from csv:

import csv
import pandas as pd

counter = 0
with open(filename, 'r') as testfile:
    # csv.reader only accepts a single-character delimiter, so split on a plain
    # space here; the whitespace-insensitive parsing happens in read_csv below
    filereader = csv.reader(testfile, delimiter=' ')
    for row in filereader:
        counter += 1  # keep track of the row number
        if 'col1' in row:  # or whatever string marks the header row
            finalcount = counter
            break

Output = pd.read_csv(filename, sep=r'\s+', skiprows=finalcount-1)

Upvotes: 1

Pebermynte Lars

Reputation: 487

Have you tried the header=# arg for read_csv()?

Like:

pd.read_csv(filename, delimiter=r'\s+', skiprows=25, index_col='date',
            parse_dates={'date': ['UTCDate', 'UTCTime']},
            header=26, date_parser=parse)

Docs state:

header : int, list of ints Row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
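To illustrate that last point with a toy, self-contained example (not the OP's file): when skiprows is an integer, header is counted among the lines that remain, so header=0 ends up pointing at the first unskipped line.

import io
import pandas as pd

data = "meta\nmeta\ncol1 col2\n1 2\n3 4\n"

# header counts rows *after* the ones removed by an integer skiprows,
# so header=0 here picks up the 'col1 col2' line, not the first file line
df = pd.read_csv(io.StringIO(data), sep=r'\s+', skiprows=2, header=0)
print(df.columns.tolist())   # ['col1', 'col2']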

Upvotes: 0

pandita

Reputation: 4979

I think your general approach is OK. Where I think it fails, for me at least, is with the delimiter and the Latitude/Longitude columns. Your delimiter is '\s+', but the data in these columns then looks like three columns rather than one:

Latitude        Longitude
41 20 54.57907  -70 38 14.25924

Maybe you could replace every run of two or more spaces with a tab and use the tab as the delimiter. Alternatively, if you only need the first seven columns, simply strip off the remainder before handing the data to pandas.
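A rough sketch of that second suggestion, assuming filename, the column names and the 25 skipped lines from the question (all assumptions, not tested against the actual file):

import io
import pandas as pd

names = ['col1', 'col2', 'col3', 'UTCDate', 'UTCTime', 'col6', 'col7']

# keep only the first seven whitespace-separated fields of each data line, so the
# multi-part Latitude/Longitude values further right can't turn into extra columns
with open(filename) as fh:
    lines = fh.readlines()[25:]   # drop the metadata/header block, as in the question
trimmed = "\n".join(" ".join(line.split()[:7]) for line in lines)

df = pd.read_csv(io.StringIO(trimmed), sep=r'\s+', names=names)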

Upvotes: 1
