Reputation: 1357
tring to parse this file :
https://gist.github.com/anonymous/7714935
that looks like :
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
metadata line
col1 col2 col3 UTCDate UTCTime col6 col7
(m) (MDY) (sec) (weeks) (MDY)
string1- string2- 0.000 11/06/2013 313585.10 1765.00000 11/06/2013
string1- string2- 0.000 11/06/2013 313585.30 1765.00000 11/06/2013
...
i can read it using a code like :
import pandas as pd
import datetime as dt
names=['col1','col2','col3','UTCDate','UTCTime','col6','col7']
def parse(UTCDate,UTCTime):
return dt.datetime.strptime(UTCDate, '%m/%d/%Y') + dt.timedelta(seconds=float(UTCTime))
df = pd.read_csv(filename, delimiter=r'\s+', skiprows=25, index_col='date', parse_dates={'date':['UTCDate','UTCTime']}, names=names, date_parser=parse)
Is there a way to avoid to specify the "names" in a list, and try to parse the header instead ?
header is the line :
col1 col2 col3 UTCDate UTCTime col6 col7
line 24 in the gist. I was thinking to keep_date_col=True, header=0 and pass a list of lines to skip like [arange(0,23),25] .. but didn't worked.
Upvotes: 0
Views: 7669
Reputation: 103
I wouldn't call it overly elegant, but you could generalize this problem to dealing with a variable number of rows to skip, with some help from csv
import csv
import pandas as pd
counter=0
testfile=open(filename, 'r')
filereader = csv.reader(testfile, delimiter=r'\s+')
for row in filereader :
counter += 1 #Keep track of row number
if 'col1' in row: #Or whatever string you're looking for as the first row
finalcount=counter
Output=pd.read_csv(filename,sep=r'\s+',skiprows=finalcount-1)
Upvotes: 1
Reputation: 487
Have you tried the header=#
arg for read_csv()
?
Like pd.read_csv(filename, delimiter=r'\s+', skiprows=25, index_col='date', parse_dates={'date':['UTCDate','UTCTime']}, header=26, date_parser=parse)
Docs state:
header : int, list of ints Row number(s) to use as the column names, and the start of the data. Defaults to 0 if no
names
passed, otherwiseNone
. Explicitly passheader=0
to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped). Note that this parameter ignores commented lines and empty lines ifskip_blank_lines=True
, so header=0 denotes the first line of data rather than the first line of the file.
Upvotes: 0
Reputation: 4979
I think your general approach is ok. Where I think it fails, for me at least, is with the delimiter and the Longitude
/ Latitude
columns. Your delimiter is '\s+' however the data in these columns then looks like three columns rather than one.
Latitude Longitude
41 20 54.57907 -70 38 14.25924
Maybe you could replace all double whitespaces with a tab and use the tab as a delimiter. Alternatively, if you only need the first seven columns, simply remove the remainder before giving it to the dataframe.
Upvotes: 1