First off, here is what my .xlsx timeseries data looks like (the column labels are in row 2 and the units in row 3, with the timestamps in the third column):
and here is how I'm reading it:
import time
import numpy as np
import pandas as pd

def loaddata(filepaths):
    t1 = time.clock()
    for i in range(len(filepaths)):
        xl = pd.ExcelFile(filepaths[i])
        df = xl.parse(xl.sheet_names[0], header=0, index_col=2,
                      skiprows=[0, 2, 3, 4], parse_dates=True)
        df = df.dropna(axis=1, how='all')
        df = df.drop(['Decimal Year Day', 'Decimal Year Day.1', 'RECORD'], axis=1)
        # snap each timestamp to the nearest minute
        df.index = pd.DatetimeIndex(
            ((df.index.asi8 / (1e9 * 60)).round() * 1e9 * 60).astype(np.int64)).values
        if i == 0:
            dfs = df
        else:
            dfs = pd.concat([dfs, df], axis=1)
    t2 = time.clock()
    print "Files loaded into dataframe in %s seconds" % (t2 - t1)
    return dfs
files = ["London Lysimeters corrected 5min.xlsx"]
data = loaddata(files)
What I need to be able to do is read the column labels AND units (rows 2 and 3) as well as the values into a pandas dataframe, and be able to access the label and unit rows as lists of strings. I can't figure out how to load both rows 2 and 3 and still have the times read correctly into a pandas DatetimeIndex, though it works fine if I only load the labels. I've also looked everywhere and can't figure out how to get the column headers as a list.
I would appreciate it if anyone could help with either of these issues.
Upvotes: 1
Views: 3702
First of all, get rid of that for i in range(len(filepaths))! The pythonic way is for i, filepath in enumerate(filepaths). enumerate yields (index, element) pairs, so you can write ExcelFile(filepath) instead of ExcelFile(filepaths[i]).
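For example, the body of your loop stays the same; only the indexing disappears (a minimal sketch, reusing the parse call from your question):

for i, filepath in enumerate(filepaths):
    xl = pd.ExcelFile(filepath)  # no more filepaths[i]
    df = xl.parse(xl.sheet_names[0], header=0, index_col=2,
                  skiprows=[0, 2, 3, 4], parse_dates=True)
    dfs = df if i == 0 else pd.concat([dfs, df], axis=1)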
I think your two problems are related. If I'm reading your code correctly, when you include rows 2 and 3 the dates can't be parsed, since the timestamp column is no longer homogeneous: it's not all timestamps.
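To illustrate, here's a hypothetical reconstruction of what that column contains once rows 2 and 3 are kept:

import pandas as pd

col = pd.Series(['Timestamp', 'ts', '2012-08-29 07:10:00', '2012-08-29 09:10:00'])
pd.to_datetime(col)  # fails: 'Timestamp' and 'ts' are a label and a unit, not dates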
You could use a hierarchical index to get the data in (column, label, unit) format. It's probably easiest to first read in just the header information, then read the data separately and set the columns after the fact (I don't have Excel handy right now, but I think all the read_csv options I use are also available through xlrd):
In [7]: df_header = pd.read_csv('test.csv', nrows=2, index_col='three')

In [8]: df_header
Out[8]:
               one      two    four
three
Timestamp  Decimal  Decimal  record
ts             ref      ref      rn

In [9]: df_data = pd.read_csv('test.csv', names=df_header.columns,
   ...:                       skiprows=4, parse_dates=True, index_col=2)

In [10]: df_data
Out[10]:
                      one   two  four
2012-08-29 07:10:00  32.1  32.0   232
2012-08-29 09:10:00   1.1   1.2   233

In [11]: cols = pd.MultiIndex.from_tuples([tuple([x] + df_header[x].tolist())
   ....:                                   for x in df_header])

In [12]: cols
Out[12]:
MultiIndex
[one  Decimal  ref, two  Decimal  ref, four  record  rn]

In [14]: df_data.columns = cols

In [15]: df_data
Out[15]:
                          one      two    four
                      Decimal  Decimal  record
                          ref      ref      rn
2012-08-29 07:10:00      32.1     32.0     232
2012-08-29 09:10:00       1.1      1.2     233
This should get you to the point in your code where you start dropping columns and concatenating. Also take a look at the development docs; it looks like the syntax for reading Excel files is getting cleaned up (much nicer!). You might be able to use the parse_cols argument with a list of ints to avoid dropping columns later.
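Untested, since I don't have Excel here, but it would look something like this (the column indices are made up for illustration):

xl = pd.ExcelFile("London Lysimeters corrected 5min.xlsx")
df = xl.parse(xl.sheet_names[0], header=0, index_col=2, parse_dates=True,
              skiprows=[0, 2, 3, 4],
              parse_cols=[2, 5, 6, 7])  # hypothetical: timestamp column plus the data columns you want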
Oh, and you can get the list of strings with df_data.columns.tolist().
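One caveat: once the columns are a MultiIndex, tolist() gives tuples rather than plain strings. If your pandas has MultiIndex.get_level_values, you can pull a single level out as strings:

df_header.columns.tolist()                    # ['one', 'two', 'four']
df_data.columns.tolist()                      # [('one', 'Decimal', 'ref'), ...]
df_data.columns.get_level_values(1).tolist()  # labels: ['Decimal', 'Decimal', 'record']
df_data.columns.get_level_values(2).tolist()  # units:  ['ref', 'ref', 'rn']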
Upvotes: 1