First off, here is what my .xlsx timeseries data looks like (the column labels are in row 2 and the units in row 3, with the timestamps in the third column):
and here is how I'm reading it:
import time
import numpy as np
import pandas as pd

def loaddata(filepaths):
    t1 = time.clock()
    for i in range(len(filepaths)):
        xl = pd.ExcelFile(filepaths[i])
        df = xl.parse(xl.sheet_names[0], header=0, index_col=2,
                      skiprows=[0, 2, 3, 4], parse_dates=True)
        df = df.dropna(axis=1, how='all')
        df = df.drop(['Decimal Year Day', 'Decimal Year Day.1', 'RECORD'], axis=1)
        # snap each timestamp to the nearest minute
        df.index = pd.DatetimeIndex(
            ((df.index.asi8 / (1e9 * 60)).round() * 1e9 * 60).astype(np.int64)).values
        if i == 0:
            dfs = df
        else:
            dfs = pd.concat([dfs, df], axis=1)
    t2 = time.clock()
    print "Files loaded into dataframe in %s seconds" % (t2 - t1)
    return dfs
files = ["London Lysimeters corrected 5min.xlsx"]
data = loaddata(files)
What I need to be able to do is read the column labels AND units (rows 2 and 3) as well as the values into a pandas dataframe, and be able to access the label and unit rows as lists of strings. I can't figure out how to load both rows 2 and 3 and still have the times read correctly into a pandas DatetimeIndex, though it works fine if I only load the labels. I've also looked everywhere and can't figure out how to get the column headers as a list.
I would appreciate it if anyone could help with either of these issues.
Upvotes: 1
Views: 3702
First of all, get rid of that for i in range(len(filepaths))! The pythonic way is for i, filepath in enumerate(filepaths). enumerate yields (index, element) pairs, so you can write ExcelFile(filepath) instead of ExcelFile(filepaths[i]).
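For example, the body of your loop stays the same; only the indexing disappears (a minimal sketch, reusing the parse call from your question):

for i, filepath in enumerate(filepaths):
    xl = pd.ExcelFile(filepath)  # no more filepaths[i]
    df = xl.parse(xl.sheet_names[0], header=0, index_col=2,
                  skiprows=[0, 2, 3, 4], parse_dates=True)
    dfs = df if i == 0 else pd.concat([dfs, df], axis=1)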
I think your two problems are related. If I'm reading your code correctly, when you include rows 2 and 3 the dates can't be parsed, since the timestamp column is no longer homogeneous: it's not all timestamps.
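To illustrate, here's a hypothetical reconstruction of what that column contains once rows 2 and 3 are kept:

import pandas as pd

col = pd.Series(['Timestamp', 'ts', '2012-08-29 07:10:00', '2012-08-29 09:10:00'])
pd.to_datetime(col)  # fails: 'Timestamp' and 'ts' are a label and a unit, not dates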
You could use a hierarchical index to get the data in (column, label, unit) format. It's probably easiest to first read in just the header information, then read the data separately and set the columns after the fact (I don't have Excel handy right now, but I think all the read_csv options I use are also available through xlrd):
In [7]: df_header = pd.read_csv('test.csv', nrows=2, index_col='three')

In [8]: df_header
Out[8]:
               one      two    four
three
Timestamp  Decimal  Decimal  record
ts             ref      ref      rn

In [9]: df_data = pd.read_csv('test.csv', names=df_header.columns,
   ...:                       skiprows=4, parse_dates=True, index_col=2)

In [10]: df_data
Out[10]:
                      one   two  four
2012-08-29 07:10:00  32.1  32.0   232
2012-08-29 09:10:00   1.1   1.2   233

In [11]: cols = pd.MultiIndex.from_tuples([tuple([x] + df_header[x].tolist())
   ....:                                   for x in df_header])

In [12]: cols
Out[12]:
MultiIndex
[one  Decimal  ref, two  Decimal  ref, four  record  rn]

In [14]: df_data.columns = cols

In [15]: df_data
Out[15]:
                          one      two    four
                      Decimal  Decimal  record
                          ref      ref      rn
2012-08-29 07:10:00      32.1     32.0     232
2012-08-29 09:10:00       1.1      1.2     233
This should get you to the point in your code where you start dropping columns and concatenating. Also take a look at the development docs; it looks like the syntax for reading Excel files is getting cleaned up (much nicer!). You might be able to use the parse_cols argument with a list of ints to avoid dropping columns later.
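Untested, since I don't have Excel here, but it would look something like this (the column indices are made up for illustration):

xl = pd.ExcelFile("London Lysimeters corrected 5min.xlsx")
df = xl.parse(xl.sheet_names[0], header=0, index_col=2, parse_dates=True,
              skiprows=[0, 2, 3, 4],
              parse_cols=[2, 5, 6, 7])  # hypothetical: timestamp column plus the data columns you want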
Oh, and you can get the list of strings with df_data.columns.tolist().
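One caveat: once the columns are a MultiIndex, tolist() gives tuples rather than plain strings. If your pandas has MultiIndex.get_level_values, you can pull a single level out as strings:

df_header.columns.tolist()                    # ['one', 'two', 'four']
df_data.columns.tolist()                      # [('one', 'Decimal', 'ref'), ...]
df_data.columns.get_level_values(1).tolist()  # labels: ['Decimal', 'Decimal', 'record']
df_data.columns.get_level_values(2).tolist()  # units:  ['ref', 'ref', 'rn']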
Upvotes: 1