amaity
amaity

Reputation: 267

Convert csv file to pandas dataframe

I have a CSV file in the following format:

DATES, 01-12-2010, 01-12-2010, 01-12-2010, 02-12-2010, 02-12-2010, 02-12-2010
UNITS, Hz, kV, MW,  Hz, kV, MW
Interval, , , , , ,                                           
00:15, 49.82, 33.73755, 34.65, 49.92, 33.9009, 36.33,
00:30, 49.9, 33.7722, 35.34, 49.89, 33.8382, 37.65,
00:45, 49.94, 33.8316, 33.5, 50.09, 34.07745, 37.41,
01:00, 49.86, 33.94875, 30.91, 50.18,   34.20945, 36.11,
01:15, 49.97, 34.2243,  27.28,  50.11,  34.3596, 33.24,
01:30,  50.02,  34.3332, 26.91, 50.12,  34.452, 31.03,
01:45,  50.01,  34.1286, 31.26, 50, 33.9306, 38.86,
02:00,  50.08,  33.9141, 34.96, 50.14,  33.99165, 38.31,
02:15,  50.07,  33.84975, 35.33, 50.01, 33.9537, 39.78,
02:30,  49.97,  34.0263, 33.63, 50.07,  33.8547, 41.48,

I would like to convert the above to a dataframe in the following format:

                    Hz      kV          MW
DATES_Interval
01-12-2010 00:15    49.82   33.73755    34.65
01-12-2010 00:30    49.9    33.7722     35.34
01-12-2010 00:45    49.94   33.8316     33.5
01-12-2010 01:00    49.86   33.94875    30.91
01-12-2010 01:15    49.97   34.2243     27.28
01-12-2010 01:30    50.02   34.3332     26.91
01-12-2010 01:45    50.01   34.1286     31.26
01-12-2010 02:00    50.08   33.9141     34.96
01-12-2010 02:15    50.07   33.84975    35.33
01-12-2010 02:30    49.97   34.0263     33.63
02-12-2010 00:15    49.92   33.9009     36.33
02-12-2010 00:30    49.89   33.8382     37.65
02-12-2010 00:45    50.09   34.07745    37.41
02-12-2010 01:00    50.09   34.07745    37.41
02-12-2010 01:15    50.11   34.3596     33.24
02-12-2010 01:30    50.12   34.452      31.03
02-12-2010 01:45    50      33.9306     38.86
02-12-2010 02:00    50.14   33.99165    38.31
02-12-2010 02:15    50.01   33.9537     39.78
02-12-2010 02:30    50.07   33.8547     41.48

How do I do this with pandas?

Upvotes: 3

Views: 21840

Answers (2)

Aman
Aman

Reputation: 47179

Another solution would be to

  1. Read the dates (first row of csv)
  2. Read in the rest of the data, including intervals
  3. Construct the index as desired and apply it to the dataframe

Here's some sample code:

In [1]: from StringIO import StringIO    
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.10.1'

In [4]: CSV_SAMPLE = """
DATES, 01-12-2010, 01-12-2010, 01-12-2010, 02-12-2010, 02-12-2010, 02-12-2010
UNITS, Hz, kV, MW,  Hz, kV, MW
Interval, , , , , ,
00:15, 49.82, 33.73755, 34.65, 49.92, 33.9009, 36.33,
00:30, 49.9, 33.7722, 35.34, 49.89, 33.8382, 37.65,
00:45, 49.94, 33.8316, 33.5, 50.09, 34.07745, 37.41,
01:00, 49.86, 33.94875, 30.91, 50.18,   34.20945, 36.11,
01:15, 49.97, 34.2243,  27.28,  50.11,  34.3596, 33.24,
01:30,  50.02,  34.3332, 26.91, 50.12,  34.452, 31.03,
"""

#Create one dataframe from just the dates (and we'll grab the units, too)
In [6]: datesdf = pd.read_csv(StringIO(CSV_SAMPLE), nrows= 2)
In [7]: dates, units = datesdf.index.droplevel()

In [9]: dates, units
Out[9]:
((' 01-12-2010',
  ' 01-12-2010',
  ' 01-12-2010',
  ' 02-12-2010',
  ' 02-12-2010',
  ' 02-12-2010'),
 (' Hz', ' kV', ' MW', '  Hz', ' kV', ' MW'))

#Create a second dataframe from the rest of the data
In [11]: data = pd.read_csv(StringIO(CSV_SAMPLE), skiprows=3)
In [12]: data = data.icol([0,1,2])

#Note: Instead, in pandas 0.10, you can use the usecols paramater in read_csv()
#  to combine the above two steps into one. 

In [14]: data.columns = units[:3]
In [15]: print data
          Hz        kV     MW
00:15  49.82  33.73755  34.65
00:30  49.90  33.77220  35.34
00:45  49.94  33.83160  33.50
01:00  49.86  33.94875  30.91
01:15  49.97  34.22430  27.28
01:30  50.02  34.33320  26.91

Now create the desired index and apply it. Here are several approaches for the index.

#We'll need to grab the intervals from this data df
In [16]: intervals = data.index.tolist()  

In [17]: index1 = pd.MultiIndex.from_arrays([dates,intervals])

#This is a multi-index
In [18]: print index1
MultiIndex
[( 01-12-2010, 00:15), ( 01-12-2010, 00:30), ( 01-12-2010, 00:45), ( 02-12-2010, 01:00), ( 02-12-2010, 01:15), ( 02-12-2
010, 01:30)]

#This index is a tuple of date,interval
In [21]: index2 = pd.Index(zip(dates, intervals))
In [22]: print index2
Index([( 01-12-2010, 00:15), ( 01-12-2010, 00:30), ( 01-12-2010, 00:45), ( 02-12-2010, 01:00), ( 02-12-2010, 01:15), ( 0
2-12-2010, 01:30)], dtype=object)

#This index is based on a string concat of date and interval
In [23]: def list_join(x,y):
   ....:         joined = x + ' ' + y
   ....:         return joined.strip()
   ....:

In [24]: index3 = pd.Index(map(list_join, dates, intervals))    
In [25]: print index3
Simple index:
Index([01-12-2010 00:15, 01-12-2010 00:30, 01-12-2010 00:45, 02-12-2010 01:00, 02-12-2010 01:15, 02-12-2010 01:30], dtyp
e=object)

Since third third type of index is what you had in your original request, I'll use that.

In [26]: data.index = index3
In [27]: print data
                     Hz        kV     MW
01-12-2010 00:15  49.82  33.73755  34.65
01-12-2010 00:30  49.90  33.77220  35.34
01-12-2010 00:45  49.94  33.83160  33.50
02-12-2010 01:00  49.86  33.94875  30.91
02-12-2010 01:15  49.97  34.22430  27.28
02-12-2010 01:30  50.02  34.33320  26.91

You may have to modify the code above to handle the complete dataset if it complains about non-unique index values. In that case, read in the Intervals csv column as a data column (not an index) and pull it out as an array to create the desired index, same as above.

Upvotes: 1

Andy Hayden
Andy Hayden

Reputation: 375415

The key to doing this kind of thing in pandas is the stack() method:

df.stack(level=0)

However, I found getting to a place where you can use this, at least particular csv is tricky.. to say the least (there is almost certainly a nicer way to do this!):

df_data = pd.read_csv('e.csv', sep=',\s+', header=None, skiprows=3)[range(7)].set_index(0)
df_cols = pd.read_csv('e.csv', sep=',\s+', header=None, nrows=2).set_index(0)[:2] #interval causing problems    
df_ = df_cols.append(df_data).T.set_index(['DATES','UNITS','Interval']).T
df = df_.stack(level=0)
df_dates = map(lambda x: pd.to_datetime(' '.join(x[::-1])), df.index)
df.index = df_dates

In [7]: df
Out[7]: 
UNITS                   Hz      MW        kV
2010-01-12 00:15:00  49.82   34.65  33.73755
2010-02-12 00:15:00  49.92  36.33,   33.9009
2010-01-12 00:30:00   49.9   35.34   33.7722
2010-02-12 00:30:00  49.89  37.65,   33.8382
2010-01-12 00:45:00  49.94    33.5   33.8316
2010-02-12 00:45:00  50.09  37.41,  34.07745
2010-01-12 01:00:00  49.86   30.91  33.94875
2010-02-12 01:00:00  50.18  36.11,  34.20945
2010-01-12 01:15:00  49.97   27.28   34.2243
2010-02-12 01:15:00  50.11  33.24,   34.3596
2010-01-12 01:30:00  50.02   26.91   34.3332
2010-02-12 01:30:00  50.12  31.03,    34.452
2010-01-12 01:45:00  50.01   31.26   34.1286
2010-02-12 01:45:00     50  38.86,   33.9306
2010-01-12 02:00:00  50.08   34.96   33.9141
2010-02-12 02:00:00  50.14  38.31,  33.99165
2010-01-12 02:15:00  50.07   35.33  33.84975
2010-02-12 02:15:00  50.01  39.78,   33.9537
2010-01-12 02:30:00  49.97   33.63   34.0263
2010-02-12 02:30:00  50.07  41.48,   33.8547

That's a little messy and has commas in some columns!:

def clean(s):
    try: return float(s.strip(','))
    except: return s

In [9]: df.applymap(clean)
Out[9]: 
                        Hz     MW        kV
2010-01-12 00:15:00  49.82  34.65  33.73755
2010-02-12 00:15:00  49.92  36.33  33.90090
2010-01-12 00:30:00  49.90  35.34  33.77220
2010-02-12 00:30:00  49.89  37.65  33.83820
2010-01-12 00:45:00  49.94  33.50  33.83160
2010-02-12 00:45:00  50.09  37.41  34.07745
2010-01-12 01:00:00  49.86  30.91  33.94875
2010-02-12 01:00:00  50.18  36.11  34.20945
2010-01-12 01:15:00  49.97  27.28  34.22430
2010-02-12 01:15:00  50.11  33.24  34.35960
2010-01-12 01:30:00  50.02  26.91  34.33320
2010-02-12 01:30:00  50.12  31.03  34.45200
2010-01-12 01:45:00  50.01  31.26  34.12860
2010-02-12 01:45:00  50.00  38.86  33.93060
2010-01-12 02:00:00  50.08  34.96  33.91410
2010-02-12 02:00:00  50.14  38.31  33.99165
2010-01-12 02:15:00  50.07  35.33  33.84975
2010-02-12 02:15:00  50.01  39.78  33.95370
2010-01-12 02:30:00  49.97  33.63  34.02630
2010-02-12 02:30:00  50.07  41.48  33.85470

Upvotes: 3

Related Questions