Reputation: 357
I am importing a CSV file to a Pandas dataframe. The CSV file is something like:
Time, Status, Variable, freq_1, freq_2, freq_3, .....
1/1/2000, Hi, A, 0.1, 3.3, 8.1, ....
1/1/2000, Hi, B, 2.4, 1.2, 1.3, ....
1/1/2000, Lo, A, 4.5, 6.9, 6.4, ....
1/1/2000, Lo, B, 7.1, 8.8, 2.3, ....
2/1/2000, Hi, A, 0.1, 3.3, 8.1, ....
2/1/2000, Hi, B, 2.4, 1.2, 1.3, ....
2/1/2000, Lo, A, 4.5, 6.9, 6.4, ....
2/1/2000, Lo, B, 7.1, 8.8, 2.3, ....
....
I read it into a dataframe with a multi-index, using Time, Status and Variable as the index levels.
I would now like to import the dataframe into Xarray using Pandas to_xarray or Xarray from_dataframe. However, both of these methods appear to choke on the index, throwing the error:
TypeError: Could not convert tuple of form (dims, data[, attrs, encoding]): (0, DatetimeIndex(['2018-01-12 00:15:00', '2018-01-12 00:45:00',
'2018-01-12 01:15:00', '2018-01-12 01:45:00',
'2018-01-12 02:15:00', '2018-01-12 02:45:00',
'2018-01-12 03:15:00', '2018-01-12 03:45:00',
'2018-01-12 04:15:00', '2018-01-12 04:45:00',
...
'2019-11-01 16:15:00', '2019-11-01 17:15:00',
'2019-11-01 17:45:00', '2019-11-01 18:15:00',
'2019-11-01 18:45:00', '2019-11-01 19:15:00',
'2019-11-01 20:45:00', '2019-11-01 21:15:00',
'2019-11-01 21:45:00', '2019-11-01 22:15:00'],
dtype='datetime64[ns]', name=0, length=3100, freq=None)) to Variable.
I have also tried using the Xarray.DataArray procedure:
Mytime = PD.to_datetime(df[0],infer_datetime_format=True)
Myfreq = np.arange(40)  # 0, 1, 2, ..., 39
OutDataArray = Xarray.DataArray(df[np.arange(3,43)], coords=[('time', Mytime ), ('freq', Myfreq ), ('Status', df[1]), ('variable', df[2])])
but this gave the error:
ValueError: coords is not dict-like, but it has 4 items, which does not match the 2 dimensions of the data
So, how does one import a Pandas dataframe into Xarray if the dataframe is 2D, but one of those dimensions (i.e. the rows) actually consists of multiple dimensions specified by the multi-index of the dataframe?
As requested, here is an example script that reproduces the problem. Note you will need to set a filename for the CSV file of the example data that gets imported:
import numpy as np
import pandas as PD
# create some data
dt = PD.date_range(start='01/01/2000 00:00:00', end='02/01/2000 00:00:00', freq='30min')
ldt = len(dt)
# label columns: one constant fruit name per frame
vr1 = PD.Series(['apple'] * ldt)
vr2 = PD.Series(['orange'] * ldt)
vr3 = PD.Series(['peach'] * ldt)
# combine the data to mimic my file format
a = PD.Series([1,2,3,4], index=[7,2,8,9])
b = PD.Series([5,6,7,8], index=[7,2,8,9])
df1 = PD.DataFrame({'Time': dt,'Fruit':vr1, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df2 = PD.DataFrame({'Time': dt,'Fruit':vr2, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df3 = PD.DataFrame({'Time': dt,'Fruit':vr3, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df_unsorted = PD.concat([df1, df2, df3])
df = df_unsorted.sort_values(by=['Time', 'Fruit'])
# write the data to a csv file
filename = 'Give a file path/name here'
df.to_csv(filename, index=False)
#import the csv file and convert to an xarray
df2 = PD.read_csv(filename, sep=',', skiprows=1, header=None, skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
da = df2.to_xarray()
Reputation: 1175
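The error appears to come from the columns and index levels of the resulting DataFrame not being named. With skiprows=1 and header=None, pandas labels them with integers (0, 1, 2, ...), which is the name=0 you can see in the traceback. Replacing the last two lines of your code example with: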
df2 = PD.read_csv(filename, sep=',', skiprows=1, header=None, skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
df2.columns = ['N1', 'N2', 'N3']
df2.index.names = ['time', 'fruit']
ds = df2.to_xarray()
results in a successful conversion to an xarray Dataset:
print(ds)
<xarray.Dataset>
Dimensions:  (fruit: 3, time: 1489)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T00:30:00 ... 2000-02-01
  * fruit    (fruit) object 'apple' 'orange' 'peach'
Data variables:
    N1       (time, fruit) float64 0.114 0.3726 0.5072 ... 0.2065 0.9082 0.7945
    N2       (time, fruit) float64 0.7534 0.1107 0.8866 ... 0.4509 0.5218 0.1472
    N3       (time, fruit) float64 0.156 0.6498 0.3521 ... 0.3742 0.5899 0.607
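To see the same behaviour without going through a CSV file, here is a minimal in-memory sketch. The sizes, the random values and the variable names are illustrative assumptions, not taken from your data. It only shows that once the index levels carry names, each level becomes a dimension and each column becomes a data variable:

```python
import numpy as np
import pandas as pd

# Small stand-in for the CSV data: 4 timestamps x 3 fruits (illustrative sizes).
times = pd.date_range("2000-01-01", periods=4, freq="30min")
fruits = ["apple", "orange", "peach"]

# The level names given here are what to_xarray() uses as dimension names.
index = pd.MultiIndex.from_product([times, fruits], names=["time", "fruit"])

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((len(index), 3)),
    index=index,
    columns=["N1", "N2", "N3"],
)

# Each named index level becomes a dimension; each column becomes a data variable.
ds = df.to_xarray()
print(ds)
```

The 12 rows of the DataFrame are unstacked into a 4 x 3 grid per variable, so the row order of the original file does not matter as long as every (time, fruit) pair is unique.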
Update: you can skip manually setting the column and index names by removing the skiprows=1 and header=None arguments in PD.read_csv(), so that the column names are taken from the csv header. Your last two lines then look like:
df2 = PD.read_csv(filename, sep=',', skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
ds = df2.to_xarray()
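The same approach should carry over to the three-level layout in your original file (Time, Status, Variable). Below is a rough sketch using the sample rows from the question, read from a string instead of a file. The day-first date format and the dimension name "freq" are my assumptions. The last step, Dataset.to_array, is optional and stacks the freq_* columns into a single DataArray with an extra dimension:

```python
import io
import pandas as pd

# Sample rows from the question, truncated to three freq columns.
csv_text = """Time, Status, Variable, freq_1, freq_2, freq_3
1/1/2000, Hi, A, 0.1, 3.3, 8.1
1/1/2000, Hi, B, 2.4, 1.2, 1.3
1/1/2000, Lo, A, 4.5, 6.9, 6.4
1/1/2000, Lo, B, 7.1, 8.8, 2.3
2/1/2000, Hi, A, 0.1, 3.3, 8.1
2/1/2000, Hi, B, 2.4, 1.2, 1.3
2/1/2000, Lo, A, 4.5, 6.9, 6.4
2/1/2000, Lo, B, 7.1, 8.8, 2.3
"""

# Keep the header row so the columns arrive with names.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Parse the dates explicitly; day-first is an assumption about the sample.
df["Time"] = pd.to_datetime(df["Time"], format="%d/%m/%Y")

# Three named index levels give three dimensions.
df = df.set_index(["Time", "Status", "Variable"])
ds = df.to_xarray()

# Optional: stack the freq_* data variables along a new "freq" dimension.
da = ds.to_array(dim="freq")
print(da.dims)
```

Selecting a single value then works by label, for example da.sel(freq="freq_2", Time=pd.Timestamp("2000-01-01"), Status="Lo", Variable="B").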