Reputation: 2014
I have multiple csv-files, with the same rows and columns and their contained data varies depending on the date. Each csv-file is affiliated with a different date, listed in its name, e.g. data.2018-06-01.csv
. A minimal example of my data looks like that: I have the 2 files, data.2018-06-01.csv
and data.2019-06-01.csv
, that respectively contain
user_id, weight, status
001, 70, healthy
002, 90, healthy
and
user_id, weight, status
001, 72, healthy
002, 103, obese
My Question: How can I concatenate the csv-files into a xarray and also define that the coordinates of the xarray are user_id
and date
?
I tried the following code
df_all = []
date_arr = []
for f in [`data.2018-06-01.csv`, `data.2019-06-01.csv`]:
date = f.split('.')[1]
df = pd.read_csv(f)
df_all.append(df)
date_arr.append(date)
x_arr = xr.concat([df.to_xarray() for df in df_all], coords=[date_arr, 'user_id'])
but coords=[...]
leads to an error. What can I do insted? Thanks
Upvotes: 2
Views: 890
Reputation: 83
Recall that although it introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy
-like arrays, xarray
is inspired by and borrows heavily from pandas
. So, to answer the question you can proceed as follows.
from glob import glob
import numpy as np
import pandas as pd
# Get the list of all the csv files in data path
csv_flist = glob(data_path + "/*.csv")
df_list = []
for _file in csv_flist:
# get the file name from the data path
file_name = _file.split("/")[-1]
# extract the date from a file name, e.g. "data.2018-06-01.csv"
date = file_name.split(".")[1]
# read the read the data in _file
df = pd.read_csv(_file)
# add a column date knowing that all the data in df are recorded at the same date
df["date"] = np.repeat(date, df.shape[0])
df["date"] = df.date.astype("datetime64[ns]") # reset date column to a correct date format
# append df to df_list
df_list.append(df)
Let's check e.g. the first df
in df_list
print(df_list[0])
status user_id weight date
0 healthy 1 72 2019-06-01
1 obese 2 103 2019-06-01
Concatenate all the df
s along axis=0
df_all = pd.concat(df_list, ignore_index=True).sort_index()
print(df_all)
status user_id weight date
0 healthy 1 72 2019-06-01
1 obese 2 103 2019-06-01
2 healthy 1 70 2018-06-01
3 healthy 2 90 2018-06-01
Set the index of df_all
to a multiIndex of two levels with levels[0] = "date"
and levels[1]="user_id"
.
data = df_all.set_index(["date", "user_id"]).sort_index()
print(data)
status weight
date user_id
2018-06-01 1 healthy 70
2 healthy 90
2019-06-01 1 healthy 72
2 obese 103
Subsequently, you can convert the resulting pandas.DataFrame
into an xarray.Dataset
using .to_xarray()
as follows.
xds = data.to_xarray()
print(xds)
<xarray.Dataset>
Dimensions: (date: 2, user_id: 2)
Coordinates:
* date (date) datetime64[ns] 2018-06-01 2019-06-01
* user_id (user_id) int64 1 2
Data variables:
status (date, user_id) object 'healthy' 'healthy' 'healthy' 'obese'
weight (date, user_id) int64 70 90 72 103
Which will fully answer the question.
Upvotes: 3
Reputation: 119
Try these:
import glob
import pandas as pd
path=(r'ur file')
all_file = glob.glob(path + "/*.csv")
li = []
for filename in all_file:
df = pd.read_csv(filename, index_col=None, header=0)
li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Upvotes: 1