How to concatenate multiple csv to xarray and define coordinates?

I have multiple csv-files, with the same rows and columns and their contained data varies depending on the date. Each csv-file is affiliated with a different date, listed in its name, e.g. data.2018-06-01.csv. A minimal example of my data looks like that: I have the 2 files, data.2018-06-01.csv and data.2019-06-01.csv, that respectively contain

user_id, weight, status
001, 70, healthy
002, 90, healthy

and

user_id, weight, status
001, 72, healthy
002, 103, obese

My Question: How can I concatenate the csv-files into a xarray and also define that the coordinates of the xarray are user_id and date?

I tried the following code

df_all = [] 
date_arr = []

for f in [`data.2018-06-01.csv`, `data.2019-06-01.csv`]:
  date = f.split('.')[1]
  df = pd.read_csv(f)
  df_all.append(df)
  date_arr.append(date)

x_arr = xr.concat([df.to_xarray() for df in df_all], coords=[date_arr, 'user_id'])

but coords=[...] leads to an error. What can I do insted? Thanks

Upvotes: 2

Answers (2)

Djeutsch

Reputation: 83

Recall that although it introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, xarray is inspired by and borrows heavily from pandas. So, to answer the question you can proceed as follows.

from glob import glob
import numpy as np
import pandas as pd

# Get the list of all the csv files in data path
csv_flist = glob(data_path + "/*.csv") 

df_list = []
for _file in csv_flist:
    # get the file name from the data path
    file_name = _file.split("/")[-1]
    
    # extract the date from a file name, e.g. "data.2018-06-01.csv"
    date = file_name.split(".")[1]
    
    # read the read the data in _file
    df = pd.read_csv(_file)
    
    # add a column date knowing that all the data in df are recorded at the same date
    df["date"] = np.repeat(date, df.shape[0])
    df["date"] = df.date.astype("datetime64[ns]") # reset date column to a correct date format
    
    # append df to df_list
    df_list.append(df)

Let's check e.g. the first df in df_list

print(df_list[0])

    status  user_id  weight       date
0  healthy        1      72 2019-06-01
1    obese        2     103 2019-06-01

Concatenate all the dfs along axis=0

df_all = pd.concat(df_list, ignore_index=True).sort_index()
print(df_all)

    status  user_id  weight       date
0  healthy        1      72 2019-06-01
1    obese        2     103 2019-06-01
2  healthy        1      70 2018-06-01
3  healthy        2      90 2018-06-01

Set the index of df_all to a multiIndex of two levels with levels[0] = "date" and levels[1]="user_id".

data = df_all.set_index(["date", "user_id"]).sort_index()
print(data)

                     status  weight
date       user_id                 
2018-06-01 1        healthy      70
           2        healthy      90
2019-06-01 1        healthy      72
           2          obese     103

Subsequently, you can convert the resulting pandas.DataFrame into an xarray.Dataset using .to_xarray() as follows.

xds = data.to_xarray()
print(xds)

<xarray.Dataset>
Dimensions:  (date: 2, user_id: 2)
Coordinates:
  * date     (date) datetime64[ns] 2018-06-01 2019-06-01
  * user_id  (user_id) int64 1 2
Data variables:
    status   (date, user_id) object 'healthy' 'healthy' 'healthy' 'obese'
    weight   (date, user_id) int64 70 90 72 103

Which will fully answer the question.

Upvotes: 3

hula-hula

Reputation: 119

Try these:

    import glob
    import pandas as pd

    path=(r'ur file')
    all_file = glob.glob(path + "/*.csv")
    li = []
    for filename in all_file:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
    frame = pd.concat(li, axis=0, ignore_index=True)

Upvotes: 1

How to concatenate multiple csv to xarray and define coordinates?

Answers (2)

Related Questions