John Conor
John Conor

Reputation: 884

How to convert data from wide to long so values are plotted against time

I have a time series data set of multiple IDs and multiple variables, each variable has 3 time series entries - "baseline", "3 month", "6 month". The dataframe is structured like this, df =

import pandas as pd

data = {'Patient ID': [11111, 11111, 11111, 11111, 22222, 22222, 22222, 22222, 33333, 33333, 33333, 33333, 44444, 44444, 44444, 44444, 55555, 55555, 55555, 55555],
        'Lab Attribute': ['% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)', '% Saturation- Iron', 'ALK PHOS', 'ALT(SGPT)', 'AST (SGOT)'],
        'Baseline': [46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0, 46.0, 94.0, 21.0, 18.0],
        '3 Month': [23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0, 23.0, 82.0, 13.0, 17.0],
        '6 Month': [34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0, 34.0, 65.0, 10.0, 14.0]}
df = pd.DataFrame(data)

    Patient ID       Lab Attribute  Baseline  3 Month  6 Month
0        11111  % Saturation- Iron      46.0     23.0     34.0
1        11111            ALK PHOS      94.0     82.0     65.0
2        11111           ALT(SGPT)      21.0     13.0     10.0
3        11111          AST (SGOT)      18.0     17.0     14.0
4        22222  % Saturation- Iron      46.0     23.0     34.0
5        22222            ALK PHOS      94.0     82.0     65.0
6        22222           ALT(SGPT)      21.0     13.0     10.0
7        22222          AST (SGOT)      18.0     17.0     14.0
8        33333  % Saturation- Iron      46.0     23.0     34.0
9        33333            ALK PHOS      94.0     82.0     65.0
10       33333           ALT(SGPT)      21.0     13.0     10.0
11       33333          AST (SGOT)      18.0     17.0     14.0
12       44444  % Saturation- Iron      46.0     23.0     34.0
13       44444            ALK PHOS      94.0     82.0     65.0
14       44444           ALT(SGPT)      21.0     13.0     10.0
15       44444          AST (SGOT)      18.0     17.0     14.0
16       55555  % Saturation- Iron      46.0     23.0     34.0
17       55555            ALK PHOS      94.0     82.0     65.0
18       55555           ALT(SGPT)      21.0     13.0     10.0
19       55555          AST (SGOT)      18.0     17.0     14.0

What I'm trying to do is group by the ID and Lab Attribute and create a plot of each "Lab Attribute" - % Saturation- Iron, ALK PHOS, etc., that will include the time series for all of the Patient IDs.

So based on the example data there would be 4 plots - % Saturation- Iron, ALK PHOS, etc., that each would contain 5 traces (1 for each ID).

I tried using groupby per this article - Creating a time-series plot with data in long format in python?

Though it just plots everything on one chart.

This is the code I have so far:

df_labs = pd.read_csv("/Users/johnconor/Documents/Python/gut_microbiome/out/nw_labs_up_to_6mon.csv")
df_labs = df_labs.fillna(method='ffill')

dfl = df_labs.groupby(['Patient_ID', 'Lab_Attribute'])['Baseline','3 Month','6 Month'].sum().plot()

This was the outcome:

[![enter image description here][1]][1]

Part of the problem I'm having is that all of the examples I can find have long format data with only 1 values column. Not a number of values over time.

I also tried to utilize the approach for multiple plots per this article - Creating a time-series plot with data in long format in python?

n_ids = df_labs.Patient_ID.unique().size
n_cols = int(n_ids ** 0.5)
n_rows = n_cols + (1 if n_ids % n_cols else 0)                   
fig, axes = plt.subplots(n_rows, n_cols)
axes = axes.ravel()
for i, (id, att, base,three,six) in enumerate(df_labs.groupby(['Patient_ID', 'Lab_Attribute'])['Baseline','3 Month','6 Month'].sum().reset_index()):
    print(idx)
    series.plot(ax=axes[i], title=f"ID:{idx}")
fig.tight_layout()

Though am running into issues as it is again designed for only 1 set of values. Producing the error:

ValueError: too many values to unpack (expected 5)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-42cf5bc14bdb> in <module>
      4 fig, axes = plt.subplots(n_rows, n_cols)
      5 axes = axes.ravel()
----> 6 for i, (id, att, base,three,six) in enumerate(df_labs.groupby(['Patient_ID', 'Lab_Attribute'])['Baseline','3 Month','6 Month'].sum().reset_index()):
      7     print(idx)
      8     series.plot(ax=axes[i], title=f"ID:{idx}")

ValueError: too many values to unpack (expected 5)

Upvotes: 2

Views: 304

Answers (1)

Trenton McKinney
Trenton McKinney

Reputation: 62513

  • The shape of the dataframe should be changed to a long form using .melt, which will allow the month to be used as a time axis.
  • It will be easiest to use seaborn.relplot with kind='line', to create the visualization.
    • Change col, row, and/or hue to adjust how the data should be grouped. Do not change x and y.
  • To prevent sharing y, see Prevent Sharing of Y Axes in Seaborn Relplot
import pandas as pd
import seaborn as sns

# reshape the dataframe
dfm = df.melt(id_vars=['Patient ID', 'Lab Attribute'], var_name='Months')

# change the Months values to numeric
dfm.Months = dfm.Months.map({'Baseline': 0, '3 Month': 3, '6 Month': 6})

# display(dfm.head())
   Patient ID       Lab Attribute  Months  value
0       11111  % Saturation- Iron       0   46.0
1       11111            ALK PHOS       0   94.0
2       11111           ALT(SGPT)       0   21.0
3       11111          AST (SGOT)       0   18.0
4       22222  % Saturation- Iron       0   46.0

# plot a figure level line plot with seaborn
p = sns.relplot(data=dfm, col='Lab Attribute', x='Months', y='value', hue='Patient ID', kind='line', col_wrap=4, marker='o', palette='husl')
  • The lines are stacked because the data values are all the same

enter image description here

p = sns.catplot(data=dfm, col='Lab Attribute', x='Months', y='value', hue='Patient ID', kind='bar', col_wrap=4)

enter image description here

Upvotes: 3

Related Questions