user2064809
user2064809

Reputation: 403

Visualise missing values in a time series heatmap

I am really new in big data analysing. Let's say I have a big data with the following features. I want to visualise the the percentage of missing values (None values) of fuel parameters for every id in specific hour. I want to draw a chart that x-axis is the time series (time column), y-axis is the 'id' and the colour will indicate its missing fuel percentage. I grouped the data base on 'id' and 'hour'

I don't know how to visualise missing value in a good way for all ids. For example if the percentage of missing value fuel of specific id in specific hour is 100% then the colour in that specific time and for that 'id' can be gray. If percentage of missing value in fuel is 50%, the colour can be light green. If percentage of missing value in fuel is 0% then the colour can be dark green. The colour must be based to the percentage of missing value in fuel, after grouping based on id and time.

    id    time                   fuel
0   1     2022-02-26 19:08:33    100
2   1     2022-02-26 20:09:35    None
3   2     2022-02-26 21:09:35    70
4   3     2022-02-26 21:10:55    60
5   4     2022-02-26 21:10:55    None
6   5     2022-02-26 22:12:43    50
7   6     2022-02-26 23:10:50    None

So for example, in the following code I computed the percentage of the missing value for every hour for specific id:

df.set_index('ts').groupby(['id', pd.Grouper(freq='H')])['fuell'].apply(lambda x: x.isnull().mean() * 100)

Is there any solution?

Upvotes: 2

Views: 2306

Answers (2)

tdy
tdy

Reputation: 41407

Update: The heatmap now plots id vs time vs percentage of null fuel. I've kept my original answer for id vs time vs fuel at the end of this post.


I want something almost like a github style calendar.

To mimic the GitHub contribution matrix, reset the grouped null percentages into a dataframe and pivot into 1 id per row and 1 hour per column. Then use sns.heatmap to color each cell based on percentage of null fuel.

# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')

# compute null percentage per (id, hour)
nulls = (df.set_index('time')
           .groupby(['id', pd.Grouper(freq='H')])['fuel']
           .apply(lambda x: x.isnull().mean() * 100))

# pivot into id vs time matrix
matrix = (nulls.reset_index(name='null (%)')
               .pivot(index='id', columns='time', values='null (%)'))

# plot time series heatmap
sns.heatmap(matrix, square=True, vmin=0, vmax=100, cmap='magma_r', cbar_kws={'label': 'null (%)'},
            linewidth=1, linecolor='lightgray', clip_on=False,
            xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))


Original: This is for visualizing id by time by fuel:

  1. Pivot into an id vs time matrix. Normally pivot is fine, but since your real data contains duplicate indexes, use pivot_table.
  2. resample the time columns into hourly means.
  3. Plot the time series matrix using sns.heatmap.
# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')

# pivot into id vs time matrix
matrix = df.pivot_table(index='id', columns='time', values='fuel', dropna=False)

# resample columns into hourly means
matrix = matrix.resample('H', axis=1).mean()

# plot time series heatmap
sns.heatmap(matrix, square=True, cmap='plasma_r', vmin=0, vmax=100, cbar_kws={'label': 'fuel (%)'},
            linewidth=1, linecolor='lightgray', clip_on=False,
            xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))

Upvotes: 2

Zelemist
Zelemist

Reputation: 640

There is no right answer concerning missing values visualization, I guess it depends on your uses, habits ...

But first, to make it works, we need to preprocess your dataframe and make it analyzable, aka ensure its dtypes.

First let's build our data :

import pandas as pd
from io import StringIO
    
csvfile = StringIO(
"""id   time    fuel
1   2022-02-26 19:08:33 100
2   2022-02-26 19:09:35 70
3   2022-02-26 19:10:55 60
4   2022-02-26 20:10:55 None
5   2022-02-26 21:12:43 50
6   2022-02-26 22:10:50 None""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')

df
Out[65]: 
   id                 time  fuel
0   1  2022-02-26 19:08:33   100
1   2  2022-02-26 19:09:35    70
2   3  2022-02-26 19:10:55    60
3   4  2022-02-26 20:10:55  None
4   5  2022-02-26 21:12:43    50
5   6  2022-02-26 22:10:50  None

At this stage almost all data in our dataframe is string related, you need to convert fuel and time into a non-object dtypes.

df.dtypes
Out[66]: 
id       int64
time    object
fuel    object
dtype: object

Time should be converted as datetime, id as int and fuel as float. Indeed, None should be convert as np.nan for numeric values, which needs the float dtype.

With a map, we can easily change all 'None' values into np.nan. I won't go deeper here, but for simplicity sake, I'll use a custom subclass of dict with a __missing__ implementation

df.time = pd.to_datetime(df.time, format = "%Y/%m/%d %H:%M:%S")

class dict_with_missing(dict):
    def __missing__(self, key):
        return key
map_dict = dict_with_missing({'None' : np.nan})
df.fuel = df.fuel.map(map_dict).astype(np.float32)

Then we have a clean dataframe :

df
Out[68]: 
   id                time   fuel
0   1 2022-02-26 19:08:33  100.0
1   2 2022-02-26 19:09:35   70.0
2   3 2022-02-26 19:10:55   60.0
3   4 2022-02-26 20:10:55    NaN
4   5 2022-02-26 21:12:43   50.0
5   6 2022-02-26 22:10:50    NaN

df.dtypes
Out[69]: 
id               int64
time    datetime64[ns]
fuel           float32
dtype: object

Then, you can easily use bar, matrix or heatmap from the missingno module

msno.bar(df)
msno.matrix(df, sparkline=False)
msno.heatmap(df, cmap="RdYlGn")

A side note here, heatmap is kind of useless here, since it compares columns having missing values. And you only have one column with missing value. But for a bigger dataframe (~ 5/6 columns with missing values) it can be useful.

For a quick and dirty visualization, you can also print the number of missing value (aka np.nan, in pandas/numpy formulation) :

df.isna().sum()
Out[72]: 
id      0
time    0
fuel    2
dtype: int64

Upvotes: 0

Related Questions