Kurt Peek

Reputation: 57411

How to plot a Pandas data frame with time series as rows?

I'm trying to plot this dataset of COVID-19 deaths as a time series of the number of deaths per country. So far, I've tried this script:

import requests
import pandas as pd
import matplotlib.pyplot as plt


def getdata():
    response = requests.get("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
    with open('data.csv', 'wb') as fp:
        fp.write(response.content)


if __name__ == "__main__":
    getdata()
    df = pd.read_csv('data.csv')

    dfg = df.groupby(by='Country/Region').sum()

    dfg.drop(labels=['Lat', 'Long'], axis=1, inplace=True)

    dfg.columns = pd.to_datetime(dfg.columns)

    dfplot = dfg.plot()

    plt.show()

which produces a data frame like this:

                    2020-01-22  2020-01-23  2020-01-24  ...  2020-03-25  2020-03-26  2020-03-27
Country/Region                                          ...                                    
Afghanistan                  0           0           0  ...           2           4           4
Albania                      0           0           0  ...           5           6           8
Algeria                      0           0           0  ...          21          25          26
Andorra                      0           0           0  ...           1           3           3
Angola                       0           0           0  ...           0           0           0
...                        ...         ...         ...  ...         ...         ...         ...
Venezuela                    0           0           0  ...           0           0           1
Vietnam                      0           0           0  ...           0           0           0
West Bank and Gaza           0           0           0  ...           0           1           1
Zambia                       0           0           0  ...           0           0           0
Zimbabwe                     0           0           0  ...           1           1           1

However, the resulting plot does not show a time series, but rather has different countries on the X-axis:

[image: plot with countries on the x-axis]

I've tried reading the DataFrame.plot documentation to see how I could alter this behavior but it's pretty terse. Any ideas how I might accomplish this?

Upvotes: 2

Views: 1388

Answers (3)

Franck Selsis

Reputation: 101

Sorry if this is not the right place to ask; I'm new here. How would you plot the same curves not as a function of the date, but as a function of the number of days since the 10th death (or any other threshold)? That is, the first day with 10 or more deaths becomes day 1.
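One possible approach, sketched on made-up numbers rather than the real dataset: for each country, keep only the days at or above the threshold and renumber them from 1. The names `pvt`, `threshold` and `by_day` are hypothetical, and the sketch assumes the counts are cumulative and never decrease.

```python
import pandas as pd

# Made-up cumulative counts: dates as index, one column per country.
pvt = pd.DataFrame(
    {"A": [0, 4, 12, 20, 35], "B": [0, 0, 3, 11, 18]},
    index=pd.date_range("2020-03-01", periods=5),
)

threshold = 10  # hypothetical cut-off; any other number works the same way

aligned = {}
for country in pvt.columns:
    s = pvt[country]
    s = s[s >= threshold]                        # keep days at or above the threshold
    aligned[country] = s.reset_index(drop=True)  # replace the dates with 0, 1, 2, ...

# Series of different lengths are padded with NaN at the end.
by_day = pd.DataFrame(aligned)
by_day.index = by_day.index + 1                  # first qualifying day becomes day 1
by_day.index.name = "Days since threshold"

ax = by_day.plot()  # one line per country; call matplotlib.pyplot.show() to display
```

Countries that reached the threshold later simply have shorter curves, since their trailing rows are NaN and are not drawn.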

Upvotes: 0

Kurt Peek

Reputation: 57411

Following wwii's comment, another solution is to simply plot the transpose of the DataFrame, dfg.T.
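To illustrate why the transpose helps, here is a minimal sketch on made-up numbers (the frame `wide` and its values are hypothetical, laid out like dfg). DataFrame.plot uses the index for the x-axis and draws one line per column, so the dates need to be in the index.

```python
import pandas as pd

# Made-up cumulative counts, laid out like dfg: one row per country, one column per date.
wide = pd.DataFrame(
    [[0, 1, 3], [2, 5, 9]],
    index=pd.Index(["A", "B"], name="Country/Region"),
    columns=pd.to_datetime(["2020-03-25", "2020-03-26", "2020-03-27"]),
)

# Plotting `wide` directly would put the countries (its index) on the x-axis.
# After transposing, the dates are the index and each country is a column.
by_date = wide.T

ax = by_date.plot()  # one line per country, dates along the x-axis
# Call matplotlib.pyplot.show() to display the figure when running as a script.
```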

If I also select only the ten countries with the most deaths at the latest date (i.e. sorted by the values of the last column), I arrive at the following script:

import requests
import pandas as pd
import matplotlib.pyplot as plt


def getdata():
    response = requests.get("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
    with open('data.csv', 'wb') as fp:
        fp.write(response.content)


if __name__ == "__main__":
    getdata()
    df = pd.read_csv('data.csv')
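    # numeric_only=True keeps the text column 'Province/State' out of the sum;
    # recent pandas versions no longer drop it automatically.
    dfg = df.groupby(by='Country/Region').sum(numeric_only=True)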
    dfg.sort_values(by=dfg.columns[-1], ascending=False, inplace=True)
    dfg.drop(labels=['Lat', 'Long'], axis=1, inplace=True)
    dfg.columns = pd.to_datetime(dfg.columns)
    dfplot = dfg.iloc[:10].T.plot()
    plt.show()

which produces the same plot as shown in the accepted answer:

[image: time series plot of the ten countries with the most deaths]

Upvotes: 1

Parfait

Reputation: 107567

For a time series plot in pandas, the dates should be in the index, not in the columns. Because the original data arrives with dates as columns, some reshaping is needed:

  • melt to reshape the original data from wide to long format with Date as a column;
  • pivot_table to aggregate and reshape to wide for country as columns with Date as index.
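The two steps can be seen on a toy frame first. This is only a sketch: the numbers are made up, the names `wide`, `long_df` and `pvt` are hypothetical, and the date strings are assumed to be in month/day/year form.

```python
import pandas as pd

# Made-up numbers in a wide layout: one column per date, two provinces for country A.
wide = pd.DataFrame({
    "Province/State": ["P1", "P2", None],
    "Country/Region": ["A", "A", "B"],
    "Lat": [0.0, 1.0, 2.0],
    "Long": [0.0, 1.0, 2.0],
    "3/26/20": [1, 2, 5],
    "3/27/20": [3, 4, 6],
})

# Wide -> long: every date column becomes rows of (Date, Deaths).
long_df = wide.melt(
    id_vars=["Province/State", "Country/Region", "Lat", "Long"],
    var_name="Date", value_name="Deaths",
)
long_df["Date"] = pd.to_datetime(long_df["Date"], format="%m/%d/%y")

# Long -> wide again, now with dates as the index and one column per country.
# aggfunc="sum" adds up the two provinces of country A.
pvt = long_df.pivot_table(index="Date", columns="Country/Region",
                          values="Deaths", aggfunc="sum")
print(pvt)
```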

Then, call DataFrame.plot as intended:

import pandas as pd
import matplotlib.pyplot as plt

df_deaths = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/"
                        "csse_covid_19_time_series/time_series_covid19_deaths_global.csv")

# MELT WIDE DATA TO LONG
df_deaths = (df_deaths.melt(id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long'], 
                            var_name = 'Date', value_name = 'Deaths')
                      .assign(Date = lambda x: pd.to_datetime(x['Date'])))

# PIVOT AGGREGATION TO GENERATE DATE INDEX BY COUNTRY COLUMNS
df_pvt = df_deaths.pivot_table(index='Date', columns='Country/Region', 
                               values='Deaths', aggfunc='sum')

df_pvt.plot(kind='line')

plt.show()

Because the plot above is overwhelming, with nearly every country in the world, consider slicing only a handful of countries, such as the 10 most affected, and passing a matplotlib Axes object for better control of the output:

top_countries = (df_deaths.groupby('Country/Region')['Deaths'].sum()
                          .sort_values(ascending=False))

fig, ax = plt.subplots(figsize=(15,6))

(df_pvt.reindex(top_countries.index.values[:10], axis = 'columns')
       .plot(kind='line', ax = ax))

plt.show()

Plot Output
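As a possible variation, the sketch below ranks countries by the most recent row instead of by the sum over all dates, and sets labels on the Axes directly. Since the counts are cumulative, the two rankings can differ. The frame `pvt` and its numbers are made up, standing in for df_pvt.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up cumulative counts in the shape of df_pvt: dates as index, countries as columns.
pvt = pd.DataFrame(
    {"A": [0, 1, 3], "B": [2, 5, 9], "C": [1, 1, 2]},
    index=pd.to_datetime(["2020-03-25", "2020-03-26", "2020-03-27"]),
)

# Rank by the latest date (last row); keep the top 2 for this toy example.
top = pvt.iloc[-1].nlargest(2).index

fig, ax = plt.subplots(figsize=(15, 6))
pvt[top].plot(kind="line", ax=ax)

# With an explicit Axes, labels and the title can be set directly.
ax.set_ylabel("Cumulative deaths")
ax.set_title("Top countries by latest count")
# Call plt.show() to display the figure when running as a script.
```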

Upvotes: 4
