Marius

Reputation: 11

Seaborn – how to interpret the values of the x-axis in a distplot?

I have a dataframe in which each row represents a date where a number of events were logged in a database.

Every event has a date it concerns, so for example an event that was logged on 2017-02-03 might belong to 2017-02-02 (meaning it was logged the day after it happened).

Screenshot of the dataframe head

I'm trying to visualize the distribution of each column in a distplot, to get an idea of the distance between when an event was logged, and the date it concerns ("Do people log events on the same day, the day after, or even later?").

So far I've made a function that iterates over each column and plots it into a seaborn distplot.

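import matplotlib.pyplot as plt
import seaborn as sns

def plot(dates):
    plt.figure(figsize=(45,25))
    # Every column is drawn as a histogram into the same figure
    for date in dates:
        sns.distplot(df[date], kde=False, bins=len(dates))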

The plot then looks like this (linked image).

However, I can't understand how to interpret the values on the x-axis.

It shows a range from 0 to 3500. What does that mean?

Are there any better ways to visualize this?

Upvotes: 1

Views: 2234

Answers (2)

IlonaV

Reputation: 21

In your current script you loop over the data and plot all the output in the same figure. In the example plot that you provide, the x-axis shows the occurrence of events on 2017-02-28, which I assume is the last date in your dataset. However, the different colors represent the data from the other dates, which are plotted in the same figure inside your loop.

About the interpretation of the plot: the x-axis shows the number of events per day. Towards the right side of the figure you can see that there is usually only one day when a large number (> 1000 or so) of events is recorded. From the left side of the figure you can tell that there are around 50 days when only one event is recorded.

A simple bar chart might be easier to interpret: it will show you the date on the x-axis and the number of events recorded on the y-axis. You could plot and save a separate bar chart figure for each date by modifying your function as follows:

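import numpy as np
import matplotlib.pyplot as plt

def plot(dates):
  # One bar per row of df, assuming each row is a logging date
  positions = np.arange(len(df))
  for date in dates:
    plt.figure()
    # align='edge' starts each bar at its position, so +0.5 below centers the tick
    plt.bar(positions, df[date], width=1.0, align='edge')
    ax = plt.gca()
    ax.set_xticks(positions + 0.5)
    ax.set_xticklabels(df.index)
    plt.savefig('barchart_' + str(date) + '.png')
    plt.close()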

There are probably more elegant ways than this to study your data, but I hope this helps you in getting forward.
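One possibly more direct way to get at the original question ("same day, the day after, or even later?") is to add up the event counts by the number of days between logging and the date concerned. The following is only a minimal sketch: it assumes the row index of the dataframe holds the logging dates, the column labels hold the dates the events concern, and the cells hold event counts. None of that is confirmed by the question, and lag_distribution is a hypothetical helper name.

```python
import pandas as pd

def lag_distribution(df):
    """Total event count per lag in days (logging date minus date concerned).

    Assumes df.index holds logging dates and df.columns holds the dates
    the events concern; both assumptions are not confirmed by the question.
    """
    logged = pd.to_datetime(df.index)
    concerns = pd.to_datetime(df.columns)
    totals = {}
    for i, log_date in enumerate(logged):
        for j, event_date in enumerate(concerns):
            lag = (log_date - event_date).days
            totals[lag] = totals.get(lag, 0) + int(df.iloc[i, j])
    return pd.Series(totals).sort_index()

# Tiny made-up example: two logging dates, two dates concerned.
example = pd.DataFrame(
    [[5, 0], [3, 4]],
    index=["2017-02-02", "2017-02-03"],
    columns=["2017-02-02", "2017-02-03"],
)
lags = lag_distribution(example)
print(lags)
```

Calling lags.plot(kind="bar") on the result would then give one bar per lag, which reads as "how many events were logged n days after they happened".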

Upvotes: 2

ImportanceOfBeingErnest

Reputation: 339200

sns.distplot is a histogram. That means that it shows how often a certain value falls into a certain bin.

Here, you calculate the histogram of every column. So the plot shows how often a certain value occurs in each column: "how often" is on the y-axis, and the value is on the x-axis.

Because you're doing it for every one of the n columns of the dataframe, you end up with n different histograms (each with a different color).

For example, there is only one value above 3000 in each column, so you see a small bar around 3000 in the plot. On the other hand, there are many values between 0 and 100 in each column, so you see a large bar around 0.
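To make the binning concrete, here is a small sketch using numpy.histogram on a made-up column (the numbers are invented for illustration and are not taken from the question's data). It counts how many values fall into each bin, which is what the bar heights in a histogram stand for.

```python
import numpy as np

# Hypothetical column: many small daily counts plus one very busy day.
counts = np.array([0, 1, 1, 2, 5, 40, 80, 95, 3200])

# Four equal-width bins covering 0 to 3500, like the x-axis range in the question.
frequencies, edges = np.histogram(counts, bins=4, range=(0, 3500))

# Each line shows one bin (x-axis range) and how many values landed in it (y-axis).
for left, right, freq in zip(edges[:-1], edges[1:], frequencies):
    print(f"{left:6.0f} to {right:6.0f}: {freq} value(s)")
```

The first bin collects the eight small values and the last bin holds the single large one, which mirrors the tall bar near 0 and the small bar near 3000 described above.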

Upvotes: 1
