maric92

Reputation: 23

Boxplot not displayed correctly in pandas

I have a cleaned data set, prepared for visualization:

        category_id             views   likes   dislikes    comment_count
0       politics/celebrities    34785     308    26         413
1       talk show               69844     3417   33         160
2       talk show               1496225   16116  236        605
3       talk show               1497519   15504  353        1084
4       various video           225286    1731   193        206
... ... ... ... ... ...
4119    music clips             6004782   210802 4166       15169
4120    talk show               5564576   46351  2295       2861
4121    music clips             5534278   45128  1591       806
4122    music clips             23502572  676467 15993      52432
4123    talk show               1066451   48068  1032       3992

When I try to draw a boxplot with

data.boxplot('views')

I get an incorrect visualization instead of a normal-looking boxplot. On a small part of the data set (data[0:10]) it works fine, but on the entire set it does not. What is wrong?

Upvotes: 0

Views: 61

Answers (1)

Pierre D

Reputation: 26301

As I said in the comment, your views likely follow a Zipf's law distribution: a few items get a very large share of the views, while most get comparatively few.

To illustrate with a reproducible example, let's use Wikipedia's page views for the top 1000 articles on en.wikipedia.org for a single day:

import json
import urllib.request

import pandas as pd

url = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia.org/all-access/2021/04/11'
with urllib.request.urlopen(url) as f:
    data = json.loads(f.read().decode())

df = pd.json_normalize(data, ['items', 'articles'])

Notice how a plain boxplot of views looks very similar to what you observe:

df.boxplot('views')

Now here is the log-log plot, which shows the telltale sign of Zipf's law in that data: a roughly straight line.

(As it turns out, the rank is already included in the Wikipedia data above, but we'll compute it again for the sake of generality.)

import matplotlib.pyplot as plt

# rank 1 = most viewed article
plt.loglog(df['views'].rank(ascending=False), df['views'])
plt.grid(True)
plt.xlabel('rank')
plt.ylabel('views')
plt.title('en.wikipedia views of top 1000 articles\non 2021-04-11')
plt.show()
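
One way to put a number on that straight line is to fit its slope in log-log space. The sketch below uses synthetic data rather than the Wikipedia download: the `ranks` and `views` arrays, the constant `1e6`, and the exponent of 1 are all assumptions chosen for illustration. On real data the fit will be noisier, and a least-squares fit on a log-log plot is only a rough check, not a rigorous test for a power law.

```python
import numpy as np

# Synthetic, idealized Zipf-like data (an assumption, not the Wikipedia data):
# the item at rank r gets 1e6 / r views.
ranks = np.arange(1, 1001)
views = 1e6 / ranks

# A power law views = C * rank**slope is a straight line in log-log space:
# log10(views) = log10(C) + slope * log10(rank).
slope, intercept = np.polyfit(np.log10(ranks), np.log10(views), 1)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With the DataFrame from above, the same fit could be applied to df['views'].rank(ascending=False) and df['views']; a slope close to -1 is what classic Zipf behavior would suggest.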

With this in mind, you can see how the notion of "outlier" is hard to define for a power-law distribution. No amount of filtering is going to help your boxplot, as such distributions are roughly scale-invariant.
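
To make the filtering point concrete, here is a small sketch, again on synthetic Zipf-like data (the `1e6 / rank` values and the helper `upper_outliers` are hypothetical, introduced only for this illustration). It applies the usual boxplot rule, flagging points above Q3 + 1.5 * IQR, drops them, and repeats. On this idealized data, each pass still flags a noticeable share of what remains.

```python
import numpy as np

def upper_outliers(values):
    """Boolean mask of points above the upper whisker (Q3 + 1.5 * IQR)."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    return values > q3 + 1.5 * (q3 - q1)

# Synthetic, idealized Zipf-like views (an assumption for illustration).
views = 1e6 / np.arange(1, 1001)

fractions = []
remaining = views
for _ in range(3):
    mask = upper_outliers(remaining)
    fractions.append(mask.mean())   # share of points flagged in this pass
    remaining = remaining[~mask]    # "filter" them out and try again

print([round(f, 3) for f in fractions])
```

The same loop could be run on data['views'].to_numpy() to check how your own data behaves.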

Upvotes: 1
