Reputation: 23
I have a cleaned data set that is ready for visualization.
               category_id     views   likes  dislikes  comment_count
   0  politics/celebrities     34785     308        26            413
   1             talk show     69844    3417        33            160
   2             talk show   1496225   16116       236            605
   3             talk show   1497519   15504       353           1084
   4         various video    225286    1731       193            206
 ...                   ...       ...     ...       ...            ...
4119           music clips   6004782  210802      4166          15169
4120             talk show   5564576   46351      2295           2861
4121           music clips   5534278   45128      1591            806
4122           music clips  23502572  676467     15993          52432
4123             talk show   1066451   48068      1032           3992
When I try to draw a boxplot with
data.boxplot('views')
I get an incorrect visualization instead of a normal-looking boxplot. On a small part of the dataset (data[0:10]) it works fine, but on the entire set it does not. What is wrong?
Upvotes: 0
Views: 61
Reputation: 26301
As I said in the comment, your views likely follow a Zipf's law (power-law) distribution.
To illustrate with a reproducible example, let's use Wikipedia's page views for the top 1000 articles on en.wikipedia.org for a single day:
import json
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt

# Top-viewed en.wikipedia.org articles for 2021-04-11
url = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia.org/all-access/2021/04/11'
with urllib.request.urlopen(url) as f:
    data = json.loads(f.read().decode())
df = pd.json_normalize(data, ['items', 'articles'])
Notice how a plain boxplot of views looks very similar to what you observe:
df.boxplot('views')
Now here is the loglog plot that shows the distinctive tell-tale of a Zipf law for that data:
(As it turns out, the rank is already included in the Wikipedia data above, but we'll compute it again for the sake of generality.)
plt.loglog(df['views'].rank(ascending=False), df['views'])
plt.grid(True)
plt.xlabel('rank')
plt.ylabel('views')
plt.title('en.wikipedia views of top 1000 articles\non 2021-04-11')
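To check the straight-line signature numerically rather than by eye, here is a minimal sketch that fits a line to log(views) against log(rank). It runs on synthetic Pareto-distributed numbers as a stand-in (an assumption, not the Wikipedia or YouTube data), and the helper name loglog_slope is hypothetical. A roughly constant slope, around -1 for classic Zipf, is consistent with a power law, though this is a quick check and not a rigorous test.

```python
import numpy as np

def loglog_slope(values):
    """Least-squares slope of log(value) versus log(rank), where rank 1 is the largest value."""
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]  # descending order
    ranks = np.arange(1, len(ordered) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(ordered), 1)
    return slope

# Synthetic stand-in for a 'views' column (assumption: Pareto with shape 1).
rng = np.random.default_rng(0)
synthetic_views = (rng.pareto(1.0, size=5000) + 1.0) * 1000.0

slope = loglog_slope(synthetic_views)
print(f"fitted log-log slope: {slope:.2f}")
```

With real data you would pass df['views'] (or data['views'] from the question) instead of synthetic_views.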
With this in mind, you can see how the notion of "outlier" is hard to define for a power-law distribution. No amount of filtering is going to help your boxplot, as such distributions are roughly scale-invariant.
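The point about filtering can be sketched as follows: apply the usual boxplot rule (points above Q3 + 1.5 * IQR are drawn as outliers), drop those points, and repeat. This is only an illustration on synthetic Pareto data (an assumption, not your dataset), and iqr_outlier_fraction is a hypothetical helper. On such data, a new batch of "outliers" tends to appear after every round of filtering.

```python
import numpy as np

def iqr_outlier_fraction(values):
    """Return (fraction of points above the upper whisker, upper whisker limit)."""
    q1, q3 = np.percentile(values, [25, 75])
    upper = q3 + 1.5 * (q3 - q1)  # standard boxplot whisker rule
    return float(np.mean(values > upper)), upper

# Synthetic stand-in for a heavy-tailed 'views' column (assumption: Pareto, shape 1).
rng = np.random.default_rng(0)
sample = (rng.pareto(1.0, size=5000) + 1.0) * 1000.0

fractions = []
for round_no in range(3):
    frac, upper = iqr_outlier_fraction(sample)
    fractions.append(frac)
    print(f"round {round_no}: {frac:.1%} of points are above the whisker")
    sample = sample[sample <= upper]  # drop the flagged points, then try again
```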
Upvotes: 1