Tobias van Elferen

Reputation: 489

Using summarise with weighted mean from dplyr in R

I'm trying to tidy a dataset, using dplyr. My variables contain percentages and straightforward values (in this case, page views and bounce rates). I've tried to summarize them this way:

require(dplyr)
df<-df%>%
   group_by(pagename)%>%
   summarise(pageviews=sum(pageviews), bounceRate= weighted.mean(bounceRate,pageviews))

But this returns:

 Error: 'x' and 'w' must have the same length

My dataset does not have any NAs in either the page views or the bounce rates. I'm not sure what I'm doing wrong. Maybe summarise() doesn't work with weighted.mean()?

EDIT

I've added some data:

### Source: local data frame [4 x 3]
###
###   pagename bounceRate pageviews
###      (chr)      (dbl)     (dbl)
### 1     url1   72.22222      1176
### 2     url2   46.42857       733
### 3     url2   76.92308       457
### 4     url3   62.06897       601

Upvotes: 28

Views: 39410

Answers (2)

MrFlick

Reputation: 206546

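The summarize() command evaluates its arguments in the order they appear, and each newly created column is immediately visible to the expressions that follow it. Because you overwrite pageviews with its per-group sum first, weighted.mean() then receives that single summed value as the weights instead of the original per-row values, so the lengths no longer match in any group with more than one row. It's safer to use different names: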

df %>%
   group_by(pagename)%>%
   summarise(pageviews_sum = sum(pageviews), 
      bounceRate_mean = weighted.mean(bounceRate,pageviews))

And if you really want the original names, you can rename afterward:

df %>%
   group_by(pagename) %>%
   summarise(pageviews_sum = sum(pageviews), 
      bounceRate_mean = weighted.mean(bounceRate,pageviews)) %>% 
   rename(pageviews = pageviews_sum, bounceRate = bounceRate_mean)
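
As a quick check, here is a minimal sketch that rebuilds the four sample rows from the question and runs the renamed version. The object name result is only for illustration, and the data frame is typed in by hand from the printed output, so treat it as an approximation of the asker's real data.

```r
library(dplyr)

# Sample rows copied by hand from the question's printed output
df <- data.frame(
  pagename   = c("url1", "url2", "url2", "url3"),
  bounceRate = c(72.22222, 46.42857, 76.92308, 62.06897),
  pageviews  = c(1176, 733, 457, 601)
)

# New output names leave the per-row pageviews column untouched,
# so it is still available as the weights for weighted.mean()
result <- df %>%
  group_by(pagename) %>%
  summarise(
    pageviews_sum   = sum(pageviews),
    bounceRate_mean = weighted.mean(bounceRate, pageviews)
  )

print(result)
```

For url2, the only page with two rows, this should give a page view total of 1190 and a weighted bounce rate of roughly 58.14.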

Upvotes: 41

Tobias van Elferen

Reputation: 489

I've found the solution. Since pageviews=sum(pageviews) is evaluated before bounceRate= weighted.mean(bounceRate,pageviews), pageviews has already been reduced to a single value per group and is therefore shorter than bounceRate, which triggers the error.

The solution is simple, just switch them:

require(dplyr)
df<-df%>%
  group_by(pagename)%>%
  summarise(bounceRate= weighted.mean(bounceRate,pageviews),pageviews=sum(pageviews))
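
To see why the order matters, the computation for a single group can be reproduced in base R without dplyr. This is only a sketch of what happens inside one group (url2 from the sample data); the variable names pageviews_total, msg and wm are made up for the illustration.

```r
# The two url2 rows from the question's sample data
bounceRate <- c(46.42857, 76.92308)
pageviews  <- c(733, 457)

# Summing first, as the original summarise() call did, leaves one value
pageviews_total <- sum(pageviews)

# Two values but only one weight: weighted.mean() stops with an error,
# which is captured here as a string instead of aborting the script
msg <- tryCatch(
  weighted.mean(bounceRate, pageviews_total),
  error = function(e) conditionMessage(e)
)
print(msg)

# With the original per-row weights the lengths match
wm <- weighted.mean(bounceRate, pageviews)
print(wm)
```

Note that the reordered call only works because nothing after pageviews=sum(pageviews) needs the per-row values. Adding another weighted column after it would bring the error back.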

Upvotes: 8
