Reputation: 61

Sum up cells by some columns

I have a table looking like this:

 df <- read.table(text = 
      "  Day            location     gender    hashtags
       'Feb 19 2016'       'UK'      'M'       '#a'
       'Feb 19 2016'       'UK'      'M'       '#b'
       'Feb 19 2016'       'SP'      'F'       '#a'
       'Feb 19 2016'       'SP'      'F'       '#b'
       'Feb 19 2016'       'SP'      'M'       '#a'
       'Feb 19 2016'       'SP'      'M'       '#b'
       'Feb 20 2016'       'UK'      'F'       '#a'", 
                 header = TRUE, stringsAsFactors = FALSE)

And I want to calculate frequencies by day/hashtag/location and gender, having a result table looking like this:

           Day hashtags Daily_Freq men women Freq_UK Freq_SP
   Feb 19 2016       #a          3   2     1       1       2
   Feb 19 2016       #b          3   2     1       1       1
   Feb 20 2016       #a          1   0     1       1       0

where Daily_freq=men+women=Freq_UK+Freq_SP How can I do this?

Upvotes: 3

Answers (3)

Frank

Reputation: 66819

One way...

library(data.table)

setDT(df)
df[, gender := as.factor(gender)]
df[, location := as.factor(location)]

df[, c(
  N = .N, 
  dcast(.SD, . ~ gender, fun.agg = length, drop=FALSE)[, !"."],
  dcast(.SD, . ~ location, fun.agg = length, drop=FALSE)[, !"."]
), by=.(Day, hashtags)]

#            Day hashtags N F M SP UK
# 1: Feb 19 2016       #a 3 1 2  2  1
# 2: Feb 19 2016       #b 3 1 2  2  1
# 3: Feb 20 2016       #a 1 1 0  0  1

Coding it this way might be easier to maintain: column names don't need to be assigned manually; locations and genders will show up or drop out of the result according to whether they appear in the raw data; and column names don't need to be typed out in multiple places (after the conversion to factor).

In case there are country codes that match gender codes, this way will produce duplicate columns. To get around that:

df[, c(
  N = .N, 
  gender = dcast(.SD, . ~ gender, fun.agg = length, drop=FALSE)[, !"."],
  loc = dcast(.SD, . ~ location, fun.agg = length, drop=FALSE)[, !"."]
), by=.(Day, hashtags)]

#            Day hashtags N gender.F gender.M loc.SP loc.UK
# 1: Feb 19 2016       #a 3        1        2      2      1
# 2: Feb 19 2016       #b 3        1        2      2      1
# 3: Feb 20 2016       #a 1        1        0      0      1

Upvotes: 4

Rui Barradas

Reputation: 76402

Using package reshape2.

library(reshape2)

molten <- melt(df, id.vars = c("Day", "hashtags"))
result <- dcast(molten, Day + hashtags ~ variable + value, length)
result$Daily_Freq <- rowSums(result[, c("location_SP", "location_UK")])

result
#          Day hashtags location_SP location_UK gender_F gender_M Daily_Freq
#1 Feb 19 2016       #a           2           1        1        2          3
#2 Feb 19 2016       #b           2           1        1        2          3
#3 Feb 20 2016       #a           0           1        1        0          1

Note that the columns are not in the order of example output. It is simple to reorder them.

Upvotes: 3

Jaap

Reputation: 83215

Using dplyr:

library(dplyr)
df %>% 
  group_by(Day, hashtags) %>% 
  summarise(Daily_Freq = n(),
            men = sum(gender == 'M'),
            women = sum(gender == 'F'),
            Freq_UK = sum(location == 'UK'),
            Freq_SP = sum(location == 'SP'))

gives:

# A tibble: 3 x 7
# Groups:   Day [?]
  Day         hashtags Daily_Freq   men women Freq_UK Freq_SP
  <chr>       <chr>         <int> <int> <int>   <int>   <int>
1 Feb 19 2016 #a                3     2     1       1       2
2 Feb 19 2016 #b                3     2     1       1       2
3 Feb 20 2016 #a                1     0     1       1       0

The same logic implemented in data.table:

library(data.table)
setDT(df)[, .(Daily_Freq = .N,
              men = sum(gender == 'M'),
              women = sum(gender == 'F'),
              Freq_UK = sum(location == 'UK'),
              Freq_SP = sum(location == 'SP'))
          , by = .(Day, hashtags)]

Upvotes: 6

Sum up cells by some columns

Answers (3)

Related Questions