Reputation: 61
I have a table looking like this:
df <- read.table(text =
" Day location gender hashtags
'Feb 19 2016' 'UK' 'M' '#a'
'Feb 19 2016' 'UK' 'M' '#b'
'Feb 19 2016' 'SP' 'F' '#a'
'Feb 19 2016' 'SP' 'F' '#b'
'Feb 19 2016' 'SP' 'M' '#a'
'Feb 19 2016' 'SP' 'M' '#b'
'Feb 20 2016' 'UK' 'F' '#a'",
header = TRUE, stringsAsFactors = FALSE)
And I want to calculate frequencies by day/hashtag/location and gender, having a result table looking like this:
Day hashtags Daily_Freq men women Freq_UK Freq_SP
Feb 19 2016 #a 3 2 1 1 2
Feb 19 2016 #b 3 2 1 1 1
Feb 20 2016 #a 1 0 1 1 0
where Daily_freq=men+women=Freq_UK+Freq_SP How can I do this?
Upvotes: 3
Views: 55
Reputation: 66819
One way...
library(data.table)
setDT(df)
df[, gender := as.factor(gender)]
df[, location := as.factor(location)]
df[, c(
N = .N,
dcast(.SD, . ~ gender, fun.agg = length, drop=FALSE)[, !"."],
dcast(.SD, . ~ location, fun.agg = length, drop=FALSE)[, !"."]
), by=.(Day, hashtags)]
# Day hashtags N F M SP UK
# 1: Feb 19 2016 #a 3 1 2 2 1
# 2: Feb 19 2016 #b 3 1 2 2 1
# 3: Feb 20 2016 #a 1 1 0 0 1
Coding it this way might be easier to maintain: column names don't need to be assigned manually; locations and genders will show up or drop out of the result according to whether they appear in the raw data; and column names don't need to be typed out in multiple places (after the conversion to factor).
In case there are country codes that match gender codes, this way will produce duplicate columns. To get around that:
df[, c(
N = .N,
gender = dcast(.SD, . ~ gender, fun.agg = length, drop=FALSE)[, !"."],
loc = dcast(.SD, . ~ location, fun.agg = length, drop=FALSE)[, !"."]
), by=.(Day, hashtags)]
# Day hashtags N gender.F gender.M loc.SP loc.UK
# 1: Feb 19 2016 #a 3 1 2 2 1
# 2: Feb 19 2016 #b 3 1 2 2 1
# 3: Feb 20 2016 #a 1 1 0 0 1
Upvotes: 4
Reputation: 76402
Using package reshape2
.
library(reshape2)
molten <- melt(df, id.vars = c("Day", "hashtags"))
result <- dcast(molten, Day + hashtags ~ variable + value, length)
result$Daily_Freq <- rowSums(result[, c("location_SP", "location_UK")])
result
# Day hashtags location_SP location_UK gender_F gender_M Daily_Freq
#1 Feb 19 2016 #a 2 1 1 2 3
#2 Feb 19 2016 #b 2 1 1 2 3
#3 Feb 20 2016 #a 0 1 1 0 1
Note that the columns are not in the order of example output. It is simple to reorder them.
Upvotes: 3
Reputation: 83215
Using dplyr
:
library(dplyr)
df %>%
group_by(Day, hashtags) %>%
summarise(Daily_Freq = n(),
men = sum(gender == 'M'),
women = sum(gender == 'F'),
Freq_UK = sum(location == 'UK'),
Freq_SP = sum(location == 'SP'))
gives:
# A tibble: 3 x 7 # Groups: Day [?] Day hashtags Daily_Freq men women Freq_UK Freq_SP <chr> <chr> <int> <int> <int> <int> <int> 1 Feb 19 2016 #a 3 2 1 1 2 2 Feb 19 2016 #b 3 2 1 1 2 3 Feb 20 2016 #a 1 0 1 1 0
The same logic implemented in data.table
:
library(data.table)
setDT(df)[, .(Daily_Freq = .N,
men = sum(gender == 'M'),
women = sum(gender == 'F'),
Freq_UK = sum(location == 'UK'),
Freq_SP = sum(location == 'SP'))
, by = .(Day, hashtags)]
Upvotes: 6