Reputation: 91
I have two vectors. First vector name is comments$author_id
and second is enrolments$learner_id
. I want to add new column into enrolmens
dataframe that shows count of repeated rows in comments$author_id
vector for each enrolment$learner_id
row.
Example:
if(enrolments$learner_id[1] repeated 5 times in comments$author_id)
enrolments$freqs[1] = 5
Can I do this don't using any loops?
The vector samples are as follows:
df1 <- data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 <- data.frame(v2 = c(1,2,3,4,5,6))
I want to add "counts" column to "df2" that shows counts of repeated v2 element in v1.
Upvotes: 1
Views: 101
Reputation: 54237
"[tabulate] gives me this error: Error in $<-.data.frame(tmp, "comments_count", value = c(0L, 0L, : replacement has 25596 rows, data has 25597"
That is prly because there is one value at the end of df2$v2
, which are not part of df1$v1
- I add 0
and 7
to your example to show that:
df1 <- data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 <- data.frame(v2 = c(1,2,3,0,4,5,6,7))
df2$count <- tabulate(factor(df1$v1, df2$v2))
# Error in `$<-.data.frame`(`*tmp*`, count, value = c(7L, 5L, 3L, 0L, 5L, :
# replacement has 7 rows, data has 8
To correct that using tabulate
, which might be the fastest solution on larger data:
df2$count <- tabulate(factor(df1$v1, df2$v2), length(df2$v2))
df2
# v2 count
# 1 1 7
# 2 2 5
# 3 3 3
# 4 0 0
# 5 4 5
# 6 5 6
# 7 6 2
# 8 7 0
See ?tabulate
for the documentation on that function.
Upvotes: 1
Reputation: 93803
What you essentially are doing is aggregating the df1
to get a count, and then adding this count back to the df2
set. This logic can be easily translated to a bunch of different methods:
# base R
merge(
df2,
aggregate(cbind(df1[0], count=1), df1["v1"], FUN=sum),
by.x="v2", by.y="v1", all.x=TRUE
)
# data.table
library(data.table)
setDT(df1)
setDT(df2)
df2[df1[, .(count=.N), by=v1], on=c("v2"="v1")]
# dplyr
library(dplyr)
df1 %>%
group_by(v1) %>%
count() %>%
left_join(df2, ., by=c("v2"="v1"))
# v2 count
#1 1 7
#2 2 5
#3 3 3
#4 4 5
#5 5 6
#6 6 2
Upvotes: 1
Reputation: 1042
Using your df1
and df2
example, you could do it like this:
# Make data
df1 = data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 = data.frame(v2 = c(1,2,3,4,5,6))
# Add 'count' variable as reqeuested
df2$counts = sapply(df2$v2, function(x) {
sum(df1$v1 == x, na.rm = T) #na.rm=T just in case df1$v1 has missing values
})
df2 #view output
Upvotes: 1