Reputation: 1540
I have DF with these two columns: P_ID and PR_ID. There are rows with duplicates for each, some P_IDs have multiple PR_ID and the reverse is also true. Reproducible example:
DF <- data.frame(
'p_id' = sample(100:300, 100, replace=T),
'pr_id' = sample(100:300, 100, replace=T)
)
I would like a new column pr_count to DF added based on how many pr_id there is for each p_id.
In practice some of the p_ids don't have any pr_id. I want them to have 0 for their pr_count.
Desired output format:
p_id pr_id pr_count
---- ---- --------
1 2 0
Upvotes: 1
Views: 705
Reputation: 24623
Try:
'p_id' = sample(100:300, 10, replace=T)
'pr_id' = sample(100:300, 10, replace=T)
DF <- data.frame(p_id, pr_id)
outdf = expand.grid(p_id, pr_id)
outdf$count = ifelse(outdf$Var1==p_id & outdf$Var2==pr_id, 1, 0)
names(outdf)= c("p_id","pr_id","count")
head(outdf)
p_id pr_id count
1 295 193 1
2 112 193 0
3 283 193 0
4 163 193 0
5 202 193 0
6 135 193 0
Upvotes: 0
Reputation: 108603
Numeric id values
For this, ave()
is the workhorse you're looking for. Take a data frame with some p_id's that don't have a pr_id and vice versa :
DF <- data.frame(
'p_id' = sample(100:300, 100, replace=T),
'pr_id' = sample(100:300, 100, replace=T)
)
DF$pr_id[sample(1:100,10)] <- NA
DF$p_id[sample(1:100,10)] <- NA
Use ave()
to create the vector pr_count
DF <- within(DF,{
pr_count <- ave(pr_id,p_id,
FUN = function(x)length(na.omit(x))
)
pr_count[is.na(p_id)] <- 0
})
Note that you have to replace the values in pr_count
when p_id
is NA
. By default, ave()
will just copy the value of pr_id
wherever p_id
is NA
.
A General solution
You have to keep in mind that ave()
will create a vector that has the same type as the initial x
argument (in this case, pr_id
). So if you have character ids, you need to convert to numeric at the end as well:
DF <- data.frame(
'p_id' = sample(letters, 100, replace=T),
'pr_id' = sample(LETTERS, 100, replace=T),
stringsAsFactors = FALSE
)
DF$pr_id[sample(1:100,10)] <- NA
DF$p_id[sample(1:100,10)] <- NA
DF <- within(DF,{
pr_count <- ave(pr_id,p_id,
FUN = function(x)length(na.omit(x))
)
pr_count[is.na(p_id)] <- 0
pr_count <- as.numeric(pr_count)
})
Which gives again the desired result:
> str(DF)
'data.frame': 100 obs. of 3 variables:
$ p_id : chr "m" "f" "u" "n" ...
$ pr_id : chr "Y" "W" "R" "M" ...
$ pr_count: num 2 2 3 2 2 2 2 5 4 2 ...
warning This behaviour of ave()
also means that you'll get quite some trouble with factors. Before you use this, make sure that your ids are not stored as factors in your data frame !!!
Upvotes: 1