Nima
Nima

Reputation: 1540

Add new column for count of repeated group values

I have DF with these two columns: P_ID and PR_ID. There are rows with duplicates for each, some P_IDs have multiple PR_ID and the reverse is also true. Reproducible example:

DF <- data.frame(
    'p_id' = sample(100:300, 100, replace=T),
    'pr_id' = sample(100:300, 100, replace=T)
)

I would like a new column pr_count to DF added based on how many pr_id there is for each p_id.

In practice some of the p_ids don't have any pr_id. I want them to have 0 for their pr_count.

Desired output format:

p_id  pr_id     pr_count
----  ----      --------
1     2             0

Upvotes: 1

Views: 705

Answers (2)

rnso
rnso

Reputation: 24623

Try:

'p_id' = sample(100:300, 10, replace=T)
'pr_id' = sample(100:300, 10, replace=T)

DF <- data.frame(p_id, pr_id)
outdf = expand.grid(p_id, pr_id)
outdf$count = ifelse(outdf$Var1==p_id & outdf$Var2==pr_id, 1, 0)
names(outdf)= c("p_id","pr_id","count")

head(outdf)
    p_id pr_id count
1    295   193     1
2    112   193     0
3    283   193     0
4    163   193     0
5    202   193     0
6    135   193     0

Upvotes: 0

Joris Meys
Joris Meys

Reputation: 108603

Numeric id values

For this, ave() is the workhorse you're looking for. Take a data frame with some p_id's that don't have a pr_id and vice versa :

DF <- data.frame(
  'p_id' = sample(100:300, 100, replace=T),
  'pr_id' = sample(100:300, 100, replace=T)
)

DF$pr_id[sample(1:100,10)] <- NA
DF$p_id[sample(1:100,10)] <- NA

Use ave() to create the vector pr_count

DF <- within(DF,{
  pr_count <- ave(pr_id,p_id,
                  FUN = function(x)length(na.omit(x))
                  )
  pr_count[is.na(p_id)] <- 0  
})

Note that you have to replace the values in pr_count when p_id is NA. By default, ave() will just copy the value of pr_id wherever p_id is NA.

A General solution

You have to keep in mind that ave() will create a vector that has the same type as the initial x argument (in this case, pr_id). So if you have character ids, you need to convert to numeric at the end as well:

DF <- data.frame(
  'p_id' = sample(letters, 100, replace=T),
  'pr_id' = sample(LETTERS, 100, replace=T),
  stringsAsFactors = FALSE
)

DF$pr_id[sample(1:100,10)] <- NA
DF$p_id[sample(1:100,10)] <- NA

DF <- within(DF,{
  pr_count <- ave(pr_id,p_id,
                  FUN = function(x)length(na.omit(x))
                  )
  pr_count[is.na(p_id)] <- 0
  pr_count <- as.numeric(pr_count)

})

Which gives again the desired result:

> str(DF)
'data.frame':   100 obs. of  3 variables:
 $ p_id    : chr  "m" "f" "u" "n" ...
 $ pr_id   : chr  "Y" "W" "R" "M" ...
 $ pr_count: num  2 2 3 2 2 2 2 5 4 2 ...

warning This behaviour of ave() also means that you'll get quite some trouble with factors. Before you use this, make sure that your ids are not stored as factors in your data frame !!!

Upvotes: 1

Related Questions