Reputation: 49
I have a dataframe like below:
> df
pat_id disease
[1,] "pat1" "dis1"
[2,] "pat1" "dis1"
[3,] "pat2" "dis0"
[4,] "pat2" "dis5"
[5,] "pat3" "dis2"
[6,] "pat3" "dis2"
How can I write a function that adds a third variable indicating whether the disease value is the same for all rows with the same pat_id, like below?
> df
pat_id disease var3
[1,] "pat1" "dis1" "1"
[2,] "pat1" "dis1" "1"
[3,] "pat2" "dis0" "0"
[4,] "pat2" "dis5" "0"
[5,] "pat3" "dis2" "1"
[6,] "pat3" "dis2" "1"
Upvotes: 3
Views: 83
Reputation: 887078
One option, using dplyr:
library(dplyr)
as.data.frame(df) %>%
group_by(pat_id) %>%
mutate(var3 = as.integer(n_distinct(disease)==1))
# pat_id disease var3
# (chr) (chr) (int)
#1 pat1 dis1 1
#2 pat1 dis1 1
#3 pat2 dis0 0
#4 pat2 dis5 0
#5 pat3 dis2 1
#6 pat3 dis2 1
Upvotes: 1
Reputation: 99331
Try ave() for the groupings, and wrap the result from any(duplicated()) with as.integer(). Then bind with cbind(). That said, I would recommend using a data frame instead of a matrix here.
cbind(
df,
var3 = ave(df[,2], df[,1], FUN = function(x) as.integer(any(duplicated(x))))
)
# pat_id disease var3
# [1,] "pat1" "dis1" "1"
# [2,] "pat1" "dis1" "1"
# [3,] "pat2" "dis0" "0"
# [4,] "pat2" "dis5" "0"
# [5,] "pat3" "dis2" "1"
# [6,] "pat3" "dis2" "1"
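One caveat with any(duplicated(x)): it returns 1 as soon as any value repeats within a patient, so a group such as dis1, dis1, dis2 would still be flagged as "the same". Below is a minimal sketch of a stricter check, assuming the goal is that every disease value within a patient is identical. The helper name all_same and the three-row patient pat4 are made up for illustration and are not part of the question's data.

```r
# Hypothetical helper: 1 if every value in the group is identical, else 0
all_same <- function(x) as.integer(length(unique(x)) == 1)

# Character matrix like the one in the question, plus a made-up patient (pat4)
# whose three rows contain one repeated and one different disease
m <- cbind(
  pat_id  = c("pat1", "pat1", "pat2", "pat2", "pat4", "pat4", "pat4"),
  disease = c("dis1", "dis1", "dis0", "dis5", "dis1", "dis1", "dis2")
)

# Original approach: flags a group when any value repeats
loose  <- ave(m[, "disease"], m[, "pat_id"],
              FUN = function(x) as.integer(any(duplicated(x))))

# Stricter approach: flags a group only when all values are identical
strict <- ave(m[, "disease"], m[, "pat_id"], FUN = all_same)

cbind(m, loose, strict)
```

With exactly two rows per patient, as in the question, both versions agree. They differ only for groups like pat4.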
For larger data, I would recommend converting to a data table. The syntax is actually a bit nicer too, and it will likely be faster.
library(data.table)
dt <- as.data.table(df)
dt[, var3 := if (any(duplicated(disease))) 1L else 0L, by = pat_id]
which gives
pat_id disease var3
1: pat1 dis1 1
2: pat1 dis1 1
3: pat2 dis0 0
4: pat2 dis5 0
5: pat3 dis2 1
6: pat3 dis2 1
where column classes will be more appropriate (char, char, int). Or you could use as.integer(any(duplicated(disease))) instead of if/else.
Upvotes: 6
Reputation: 8215
Slightly long-winded, but this gives you a logical third variable, which is easier to test. It also doesn't care about data types.
> df <- data.frame(pat_id=c("pat1","pat1", "pat2", "pat2", "pat3", "pat3"),
+ disease=c("dis1","dis1","dis0","dis5","dis2","dis2"),
+ stringsAsFactors = FALSE)
> counts<-apply(table(df), 1, function(x) sum(x!=0))
> df2<-data.frame(pat_id=names(counts), all_the_same=(counts==1))
> df3<-merge(df,df2)
> df3
pat_id disease all_the_same
1 pat1 dis1 TRUE
2 pat1 dis1 TRUE
3 pat2 dis0 FALSE
4 pat2 dis5 FALSE
5 pat3 dis2 TRUE
6 pat3 dis2 TRUE
> sapply(df3, class)
pat_id disease all_the_same
"character" "character" "logical"
This doesn't care how many of each combination you have, and it should leave your strings as strings, not factors.
Having the new column as a logical makes queries easier, such as finding all patients for which it is true:
> unique(df3$pat_id[df3$all_the_same])
[1] "pat1" "pat3"
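If you want to skip the table() and merge() steps, the same logical column can probably be built with tapply() and a lookup by name. This is only a sketch, assuming pat_id is a character column (stringsAsFactors = FALSE, as above); the intermediate name same is made up for illustration. Unlike merge(), which sorts by the key by default, this keeps the original row order.

```r
df <- data.frame(pat_id  = c("pat1", "pat1", "pat2", "pat2", "pat3", "pat3"),
                 disease = c("dis1", "dis1", "dis0", "dis5", "dis2", "dis2"),
                 stringsAsFactors = FALSE)

# One logical per patient: TRUE when the patient has exactly one distinct disease
same <- tapply(df$disease, df$pat_id, function(x) length(unique(x)) == 1)

# Look up each row's patient by name; as.vector() drops the names and dim
df$all_the_same <- as.vector(same[df$pat_id])

# Same query as above: patients whose disease values are all identical
unique(df$pat_id[df$all_the_same])
```

The lookup relies on pat_id being character. With a factor column, same[df$pat_id] would index by the integer codes instead of the names, so convert with as.character() first.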
Upvotes: 1