Reputation: 489
I would like to get a count table from several values in my dataframe, which are spread over several columns. In my case, several of my subjects have one or more types of medication. Now they are mentioned in several columns. I would instead like to have each medication name in a separate column, and a 1 or 0 (a count) per subject if they have it or not.
For example, my dataframe now looks like:
Pp X X2 X3 X4
1 NA NA NA NA
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
6 NA NA NA NA
7 duloxetin escitalopram venlafaxin NA
I would like to it look like:
Pp trimipramin escitalopram quetiapin venlafaxin duloxetin …
1 0 0 0 0 0 …
2 1 0 0 0 0 …
3 0 1 1 1 0 …
4 0 0 0 0 1 …
5 0 1 1 0 0 …
6 0 0 0 0 0 …
7 0 1 0 1 1 …
Would it be possible to get such a count table added to my dataframe and not separately?
I assume dplyr's spread()
might be useful in this case, but I couldn't get it to work :-(
Any help?
Upvotes: 0
Views: 52
Reputation: 527
I do not know if it is the easiest solution, but it works. First you have to get all medications in a vector. I do that with unlist
and unqiue
. Then you have to loop through them and and get the number of cases. I do that with the apply
function and the sum of matches. As you want it in the same data.frame
you can store it with data[[name]]
. This solution works without any packages.
names = as.character(na.omit(unique(unlist(data))))
for(name in names){
data[[name]] = apply(data, FUN = function(x) { sum(x == name, na.rm = T) }, 1)
}
data = read.table(text = "X X2 X3 X4
NA NA NA NA
trimipramin NA NA NA
Quetiapin venlafaxin escitalopram NA
duloxetin amitriptylin NA NA
quetiapin citalopram escitalopram Lithium
NA NA NA NA
duloxetin escitalopram venlafaxin NA", header = T, stringsAsFactors = F)
Upvotes: 1
Reputation: 742
The table
function in base
is a simple method for creating this type of binary indicator matrix.
library(tidyverse)
D %>%
gather(key, drug, -Pp) %>%
select(-key) %>%
table
#>Pp amitriptylin citalopram duloxetin escitalopram Lithium ...
#> 1 0 0 0 0 0 ...
#> 2 0 0 0 0 0 ...
#> 3 0 0 0 1 0 ...
#> 4 1 0 1 0 0 ...
#> 5 0 1 0 1 1 ...
#> 6 0 0 0 0 0 ...
#> 7 0 0 1 1 0 ...
You can also do it using spread
instead, but this solution will remove patients without any associated drugs:
D %>%
gather(key, value, -Pp) %>%
filter(!is.na(value)) %>%
mutate(key=1) %>%
spread(value, key, fill=0, drop = FALSE)
Upvotes: 2
Reputation: 13125
library(dplyr)
library(tidyr)
library(reshape2)
data %>% gather(key = "Med", value = "Value", -Pp) %>% select(-Med) %>%
dcast(Pp~Value,fun.aggregate = function(x){as.integer(length(x) > 0)})
data <- read.table(text="
Pp X X2 X3 X4
1 NA NA NA NA
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
6 NA NA NA NA
7 duloxetin escitalopram venlafaxin NA",header=T, stringsAsFactors = F)
Upvotes: 1