Reputation: 43
i have a question for which I thought I found a solution but if I double checked by hand I got numbers. I searched in other quotes but couldn't get exactly what I am looking for.
I have a dataframe with pharmaceutical agents. Each row is a subject and up to 20 columns store an agent each.Then I have a list of agents that can be clustered for one purpose. E.g. beta blockers. What I would like to do is iterate over each row to count if and how many e.g. beta blockers or statins a subject is taking.
I have tried with:
BETA = c("METOPROLOL", "BISOPROLOL", "NEBILET", "METOHEXAL", "SOTALEX",
"QUERTO", "NEBIVOLOL", "CARVEDILOL", "METOPROLOLSUCCINAT", "BELOC")
for (i in 1:202) {
dat$betablock[i] <- sum(str_count(meds[i,], BETA ))
}
I don't get a warning but it doesn't count the correct number of occurrences.
Here is some sample data:
Med1 Med2 Med3 Med4 Med5 Med6 Med7 Med8 Med9 Med10 Med11 Med12 Med13 Med14 Med15
1 AMLODIPIN RAMIPRIL METOPROLOL
2 PLAVIX SIMVASTATIN MIRTAZAPIN
3 BISOPROLOL AMLODIPIN ASS VALSARTAN CHLORALDURAT Doxozosin TAMSULOSIN CIPRAMIL
4 ASS ENALAPRIL L-THYROXIN LITALIR LITALIR AMLODIPIN CETIRIZIN HCT NACL CARMEN PROTEIN 88 NOVALGIN
5 ASS ATORVASTATIN FOSAMAX CALCIUM PANTOZOL NOVAMINSULFON
6 ASS FRAGMIN TORASEMID SPIRONOLACTON LORZAAR PROTECT VESIKUR ROCALTROL ATORVASTATIN PREDNISOLON LACTULOSE MIRTAZAPIN LANTUS ACTRAPID PANTOZOL SALBUTAMOL
Med16 Med17 Med18 Med19 Med20
1
2
3
4
5
6 AMPHO MORONAL
As you can see in the first row third column the string 'METOPROLOL' is listed. But when I call the result of my for loop for the first subject it results '0'.
> dat$betablock[1]
[1] 0
Any suggestions?
Upvotes: 0
Views: 265
Reputation: 42592
If I understand correctly, the OP has multiple lists of agents that can be clustered for one purpose not just one list of beta blockers. The OP mentions statins, e.g. The OP wants to count how many different agents belonging to each cluster are being taken by each subject. The counts for each agent cluster are to be appended to each row.
I suggest to compute the sums for all clusters at once rather than to do this manually list by list.
For this, we first need to set-up a data frame with the clustering:
cluster
Purpose Agent 1: BETA METOPROLOL 2: BETA BISOPROLOL 3: BETA NEBILET 4: BETA METOHEXAL 5: BETA SOTALEX 6: BETA QUERTO 7: BETA NEBIVOLOL 8: BETA CARVEDILOL 9: BETA METOPROLOLSUCCINAT 10: BETA BELOC 11: STATIN ATORVASTATIN 12: STATIN SIMVASTATIN 13: STATIN LOVASTATIN 14: STATIN PRAVASTATIN 15: STATIN FLUVASTATIN 16: STATIN PITAVASTIN
cluster
can be created, e.g., by
library(data.table)
library(magrittr)
cluster <- list(
BETA = c("METOPROLOL", "BISOPROLOL", "NEBILET", "METOHEXAL", "SOTALEX",
"QUERTO", "NEBIVOLOL", "CARVEDILOL", "METOPROLOLSUCCINAT", "BELOC"),
STATIN = c("ATORVASTATIN", "SIMVASTATIN", "LOVASTATIN", "PRAVASTATIN",
"FLUVASTATIN", "PITAVASTIN")
) %>%
lapply(data.table) %>%
rbindlist(idcol = "Purpose") %>%
setnames("V1", "Agent")
For counting the occurrences, we need to join or merge this table with the list of agents each subject is taking dat
after dat
has been reshaped from wide to long format.
While data in spreadsheet-style wide format, i.e., with one row per subject and many columns, are often suitable for data entry and inspection the database-style long format is often more suitable for data processing.
taken <- melt(setDT(dat)[, ID := .I], "ID", value.name = "Agent", na.rm = TRUE)[
Agent != ""][
, Agent := toupper(Agent)][]
ID variable Agent 1: 1 Med1 AMLODIPIN 2: 2 Med1 PLAVIX 3: 3 Med1 BISOPROLOL 4: 4 Med1 ASS 5: 5 Med1 ASS 6: 6 Med1 ASS 7: 1 Med2 RAMIPRIL 8: 2 Med2 SIMVASTATIN 9: 3 Med2 AMLODIPIN 10: 4 Med2 ENALAPRIL 11: 5 Med2 ATORVASTATIN 12: 6 Med2 FRAGMIN 13: 1 Med3 METOPROLOL 14: 2 Med3 MIRTAZAPIN 15: 3 Med3 ASS 16: 4 Med3 L-THYROXIN 17: 5 Med3 FOSAMAX 18: 6 Med3 TORASEMID 19: 3 Med4 VALSARTAN 20: 4 Med4 LITALIR 21: 5 Med4 CALCIUM 22: 6 Med4 SPIRONOLACTON 23: 3 Med5 CHLORALDURAT 24: 4 Med5 LITALIR 25: 5 Med5 PANTOZOL 26: 6 Med5 LORZAAR PROTECT 27: 3 Med6 DOXOZOSIN 28: 4 Med6 AMLODIPIN 29: 5 Med6 NOVAMINSULFON 30: 6 Med6 VESIKUR 31: 3 Med7 TAMSULOSIN 32: 4 Med7 CETIRIZIN 33: 6 Med7 ROCALTROL 34: 3 Med8 CIPRAMIL 35: 4 Med8 HCT 36: 6 Med8 ATORVASTATIN 37: 4 Med9 NACL 38: 6 Med9 PREDNISOLON 39: 4 Med10 CARMEN 40: 6 Med10 LACTULOSE 41: 4 Med11 PROTEIN 88 42: 6 Med11 MIRTAZAPIN 43: 4 Med12 NOVALGIN 44: 6 Med12 LANTUS 45: 6 Med13 ACTRAPID 46: 6 Med14 PANTOZOL 47: 6 Med15 SALBUTAMOL 48: 6 Med16 AMPHO MORONAL ID variable Agent
dat
is modified by appending a row number which identifies each subject, then it is reshaped to long format using melt()
. Missing or empty entries are removed and agent names are converted to uppercase for consistency.
Edit In long format it is also easy to check for duplicate agents per subject
taken[duplicated(taken, by = c("ID", "Agent"))]
ID variable Agent 1: 4 Med5 LITALIR
and remove the duplicates:
taken <- unique(taken, by = c("ID", "Agent"))
The final step creates what I believe is the expected result:
ID BETA STATIN Med1 Med2 Med3 Med4 Med5 Med6 Med7 Med8 1: 1 1 0 AMLODIPIN RAMIPRIL METOPROLOL 2: 2 0 1 PLAVIX SIMVASTATIN MIRTAZAPIN 3: 3 1 0 BISOPROLOL AMLODIPIN ASS VALSARTAN CHLORALDURAT Doxozosin TAMSULOSIN CIPRAMIL 4: 4 0 0 ASS ENALAPRIL L-THYROXIN LITALIR LITALIR AMLODIPIN CETIRIZIN HCT 5: 5 0 1 ASS ATORVASTATIN FOSAMAX CALCIUM PANTOZOL NOVAMINSULFON 6: 6 0 1 ASS FRAGMIN TORASEMID SPIRONOLACTON LORZAAR PROTECT VESIKUR ROCALTROL ATORVASTATIN
Pleae, note the additional columns with the counts by cluster (Due to limited space not all columns of the result are shown here). This is created by
cluster[taken, on = .(Agent)][
, dcast(.SD, ID ~ Purpose, length)][
dat, on = "ID"][
, "NA" := NULL][]
using the following operations:
cluster
and taken
to have Purpose
appendeddat
dat <- structure(list(Med1 = c("AMLODIPIN", "PLAVIX", "BISOPROLOL",
"ASS", "ASS", "ASS"), Med2 = c("RAMIPRIL", "SIMVASTATIN", "AMLODIPIN",
"ENALAPRIL", "ATORVASTATIN", "FRAGMIN"), Med3 = c("METOPROLOL",
"MIRTAZAPIN", "ASS", "L-THYROXIN", "FOSAMAX", "TORASEMID"), Med4 = c("",
"", "VALSARTAN", "LITALIR", "CALCIUM", "SPIRONOLACTON"), Med5 = c("",
"", "CHLORALDURAT", "LITALIR", "PANTOZOL", "LORZAAR PROTECT"),
Med6 = c("", "", "Doxozosin", "AMLODIPIN", "NOVAMINSULFON",
"VESIKUR"), Med7 = c("", "", "TAMSULOSIN", "CETIRIZIN", "",
"ROCALTROL"), Med8 = c("", "", "CIPRAMIL", "HCT", "", "ATORVASTATIN"
), Med9 = c("", "", "", "NACL", "", "PREDNISOLON"), Med10 = c("",
"", "", "CARMEN", "", "LACTULOSE"), Med11 = c("", "", "",
"PROTEIN 88", "", "MIRTAZAPIN"), Med12 = c("", "", "", "NOVALGIN",
"", "LANTUS"), Med13 = c("", "", "", "", "", "ACTRAPID"),
Med14 = c("", "", "", "", "", "PANTOZOL"), Med15 = c("",
"", "", "", "", "SALBUTAMOL"), Med16 = c("", "", "", "",
"", "AMPHO MORONAL")), class = "data.frame", row.names = c(NA,
-6L))
Upvotes: 1
Reputation: 46886
Suppose the data about pharamcueticals is in a data.frame agents
. Translate the elements of agents
into logical values, with TRUE
present when the element is in BETA
found = agents # copy agents...
found[] = unlist(agents) %in% BETA
Then the number of beta blockers in each row (subject) are
rowSums(found)
As a toy example, here are some agents and beta blockers
agents = data.frame(x = c("a", "b", "c"), y = c("b", "c", "d"))
beta = c("a", "b")
and the results of the computation
> agents
x y
1 a b
2 b c
3 c d
> found = agents
> found[] = unlist(agents) %in% beta
> found
x y
1 TRUE TRUE
2 TRUE FALSE
3 FALSE FALSE
> rowSums(found)
[1] 2 1 0
> agents$beta_blockers = rowSums(found)
> agents
x y beta_blockers
1 a b 2
2 b c 1
3 c d 0
This use of %in%
performs an exact match -- the terms in agents
are matched exactly to the terms in beta
. This seems to be what the problem calls for. The original code and other solutions aim for partial match, e.g., if a row had entries "a", "aa", and "ab" then the code would compute
> stringr::str_count(c("a", "aa", "ab"), "a")
[1] 1 2 1
with sum(.)
equal to 4, whereas there is only one occurrence of "a" so the correct answer is 1. Also, the code is susceptible to the default use of factors to represent columns in a data.frame, and apparently stringr
does not convert factors to character vectors before doing the comparison
> agents = data.frame(x = c("a", "b", "c"), y = c("b", "c", "d")
> stringr::str_count(agents[1,], c("a", "b"))
[1] 0 0
> agents = data.frame(x = c("a", "b", "c"), y = c("b", "c", "d"), stringsAsFactors=FALSE)
> stringr::str_count(agents[1,], c("a", "b"))
[1] 1 1
Upvotes: 1
Reputation: 6769
In the following code, you can calculate the sum of each beta blockers in BETA
and the sum of all beta blockers for each individual (row).
library(stringr)
df <- data.frame(
Med1 = c("AMLODIPIN", "PLAVIX", "BISOPROLOL", "ASS", "ASS", "ASS"),
Med2 = c("RAMIPRIL", "SIMVASTATIN", "AMLODIPIN","ENALAPRIL", "ATORVASTATIN", "FRAGMIN"),
Med3 = c("METOPROLOL", "MIRTAZAPIN", "ASS", "L-THYROXIN", "FOSAMAX", "TORASEMID"),
Med4 = c("BISOPROLOL", "CARVEDILOL", "VALSARTAN", "LITALIR", "CALCIUM", "SPIRONOLACTON")
)
BETA <- c("METOPROLOL", "BISOPROLOL", "NEBILET", "METOHEXAL", "SOTALEX",
"QUERTO", "NEBIVOLOL", "CARVEDILOL", "METOPROLOLSUCCINAT", "BELOC")
df$med_all <- paste(df$Med1, df$Med2, df$Med3, df$Med4, sep = ", ")
# get sum of each of the beta blockers in BETA
sapply(BETA, function(x) sum(str_count(df$med_all, c(x))))
#> METOPROLOL BISOPROLOL NEBILET METOHEXAL
#> 1 2 0 0
#> SOTALEX QUERTO NEBIVOLOL CARVEDILOL
#> 0 0 0 1
#> METOPROLOLSUCCINAT BELOC
#> 0 0
# get sum of all blockers for each individual
df1<-sapply(BETA, function(x) str_count(df$med_all, c(x)))
rowSums(df1)
#> [1] 2 1 1 0 0 0
Created on 2020-01-16 by the reprex package (v0.3.0)
Upvotes: 1