Torsten
Torsten

Reputation: 43

How to count occurrences of several strings per row in a data frame in R

i have a question for which I thought I found a solution but if I double checked by hand I got numbers. I searched in other quotes but couldn't get exactly what I am looking for.

I have a dataframe with pharmaceutical agents. Each row is a subject and up to 20 columns store an agent each.Then I have a list of agents that can be clustered for one purpose. E.g. beta blockers. What I would like to do is iterate over each row to count if and how many e.g. beta blockers or statins a subject is taking.

I have tried with:

BETA = c("METOPROLOL", "BISOPROLOL", "NEBILET", "METOHEXAL", "SOTALEX",
         "QUERTO", "NEBIVOLOL", "CARVEDILOL", "METOPROLOLSUCCINAT", "BELOC")
for (i in 1:202) {
  dat$betablock[i] <- sum(str_count(meds[i,], BETA ))
} 

I don't get a warning but it doesn't count the correct number of occurrences.

Here is some sample data:

       Med1         Med2       Med3          Med4            Med5          Med6       Med7         Med8        Med9     Med10      Med11    Med12    Med13    Med14      Med15
1  AMLODIPIN     RAMIPRIL METOPROLOL                                                                                                                                           
2     PLAVIX  SIMVASTATIN MIRTAZAPIN                                                                                                                                           
3 BISOPROLOL    AMLODIPIN        ASS     VALSARTAN    CHLORALDURAT     Doxozosin TAMSULOSIN     CIPRAMIL                                                                       
4        ASS    ENALAPRIL L-THYROXIN       LITALIR         LITALIR     AMLODIPIN  CETIRIZIN          HCT        NACL    CARMEN PROTEIN 88 NOVALGIN                             
5        ASS ATORVASTATIN    FOSAMAX       CALCIUM        PANTOZOL NOVAMINSULFON                                                                                               
6        ASS      FRAGMIN  TORASEMID SPIRONOLACTON LORZAAR PROTECT       VESIKUR  ROCALTROL ATORVASTATIN PREDNISOLON LACTULOSE MIRTAZAPIN   LANTUS ACTRAPID PANTOZOL SALBUTAMOL
          Med16 Med17 Med18 Med19 Med20
1                                      
2                                      
3                                      
4                                      
5                                      
6 AMPHO MORONAL                        

As you can see in the first row third column the string 'METOPROLOL' is listed. But when I call the result of my for loop for the first subject it results '0'.

> dat$betablock[1]
[1] 0

Any suggestions?

Upvotes: 0

Views: 265

Answers (3)

Uwe
Uwe

Reputation: 42592

If I understand correctly, the OP has multiple lists of agents that can be clustered for one purpose not just one list of beta blockers. The OP mentions statins, e.g. The OP wants to count how many different agents belonging to each cluster are being taken by each subject. The counts for each agent cluster are to be appended to each row.

I suggest to compute the sums for all clusters at once rather than to do this manually list by list.

For this, we first need to set-up a data frame with the clustering:

cluster
    Purpose              Agent
 1:    BETA         METOPROLOL
 2:    BETA         BISOPROLOL
 3:    BETA            NEBILET
 4:    BETA          METOHEXAL
 5:    BETA            SOTALEX
 6:    BETA             QUERTO
 7:    BETA          NEBIVOLOL
 8:    BETA         CARVEDILOL
 9:    BETA METOPROLOLSUCCINAT
10:    BETA              BELOC
11:  STATIN       ATORVASTATIN
12:  STATIN        SIMVASTATIN
13:  STATIN         LOVASTATIN
14:  STATIN        PRAVASTATIN
15:  STATIN        FLUVASTATIN
16:  STATIN         PITAVASTIN

cluster can be created, e.g., by

library(data.table)
library(magrittr)
cluster <- list(
  BETA = c("METOPROLOL", "BISOPROLOL", "NEBILET", "METOHEXAL", "SOTALEX",
           "QUERTO", "NEBIVOLOL", "CARVEDILOL", "METOPROLOLSUCCINAT", "BELOC"),
  STATIN = c("ATORVASTATIN", "SIMVASTATIN", "LOVASTATIN", "PRAVASTATIN", 
           "FLUVASTATIN", "PITAVASTIN")
  ) %>% 
  lapply(data.table) %>% 
  rbindlist(idcol = "Purpose") %>% 
  setnames("V1", "Agent")

For counting the occurrences, we need to join or merge this table with the list of agents each subject is taking dat after dat has been reshaped from wide to long format.

While data in spreadsheet-style wide format, i.e., with one row per subject and many columns, are often suitable for data entry and inspection the database-style long format is often more suitable for data processing.

taken <- melt(setDT(dat)[, ID := .I], "ID", value.name = "Agent", na.rm = TRUE)[
  Agent != ""][
    , Agent := toupper(Agent)][]
    ID variable           Agent
 1:  1     Med1       AMLODIPIN
 2:  2     Med1          PLAVIX
 3:  3     Med1      BISOPROLOL
 4:  4     Med1             ASS
 5:  5     Med1             ASS
 6:  6     Med1             ASS
 7:  1     Med2        RAMIPRIL
 8:  2     Med2     SIMVASTATIN
 9:  3     Med2       AMLODIPIN
10:  4     Med2       ENALAPRIL
11:  5     Med2    ATORVASTATIN
12:  6     Med2         FRAGMIN
13:  1     Med3      METOPROLOL
14:  2     Med3      MIRTAZAPIN
15:  3     Med3             ASS
16:  4     Med3      L-THYROXIN
17:  5     Med3         FOSAMAX
18:  6     Med3       TORASEMID
19:  3     Med4       VALSARTAN
20:  4     Med4         LITALIR
21:  5     Med4         CALCIUM
22:  6     Med4   SPIRONOLACTON
23:  3     Med5    CHLORALDURAT
24:  4     Med5         LITALIR
25:  5     Med5        PANTOZOL
26:  6     Med5 LORZAAR PROTECT
27:  3     Med6       DOXOZOSIN
28:  4     Med6       AMLODIPIN
29:  5     Med6   NOVAMINSULFON
30:  6     Med6         VESIKUR
31:  3     Med7      TAMSULOSIN
32:  4     Med7       CETIRIZIN
33:  6     Med7       ROCALTROL
34:  3     Med8        CIPRAMIL
35:  4     Med8             HCT
36:  6     Med8    ATORVASTATIN
37:  4     Med9            NACL
38:  6     Med9     PREDNISOLON
39:  4    Med10          CARMEN
40:  6    Med10       LACTULOSE
41:  4    Med11      PROTEIN 88
42:  6    Med11      MIRTAZAPIN
43:  4    Med12        NOVALGIN
44:  6    Med12          LANTUS
45:  6    Med13        ACTRAPID
46:  6    Med14        PANTOZOL
47:  6    Med15      SALBUTAMOL
48:  6    Med16   AMPHO MORONAL
    ID variable           Agent

dat is modified by appending a row number which identifies each subject, then it is reshaped to long format using melt(). Missing or empty entries are removed and agent names are converted to uppercase for consistency.

Edit In long format it is also easy to check for duplicate agents per subject

taken[duplicated(taken, by = c("ID", "Agent"))]
   ID variable   Agent
1:  4     Med5 LITALIR

and remove the duplicates:

taken <- unique(taken, by = c("ID", "Agent"))

The final step creates what I believe is the expected result:

   ID BETA STATIN       Med1         Med2       Med3          Med4            Med5          Med6       Med7         Med8
1:  1    1      0  AMLODIPIN     RAMIPRIL METOPROLOL                                                                    
2:  2    0      1     PLAVIX  SIMVASTATIN MIRTAZAPIN                                                                    
3:  3    1      0 BISOPROLOL    AMLODIPIN        ASS     VALSARTAN    CHLORALDURAT     Doxozosin TAMSULOSIN     CIPRAMIL
4:  4    0      0        ASS    ENALAPRIL L-THYROXIN       LITALIR         LITALIR     AMLODIPIN  CETIRIZIN          HCT
5:  5    0      1        ASS ATORVASTATIN    FOSAMAX       CALCIUM        PANTOZOL NOVAMINSULFON                        
6:  6    0      1        ASS      FRAGMIN  TORASEMID SPIRONOLACTON LORZAAR PROTECT       VESIKUR  ROCALTROL ATORVASTATIN

Pleae, note the additional columns with the counts by cluster (Due to limited space not all columns of the result are shown here). This is created by

cluster[taken, on = .(Agent)][
  , dcast(.SD, ID ~ Purpose, length)][
    dat, on = "ID"][
      , "NA" := NULL][]

using the following operations:

  1. Join cluster and taken to have Purpose appended
  2. Reshape to wide format, one row per subject and one column per purpose, thereby counting the number of occurrences
  3. Join this result result with the original data dat
  4. Remove the superfluous column of NA counts

Data

dat <- structure(list(Med1 = c("AMLODIPIN", "PLAVIX", "BISOPROLOL", 
"ASS", "ASS", "ASS"), Med2 = c("RAMIPRIL", "SIMVASTATIN", "AMLODIPIN", 
"ENALAPRIL", "ATORVASTATIN", "FRAGMIN"), Med3 = c("METOPROLOL", 
"MIRTAZAPIN", "ASS", "L-THYROXIN", "FOSAMAX", "TORASEMID"), Med4 = c("", 
"", "VALSARTAN", "LITALIR", "CALCIUM", "SPIRONOLACTON"), Med5 = c("", 
"", "CHLORALDURAT", "LITALIR", "PANTOZOL", "LORZAAR PROTECT"), 
    Med6 = c("", "", "Doxozosin", "AMLODIPIN", "NOVAMINSULFON", 
    "VESIKUR"), Med7 = c("", "", "TAMSULOSIN", "CETIRIZIN", "", 
    "ROCALTROL"), Med8 = c("", "", "CIPRAMIL", "HCT", "", "ATORVASTATIN"
    ), Med9 = c("", "", "", "NACL", "", "PREDNISOLON"), Med10 = c("", 
    "", "", "CARMEN", "", "LACTULOSE"), Med11 = c("", "", "", 
    "PROTEIN 88", "", "MIRTAZAPIN"), Med12 = c("", "", "", "NOVALGIN", 
    "", "LANTUS"), Med13 = c("", "", "", "", "", "ACTRAPID"), 
    Med14 = c("", "", "", "", "", "PANTOZOL"), Med15 = c("", 
    "", "", "", "", "SALBUTAMOL"), Med16 = c("", "", "", "", 
    "", "AMPHO MORONAL")), class = "data.frame", row.names = c(NA, 
-6L))

Upvotes: 1

Martin Morgan
Martin Morgan

Reputation: 46886

Suppose the data about pharamcueticals is in a data.frame agents. Translate the elements of agents into logical values, with TRUE present when the element is in BETA

found = agents     # copy agents...
found[] = unlist(agents) %in% BETA

Then the number of beta blockers in each row (subject) are

rowSums(found)

As a toy example, here are some agents and beta blockers

agents = data.frame(x = c("a", "b", "c"), y = c("b", "c", "d"))
beta = c("a", "b")

and the results of the computation

> agents
  x y
1 a b
2 b c
3 c d
> found = agents
> found[] = unlist(agents) %in% beta
> found
      x     y
1  TRUE  TRUE
2  TRUE FALSE
3 FALSE FALSE
> rowSums(found)
[1] 2 1 0
> agents$beta_blockers = rowSums(found)
> agents
  x y beta_blockers
1 a b             2
2 b c             1
3 c d             0

This use of %in% performs an exact match -- the terms in agents are matched exactly to the terms in beta. This seems to be what the problem calls for. The original code and other solutions aim for partial match, e.g., if a row had entries "a", "aa", and "ab" then the code would compute

> stringr::str_count(c("a", "aa", "ab"), "a")
[1] 1 2 1

with sum(.) equal to 4, whereas there is only one occurrence of "a" so the correct answer is 1. Also, the code is susceptible to the default use of factors to represent columns in a data.frame, and apparently stringr does not convert factors to character vectors before doing the comparison

> agents = data.frame(x = c("a", "b", "c"), y = c("b", "c", "d")
> stringr::str_count(agents[1,], c("a", "b"))
[1] 0 0
> agents = data.frame(x = c("a", "b", "c"), y = c("b", "c", "d"), stringsAsFactors=FALSE)
> stringr::str_count(agents[1,], c("a", "b"))
[1] 1 1

Upvotes: 1

Zhiqiang Wang
Zhiqiang Wang

Reputation: 6769

EDIT:

In the following code, you can calculate the sum of each beta blockers in BETA and the sum of all beta blockers for each individual (row).

library(stringr)
df <- data.frame(
  Med1 = c("AMLODIPIN", "PLAVIX", "BISOPROLOL", "ASS", "ASS", "ASS"), 
  Med2 = c("RAMIPRIL", "SIMVASTATIN", "AMLODIPIN","ENALAPRIL", "ATORVASTATIN", "FRAGMIN"), 
  Med3 = c("METOPROLOL", "MIRTAZAPIN", "ASS", "L-THYROXIN", "FOSAMAX", "TORASEMID"), 
  Med4 = c("BISOPROLOL", "CARVEDILOL", "VALSARTAN", "LITALIR", "CALCIUM", "SPIRONOLACTON")
  ) 
BETA <- c("METOPROLOL", "BISOPROLOL", "NEBILET", "METOHEXAL", "SOTALEX",
         "QUERTO", "NEBIVOLOL", "CARVEDILOL", "METOPROLOLSUCCINAT", "BELOC")
df$med_all <- paste(df$Med1, df$Med2, df$Med3, df$Med4, sep = ", ")
# get sum of each of the beta blockers in BETA 
sapply(BETA, function(x) sum(str_count(df$med_all, c(x))))
#>         METOPROLOL         BISOPROLOL            NEBILET          METOHEXAL 
#>                  1                  2                  0                  0 
#>            SOTALEX             QUERTO          NEBIVOLOL         CARVEDILOL 
#>                  0                  0                  0                  1 
#> METOPROLOLSUCCINAT              BELOC 
#>                  0                  0
# get sum of all blockers for each individual 
df1<-sapply(BETA, function(x) str_count(df$med_all, c(x)))
rowSums(df1)
#> [1] 2 1 1 0 0 0

Created on 2020-01-16 by the reprex package (v0.3.0)

Upvotes: 1

Related Questions