DiamondJoe12
DiamondJoe12

Reputation: 1809

filtering rows that only contain certain values among multiple columns in R

I have the following data, where a single row (an arrest event) can have up to 5 charges associated with it.

#statutes 1, 2, and 3 are marijuana related
#statute 4 is paraphernalia related.

caseID <- c("1", "2", "3", "4", "5", "6", "7", "8")
date <- c("2017-01-01", "2017-01-12", "2018-03-23", "2019-10-12", "2018-11-22", "2018-01-01", "2017-02-01", "2017-02-20")
charge1 <- c("Statute4", "Statute12", "Statute1", "Statute3", "Statute3", "Statute158", "Statute2", "Statute1")
charge2 <- c(NA, "Statute1", "Statute3", "Statute44", "Statute4", "Statute4", NA, "Statute4")
charge3 <- c(NA, "Statute12", NA, "Statute4", NA, NA, NA, "Statute19")
charge4 <- c(NA, "Statute6", NA, NA, NA, NA, NA, NA)
charge5 <- c(NA, "Statute8", NA, NA, NA, NA, NA, NA)

df <- data.frame(caseID, date, charge1, charge2, charge3, charge4, charge5)

I want to filter out any rows where the following conditions are true:

  1. a record contains ONLY marijuana charges (which are statute1, statute2, and statute3) OR
  2. a record contains ONLY marijuana charges (statute1, statute2, and statute3) and paraphernalia charges (statute4).

So by this logic, the following cases would be excluded/excluded:

    CaseID 1 excluded (paraphernalia only)
    CaseID 2 not excluded
    CaseID 3 excluded (marijuana only)
    CaseID 4 not excluded
    CaseID 5 excluded (marijuana and paraphernalia only)
    CaseID 6 not excluded
    CaseID 7 excluded (marijuana only)
    CaseID 8 not excluded

There are many many statutes in my real example, so I need to figure out a way to do this where I find rows that contain only a very specific set of statutes.

Upvotes: 1

Views: 80

Answers (1)

langtang
langtang

Reputation: 24722

This might work for you

library(data.table)

# set to data.table
setDT(df)

# melt long
df_long = melt(df, id.vars=c("caseID", "date"))[!is.na(value)]

#count all charges
total_charges = df_long[,.(totalcharges =.N), by=caseID]

# return the subset of the original wide dataset, with target
# rows excluded
df[df_long[value %chin% c("Statute1","Statute2","Statute3","Statute4")] %>% 
     .[, .N, caseID] %>% 
     .[total_charges, on=.(caseID)] %>% 
     .[N!=totalcharges,.(caseID)], on="caseID"]

Output

   caseID       date    charge1   charge2   charge3  charge4  charge5
1:      2 2017-01-12  Statute12  Statute1 Statute12 Statute6 Statute8
2:      4 2019-10-12   Statute3 Statute44  Statute4     <NA>     <NA>
3:      6 2018-01-01 Statute158  Statute4      <NA>     <NA>     <NA>
4:      8 2017-02-20   Statute1  Statute4 Statute19     <NA>     <NA>

Upvotes: 1

Related Questions