Ashish Bandhu

Reputation: 25

How to correct the output generated through str_detect/str_contains in R

I have a column "methods_discussed" in a CSV file (available at https://github.com/pandas-dev/pandas/files/3496001/multiple_responses.zip), which I read with:

multi<- read.csv("multiple_responses.csv", header = T)

This column holds the names of family planning methods, with values like:

methods_discussed

emergency female_sterilization male_sterilization iud NaN injectables male_condoms -77 male_condoms female_sterilization male_sterilization injectables iud male_condoms

I created a vector of the 8 family planning methods (everything except -77 and NaN):

method_names = c('female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization')

I want to create a new indicator variable in the existing data frame multi2 for each name in the vector method_names. For this I tried (I)

    for (abc in method_names) {
      multi2[abc] <- as.integer(str_detect(multi2$methods_discussed, fixed(abc)))
    }

(II)

    for (abc in method_names) {
      multi2[abc] <- as.integer(str_contains(abc, multi2$methods_discussed))
    }

(III) I also tried

    for (abc in method_names) {
      multi2[abc] <- as.integer(stri_detect_fixed(multi2$methods_discussed, abc))
    }

but the output does not match what I expect. Probably because male_sterilization is a substring of female_sterilization, I get 1 (TRUE) for male_sterilization in rows that contain only female_sterilization. This is visible in row 2 of the Actual Output: it must show 0 (FALSE) there, because row 2 of methods_discussed contains female_sterilization. I also don't want any 0/1 (FALSE/TRUE) value for rows where methods_discussed is -77 or blank; those cells should be blank. All of these cases are highlighted in the Expected Output.

Actual Output: [image]

Expected Output: [image]

There is no error in the code, only in the output.

Upvotes: 0

Views: 215

Answers (1)

Ronak Shah

Reputation: 389315

You can add word boundaries (\\b) around each name to fix that issue, so that male_sterilization no longer matches inside female_sterilization.
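A quick way to see the difference, sketched in base R on a made-up two-element vector (x is hypothetical and not taken from the CSV):

```r
# Hypothetical input, for illustration only
x <- c("female_sterilization", "male_sterilization iud")

# Literal matching: male_sterilization is also found inside female_sterilization,
# so both elements match
literal <- grepl("male_sterilization", x, fixed = TRUE)

# Word boundaries: the match must start and end at a word edge,
# so only the second element matches
bounded <- grepl("\\bmale_sterilization\\b", x)

print(literal)
print(bounded)
```

The underscore counts as a word character, so the boundary does not split a method name in the middle.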

multi <- read.csv("multiple_responses.csv", header = TRUE)
method_names <- c('female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization')

for (abc in method_names) {
  # \\b marks a word boundary on each side of the method name
  multi[abc] <- as.integer(grepl(paste0('\\b', abc, '\\b'), multi$methods_discussed))
}

# Blank out the indicators for rows where the response is empty or -77
multi[multi$methods_discussed %in% c('', '-77'), method_names] <- ''
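Note that assigning '' turns the indicator columns into character. If they should stay integer, one option is to use NA instead and produce the blanks only on export. A minimal sketch under that assumption, on a made-up data frame (toy is hypothetical and stands in for the CSV; only two of the method names are used):

```r
# Hypothetical stand-in for the CSV data
toy <- data.frame(
  methods_discussed = c("emergency", "female_sterilization",
                        "male_sterilization iud", "", "-77"),
  stringsAsFactors = FALSE
)
method_names <- c("male_sterilization", "female_sterilization")

for (abc in method_names) {
  # Word boundaries keep male_sterilization from matching inside female_sterilization
  toy[[abc]] <- as.integer(grepl(paste0("\\b", abc, "\\b"), toy$methods_discussed))
}

# Mark rows with no usable response as missing; the columns remain integer
skip <- toy$methods_discussed %in% c("", "-77")
toy[skip, method_names] <- NA_integer_

print(toy)
# On export, NA cells can be written as blanks:
# write.csv(toy, "out.csv", na = "", row.names = FALSE)
```

Keeping NA in the data frame means functions such as sum() or mean() can still be used on the indicators with na.rm = TRUE.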

Upvotes: 1
