Reputation: 444
I have a large data frame, 'df'. The data is quite fuzzy: it contains misspellings, no constant pattern, etc. See the example:
structure(list(ABC = structure(c(1L, 3L, 4L, 6L, 8L, 9L, 5L,
11L, 2L, 7L, 10L), .Label = c("2-8-2010 14:42:00 (number not ok)",
"2-8-2010 18:42:00 (nuber is not oke)", "2-8-2010 18:42:00 (number is not ok)",
"2-9-2010 14:47:00 (? Not ok )", "23:59 missing &^%", "26-9-2010 23.24",
"26-9-2010 23.24 not (working)", "26-9-2010 23.28 note: shutdown number!)",
"26-9-2010 23.29 (missing brackets", "Im oke and working\n",
"number"), class = "factor")), .Names = "ABC", row.names = c(NA,
-11L), class = "data.frame")
Q) How do I recode a string variable based on a match with a target string?
In my case, how do I recode the variable 'ABC' when a string matches the words "not working" or "number is not ok" and, when there is a match, create a variable XYZ labelled 'present', etc.? I'm aiming for this:
structure(list(ABC = structure(c(2L, 4L, 5L, 7L, 9L, 10L, 6L,
1L, 12L, 3L, 8L, 11L), .Label = c("", "2-8-2010 14:42:00 (number not ok)",
"2-8-2010 18:42:00 (nuber is not oke)", "2-8-2010 18:42:00 (number is not ok)",
"2-9-2010 14:47:00 (? Not ok )", "23:59 missing &^%", "26-9-2010 23.24",
"26-9-2010 23.24 not (working)", "26-9-2010 23.28 note: shutdown number!)",
"26-9-2010 23.29 (missing brackets", "Im oke and working\tabsent\n",
"number"), class = "factor"), XYZ = structure(list(XYZ = structure(c(3L,
3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 3L, 1L), .Label = c("absent",
"missing", "present"), class = "factor")), .Names = "XYZ", class = "data.frame", row.names = c(NA,
-12L))), .Names = c("ABC", "XYZ"), row.names = c(NA, -12L), class = "data.frame")
I know there are some similar examples on Stack Overflow, but I could not get them working. I hope someone can push me in the right direction.
Thank you
Upvotes: 1
Views: 53
Reputation: 2240
A different solution without grep. You can add as many clauses as you want.
regexpr('string_to_look_for', variable) returns the position of the first match in each string, or -1 when there is no match. So if it evaluates to anything greater than zero, it found the string.
df$XYZ <- ifelse(regexpr('number is not ok',df$ABC)>0 |
regexpr('not working',df$ABC)>0 |
regexpr('not',df$ABC)>0,"present","absent")
ABC XYZ
1 2-8-2010 14:42:00 (number not ok) present
2 2-8-2010 18:42:00 (number is not ok) present
3 2-9-2010 14:47:00 (? Not ok ) absent
4 26-9-2010 23.24 absent
5 26-9-2010 23.28 note: shutdown number!) present
6 26-9-2010 23.29 (missing brackets absent
7 23:59 missing &^% absent
8 number absent
9 2-8-2010 18:42:00 (nuber is not oke) present
10 26-9-2010 23.24 not (working) present
11 Im oke and working\n absent
Notice that the last clause, looking for 'not', also matched "note" in row 5. If you know exactly which strings to look for, you can hard-code them. @mlegge's code is much more elegant, but harder to understand if you are a noob like me.
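One possible way to stop the short 'not' clause from firing on "note" is to wrap it in word boundaries. This is only a sketch, assuming base R's default regex engine (which supports \\b); the vector x is a hypothetical sample modelled on the ABC column, not the real data.

```r
# Hypothetical sample values, modelled on the ABC column above
x <- c("2-8-2010 14:42:00 (number not ok)",
       "26-9-2010 23.28 note: shutdown number!)",
       "26-9-2010 23.24 not (working)")

# Plain 'not' also matches the first three letters of "note"
plain <- regexpr("not", x) > 0

# \\b requires a word boundary on both sides, so "note" is skipped
bounded <- regexpr("\\bnot\\b", x) > 0

# Compare the two versions side by side
data.frame(x, plain, bounded)
```

With this sample, plain is TRUE for all three strings, while bounded is FALSE for the "note:" row only.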
Upvotes: 0
Reputation: 6913
> df$XYZ <- ifelse(grepl("not.*working|number.*[is]?.*not.*ok", df$ABC, ignore.case = TRUE), "present", "absent")
> df
ABC XYZ
1 2-8-2010 14:42:00 (number not ok) present
2 2-8-2010 18:42:00 (number is not ok) present
3 2-9-2010 14:47:00 (? Not ok ) absent
4 26-9-2010 23.24 absent
5 26-9-2010 23.28 note: shutdown number!) absent
6 26-9-2010 23.29 (missing brackets absent
7 23:59 missing &^% absent
8 number absent
9 2-8-2010 18:42:00 (nuber is not oke) absent
10 26-9-2010 23.24 not (working) present
11 Im oke and working\n absent
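To see what each half of the pattern does, it may help to run the alternation on a few strings. This is a rough sketch on a hypothetical vector x taken from the example rows. Note that [is]? is an optional character class (one "i" or one "s"), not the word "is", and because it sits between two .* it can be dropped without changing what matches; the simplified pattern below is my own assumption, not part of the answer's code.

```r
# Hypothetical sample values taken from the example rows
x <- c("2-8-2010 14:42:00 (number not ok)",
       "26-9-2010 23.24 not (working)",
       "2-8-2010 18:42:00 (nuber is not oke)",
       "26-9-2010 23.28 note: shutdown number!)")

# Pattern as used in the answer
original <- "not.*working|number.*[is]?.*not.*ok"

# Assumed simplification: the optional [is]? is already covered by .*
simplified <- "not.*working|number.*not.*ok"

res_original   <- grepl(original, x, ignore.case = TRUE)
res_simplified <- grepl(simplified, x, ignore.case = TRUE)

data.frame(x, res_original, res_simplified)
```

Both patterns give TRUE, TRUE, FALSE, FALSE here: the misspelled "nuber" never matches "number", and in the "note:" row "number" comes after "not", so neither alternative applies.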
Upvotes: 1