Adam

Reputation: 444

Create new variable by matching strings

I have a large data frame, ‘df’. The data are messy: they contain misspellings, no consistent pattern, and so on. See the example:

structure(list(ABC = structure(c(1L, 3L, 4L, 6L, 8L, 9L, 5L, 
11L, 2L, 7L, 10L), .Label = c("2-8-2010  14:42:00 (number not ok)", 
"2-8-2010  18:42:00 (nuber is not oke)", "2-8-2010  18:42:00 (number is not ok)", 
"2-9-2010  14:47:00 (? Not ok )", "23:59 missing &^%", "26-9-2010 23.24", 
"26-9-2010 23.24 not (working)", "26-9-2010 23.28 note: shutdown number!)", 
"26-9-2010 23.29 (missing brackets", "Im oke and working\n", 
"number"), class = "factor")), .Names = "ABC", row.names = c(NA, 
-11L), class = "data.frame")

Q) How can I recode a string variable based on a match with a target string?

In my case, how do I recode the variable ‘ABC’ when a string matches the words “not working” or “number is not ok” and, when there is a match, create a variable XYZ labeled ‘present’, etc.? I’m aiming for this:

structure(list(ABC = structure(c(2L, 4L, 5L, 7L, 9L, 10L, 6L, 
1L, 12L, 3L, 8L, 11L), .Label = c("", "2-8-2010  14:42:00 (number not ok)", 
"2-8-2010  18:42:00 (nuber is not oke)", "2-8-2010  18:42:00 (number is not ok)", 
"2-9-2010  14:47:00 (? Not ok )", "23:59 missing &^%", "26-9-2010 23.24", 
"26-9-2010 23.24 not (working)", "26-9-2010 23.28 note: shutdown number!)", 
"26-9-2010 23.29 (missing brackets", "Im oke and working\tabsent\n", 
"number"), class = "factor"), XYZ = structure(list(XYZ = structure(c(3L, 
3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 3L, 1L), .Label = c("absent", 
"missing", "present"), class = "factor")), .Names = "XYZ", class = "data.frame", row.names = c(NA, 
-12L))), .Names = c("ABC", "XYZ"), row.names = c(NA, -12L), class = "data.frame")

I know there are similar examples on Stack Overflow, but I could not get them to work. I hope someone can point me in the right direction.

Thank you

Upvotes: 1

Views: 53

Answers (2)

akaDrHouse

Reputation: 2240

A different solution without grep. You can add as many clauses as you want.

regexpr('string_to_look_for', variable) returns the position of the first match in each string, or -1 if there is no match. So if it evaluates to anything greater than zero, it found the string.

df$XYZ <- ifelse(regexpr('number is not ok',df$ABC)>0 |
                     regexpr('not working',df$ABC)>0 |
                     regexpr('not',df$ABC)>0,"present","absent")

                                       ABC     XYZ
1       2-8-2010  14:42:00 (number not ok) present
2    2-8-2010  18:42:00 (number is not ok) present
3           2-9-2010  14:47:00 (? Not ok )  absent
4                          26-9-2010 23.24  absent
5  26-9-2010 23.28 note: shutdown number!) present
6        26-9-2010 23.29 (missing brackets  absent
7                        23:59 missing &^%  absent
8                                   number  absent
9    2-8-2010  18:42:00 (nuber is not oke) present
10           26-9-2010 23.24 not (working) present
11                    Im oke and working\n  absent

Notice that the last clause, looking for 'not', also matched "note" in row 5. If you know exactly which strings to look for, you can hard-code them. @mlegge's code is much more elegant, but harder to understand if you are a noob like me.
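
One possible way to avoid the accidental "note" match is to wrap the word in word boundaries. The sketch below uses a small made-up vector `x` (not the original `df`) and also shows what regexpr() returns when nothing matches:

```r
# Hypothetical mini sample; 'x' is not part of the original data.
x <- c("number not ok", "note: shutdown number!)", "Not ok")

# regexpr() gives the 1-based position of the first match, or -1 if none.
# as.integer() drops the match.length attributes so only positions remain.
pos_plain <- as.integer(regexpr("not", x))

# \\b marks a word boundary, so "note" no longer counts as "not".
pos_word <- as.integer(regexpr("\\bnot\\b", x))

# Matching is case sensitive, so "Not ok" stays "absent" in both variants.
ifelse(pos_word > 0, "present", "absent")
```

With the plain pattern the second element matches at position 1 (inside "note"); with the word-boundary pattern it does not match at all.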

Upvotes: 0

mlegge

Reputation: 6913

> df$XYZ <- ifelse(grepl("not.*working|number.*[is]?.*not.*ok", df$ABC, ignore.case = TRUE), "present", "absent")
> df
                                       ABC     XYZ
1       2-8-2010  14:42:00 (number not ok) present
2    2-8-2010  18:42:00 (number is not ok) present
3           2-9-2010  14:47:00 (? Not ok )  absent
4                          26-9-2010 23.24  absent
5  26-9-2010 23.28 note: shutdown number!)  absent
6        26-9-2010 23.29 (missing brackets  absent
7                        23:59 missing &^%  absent
8                                   number  absent
9    2-8-2010  18:42:00 (nuber is not oke)  absent
10           26-9-2010 23.24 not (working) present
11                    Im oke and working\n  absent
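
The desired output in the question also has a third label, ‘missing’, which appears to go with the empty ABC entry. A minimal sketch of how that could be layered on top of grepl(), assuming "missing" means an empty or NA string (that reading is a guess from the question's example, and the small data frame below is a made-up sample, not the full `df`):

```r
# Small hypothetical sample modelled on the question's data.
df <- data.frame(
  ABC = c("2-8-2010  18:42:00 (number is not ok)",
          "26-9-2010 23.24 not (working)",
          "",
          "26-9-2010 23.24"),
  stringsAsFactors = FALSE
)

# ".*" already covers the optional "[is]?" part, so it is left out here.
pattern <- "not.*working|number.*not.*ok"
hit <- grepl(pattern, df$ABC, ignore.case = TRUE)

# Assumption: an empty (or NA) entry is what the question calls "missing".
empty <- is.na(df$ABC) | !nzchar(trimws(df$ABC))

# Check for "missing" first, then fall back to present/absent.
df$XYZ <- ifelse(empty, "missing", ifelse(hit, "present", "absent"))
df
```

The nested ifelse() tests the emptiness condition first, so an empty row is never labeled "absent" just because the pattern failed to match.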

Upvotes: 1
