kimbekaw
kimbekaw

Reputation: 65

R: Improving Nested ifelse Statements & Multiple Patterns

I am continuing to work on some data cleaning practice with some animal shelter data. My goal here is to shrink down the number of breed categories.

I am using each breed category as a partial pattern match against the outgoing$Single.Breed data frame column. So, there are cases where the breed will just be Chihuahua, but it may also be Long Hair Chihuahua. (Hence, my use of grepl.) Thus, anything containing a breed category would be represented in a different column by said category. Furthermore, I also need to add the cat breed categories...making for an even messier bunch of code.

The code below is my "solution", but it's quite clunky. Is there a better, slicker and/or more efficient way to accomplish this?

BreedCategories <- ifelse(outgoing$New.Type == "Dog",
           ifelse(grepl("Chihuahua",outgoing$Single.Breed, ignore.case = TRUE), "Chihuahua",
           ifelse(grepl("Pit Bull",outgoing$Single.Breed, ignore.case = TRUE), "Pit Bull",
           ifelse(grepl("Terrier",outgoing$Single.Breed, ignore.case = TRUE), "Terrier",
           ifelse(grepl("Shepherd",outgoing$Single.Breed, ignore.case = TRUE), "Shepherd",
           ifelse(grepl("Poodle",outgoing$Single.Breed, ignore.case = TRUE), "Poodle",
           ifelse(grepl("Labrador|Retriever",outgoing$Single.Breed, ignore.case = TRUE),"Labrador",
           "Other")))))),"Cat")

Upvotes: 3

Views: 722

Answers (1)

Martin Morgan
Martin Morgan

Reputation: 46866

Create a data.frame that maps between the regular expression and what the breed is

map <- data.frame(
    pattern=c(
        "Chihuahua", "Pit Bull", "Terrier", "Shepherd",
        "Poodle", "Labrador|Retriever", "Other"),
    isa=c(
        "Chihuahua", "Pit Bull", "Terrier", "Shepherd",
        "Poodle", "Labrador", "Other"),
    stringsAsFactors=FALSE)

and some data

outgoing <- data.frame(Single.Breed=c(map$isa, "Pit Bull Poodle", "Pug"),
                       stringsAsFactors=FALSE)

For the program, use vapply() and grepl() to match each pattern to the data; the use of grepl() means that the result is a matrix, with rows corresponding to each entry

isa <- vapply(map$pattern, grepl, logical(nrow(outgoing)), outgoing$Single.Breed)
if (any(rowSums(isa) > 1))
    warning("ambiguous breeds: ", outgoing$Single.Breed[rowSums(isa) != 1])

Use max.col() to visit each row and retrieve the best (last) match (which happens to be 'Other' if there are no matches).

outgoing$BreedCategory <- map$isa[max.col(isa, "last")]

Here's the result

> isa <- vapply(map$pattern, grepl, logical(nrow(outgoing)), outgoing$Single.Breed)
> if (any(rowSums(isa) > 1))
+     warning("ambiguous breeds: ", outgoing$Single.Breed[rowSums(isa) != 1])
Warning message:
ambiguous breeds: Pit Bull Poodle 
> outgoing$BreedCategory <- map$isa[max.col(isa, "last")]
> outgoing
     Single.Breed BreedCategory
1       Chihuahua     Chihuahua
2        Pit Bull      Pit Bull
3         Terrier       Terrier
4        Shepherd      Shepherd
5          Poodle        Poodle
6        Labrador      Labrador
7           Other         Other
8 Pit Bull Poodle        Poodle
9             Pug         Other

I guess the approach is appealing because it more clearly separates the 'data' (regex and input breeds) from the 'program' (grepl() and max.col()).

The handling of 'Other' seems a little fragile -- what if you forget that it is supposed to be the last element of map? One possibility is to create an indicator variable that tests the row sums of isa, and uses this to conditionally assign the breed

test = rowSums(isa)
outgoing$BreedCategory[test == 0] = "Other"
outgoing$BreedCategory[test == 1] = map$isa[max.col(isa)][test == 1]
outgoing$BreedCategory[test > 1] = "Mixed"

The above is not very space efficient (the matrix transforms your length n data to an n x # of regex matrix), but seems likely to get the job done for say 1M input rows.

dplyr::case_when() seems to require that you write many grepl() statements, which is error prone.

Upvotes: 4

Related Questions