Reputation: 65
I am continuing to work on some data cleaning practice with some animal shelter data. My goal here is to shrink down the number of breed categories.
I am using each breed category as a partial pattern match against the outgoing$Single.Breed
data frame column. So, there are cases where the breed will just be Chihuahua
, but it may also be Long Hair Chihuahua
. (Hence, my use of grepl
.) Thus, anything containing a breed category would be represented in a different column by said category. Furthermore, I also need to add the cat breed categories...making for an even messier bunch of code.
The code below is my "solution", but it's quite clunky. Is there a better, slicker and/or more efficient way to accomplish this?
BreedCategories <- ifelse(outgoing$New.Type == "Dog",
ifelse(grepl("Chihuahua",outgoing$Single.Breed, ignore.case = TRUE), "Chihuahua",
ifelse(grepl("Pit Bull",outgoing$Single.Breed, ignore.case = TRUE), "Pit Bull",
ifelse(grepl("Terrier",outgoing$Single.Breed, ignore.case = TRUE), "Terrier",
ifelse(grepl("Shepherd",outgoing$Single.Breed, ignore.case = TRUE), "Shepherd",
ifelse(grepl("Poodle",outgoing$Single.Breed, ignore.case = TRUE), "Poodle",
ifelse(grepl("Labrador|Retriever",outgoing$Single.Breed, ignore.case = TRUE),"Labrador",
"Other")))))),"Cat")
Upvotes: 3
Views: 722
Reputation: 46866
Create a data.frame that maps between the regular expression and what the breed is
map <- data.frame(
pattern=c(
"Chihuahua", "Pit Bull", "Terrier", "Shepherd",
"Poodle", "Labrador|Retriever", "Other"),
isa=c(
"Chihuahua", "Pit Bull", "Terrier", "Shepherd",
"Poodle", "Labrador", "Other"),
stringsAsFactors=FALSE)
and some data
outgoing <- data.frame(Single.Breed=c(map$isa, "Pit Bull Poodle", "Pug"),
stringsAsFactors=FALSE)
For the program, use vapply()
and grepl()
to match each pattern to the data; the use of grepl()
means that the result is a matrix, with rows corresponding to each entry
isa <- vapply(map$pattern, grepl, logical(nrow(outgoing)), outgoing$Single.Breed)
if (any(rowSums(isa) > 1))
warning("ambiguous breeds: ", outgoing$Single.Breed[rowSums(isa) != 1])
Use max.col()
to visit each row and retrieve the best (last) match (which happens to be 'Other' if there are no matches).
outgoing$BreedCategory <- map$isa[max.col(isa, "last")]
Here's the result
> isa <- vapply(map$pattern, grepl, logical(nrow(outgoing)), outgoing$Single.Breed)
> if (any(rowSums(isa) > 1))
+ warning("ambiguous breeds: ", outgoing$Single.Breed[rowSums(isa) != 1])
Warning message:
ambiguous breeds: Pit Bull Poodle
> outgoing$BreedCategory <- map$isa[max.col(isa, "last")]
> outgoing
Single.Breed BreedCategory
1 Chihuahua Chihuahua
2 Pit Bull Pit Bull
3 Terrier Terrier
4 Shepherd Shepherd
5 Poodle Poodle
6 Labrador Labrador
7 Other Other
8 Pit Bull Poodle Poodle
9 Pug Other
I guess the approach is appealing because it more clearly separates the 'data' (regex and input breeds) from the 'program' (grepl()
and max.col()
).
The handling of 'Other' seems a little fragile -- what if you forget that it is supposed to be the last element of map
? One possibility is to create an indicator variable that tests the row sums of isa, and uses this to conditionally assign the breed
test = rowSums(isa)
outgoing$BreedCategory[test == 0] = "Other"
outgoing$BreedCategory[test == 1] = map$isa[max.col(isa)][test == 1]
outgoing$BreedCategory[test > 1] = "Mixed"
The above is not very space efficient (the matrix transforms your length n data to an n x # of regex matrix), but seems likely to get the job done for say 1M input rows.
dplyr::case_when()
seems to require that you write many grepl()
statements, which is error prone.
Upvotes: 4