Reputation: 766
I have a data frame in which I would like to adjust a character vector before plotting it. My data frame, available here, has 140,000+ rows and approx. 40 labels denoting a location, in this case a local government area in Sydney. Currently each name in the 'LGA_NAME11' column is followed by an (A) or (C), which denotes the type of local government area. I'd like to remove the bracketed suffix.
I'm currently using ifelse statements to replace each current value with an amended one. To call it suboptimal would be an understatement: I've been writing a statement for each variation.
sydneyMapData <- sydneyMapData %>%
  mutate(LGA_NAME11 =
    ifelse(LGA_NAME11 == "Ashfield (A)", "Ashfield",
    ifelse(LGA_NAME11 == "Auburn (C)", "Auburn",
    ifelse(LGA_NAME11 == "Bankstown (C)", "Bankstown",
    1))))
etc...
I'm also repeating this exercise on a larger dataset and R doesn't appear to like it when I have >50 ifelse statements.
I'm interested in trying to find a simpler dplyr solution (mainly because I love dplyr)... and it would improve my workflow elsewhere. I can't help but think it should be possible. In the possible event I am wrong, I'd be open to any suggestions! Thanks in advance.
Upvotes: 1
Views: 3268
Reputation: 18995
If you want a dplyr solution, is mutate not the simplest one?
If you just want to get rid of the ()'s and everything in them:
sub("\\s*\\(.*\\)$","","Ashfield (A)") # returns "Ashfield"
If you want to keep the local government type that's in the ()'s as a separate variable:
sub("^.*\\((.*)\\)$","\\1","Ashfield (A)") # returns "A"
Thus:
sydneyMapData %>%
  mutate(local_govt_type = sub("^.*\\((.*)\\)$", "\\1", LGA_NAME11),
         LGA_NAME11 = sub("\\s*\\(.*\\)$", "", LGA_NAME11)) -> sydneyMapData
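One thing to watch: when a name has no trailing brackets, the capture pattern does not match and sub returns the input unchanged, so the type column would just repeat the name. A minimal sketch of one way to guard against that, using a toy data frame (the rows, the names toy and toy_clean, and the "Other" value are assumptions for illustration, not the asker's dataset):

```r
library(dplyr)

# Toy data (hypothetical rows, not the asker's dataset)
toy <- data.frame(
  LGA_NAME11 = c("Ashfield (A)", "Canada Bay (A)", "Auburn (C)", "Other"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  mutate(
    # Capture the bracketed code, or NA when the name has no trailing "(...)"
    local_govt_type = ifelse(grepl("\\(.*\\)$", LGA_NAME11),
                             sub("^.*\\((.*)\\)$", "\\1", LGA_NAME11),
                             NA_character_),
    # Drop the trailing "(...)" and any whitespace in front of it
    LGA_NAME11 = sub("\\s*\\(.*\\)$", "", LGA_NAME11)
  )

toy_clean
```

mutate evaluates its arguments in order, so local_govt_type is computed from the original names before LGA_NAME11 is overwritten.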
Upvotes: 1
Reputation: 886938
You could use sub:
v1 <- c("Ashfield (A)", "Auburn (C)", "Bankstown (C)")
sub(' \\([^)]+\\).*$', '', v1)
#[1] "Ashfield" "Auburn" "Bankstown"
Using your original dataset:
dim(sydneyMapData)
#[1] 142459 13
system.time(sydneyMapData$LGA_NAME11 <- sub(' \\([^)]+\\).*$', '',
                                            sydneyMapData$LGA_NAME11))
#   user  system elapsed
#  0.087   0.000   0.088
head(sydneyMapData,2)
#    LGA_NAME11 id     long       lat order  hole piece group STATE_CODE
#1 1   Ashfield  2 151.1212 -33.89556 85104 FALSE     1   2.1          1
#2 2   Ashfield  2 151.1211 -33.89556 85105 FALSE     1   2.1          1
#  LGA_CODE11  Factor1 Factor2
#1      10150 10-14.99 200-500
#2      10150 10-14.99 200-500
Using extract from tidyr:
library(tidyr)
system.time(extract(sydneyMapData, LGA_NAME11,
                    into='LGA_NAME11', '([^\\( ]+) \\(.*\\)'))
#   user  system elapsed
#  1.631   0.001   1.636
Or, with stringi:
library(stringi)
system.time(stri_extract(sydneyMapData[,2], regex='^[^\\( ]+'))
#   user  system elapsed
#  0.051   0.000   0.047
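A caveat on patterns built around [^\\( ]+: the character class excludes spaces, so a multi-word name would be cut at its first space. A minimal base R sketch, assuming the data contains a name such as "Canada Bay (A)" (an assumed input, not one shown in the question; v2 is a made-up vector):

```r
v2 <- c("Ashfield (A)", "Canada Bay (A)")

# The class excludes spaces, so only the first word of each name survives
first_word <- sub("^([^\\( ]+).*$", "\\1", v2)

# Removing the trailing " (...)" instead keeps the whole name
full_name <- sub(" \\([^)]+\\).*$", "", v2)

first_word
full_name
```

If multi-word names are present, the sub approach that strips the bracketed suffix looks like the safer choice.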
Based on the data provided, the code below worked, including for values that have no brackets:
sydneyMapData$LGA_NAME11[c(3,8)] <- 'Other'
res <- extract(sydneyMapData, LGA_NAME11,
               into='LGA_NAME11', '([^\\( ]+)')
head(res$LGA_NAME11)
#[1] "Ashfield" "Ashfield" "Other" "Ashfield" "Ashfield" "Ashfield"
The data was read in with:
sydneyMapData <- read.csv('mapData.csv', header=TRUE,
                          check.names=FALSE, stringsAsFactors=FALSE)
Upvotes: 4