vengefulsealion

Reputation: 766

How can I rename the strings in a character vector using dplyr?

I have a data frame in which I would like to adjust a character vector before plotting it. My data frame, available here, has 140,000+ rows and approx. 40 labels, each denoting a location, in this case a local government area in Sydney. Currently each name in the 'LGA_NAME11' column is followed by (A) or (C), which denotes the type of local government area. I'm interested in removing the brackets and their contents.

I'm currently using ifelse statements to replace each current value with an amended one. To call it suboptimal would be an understatement: I've been writing a statement for each variation.

sydneyMapData <- sydneyMapData %>%
    mutate(LGA_NAME11 =
            ifelse(LGA_NAME11 == "Ashfield (A)", "Ashfield",
            ifelse(LGA_NAME11 == "Auburn (C)", "Auburn",
            ifelse(LGA_NAME11 == "Bankstown (C)", "Bankstown",
            1))))
            # etc., one nested ifelse() per LGA name

I'm also repeating this exercise on a larger dataset and R doesn't appear to like it when I have >50 ifelse statements.

I'm interested in finding a simpler dplyr solution (mainly because I love dplyr), and it would improve my workflow elsewhere. I can't help but think it should be possible. In the event that I'm wrong, I'd be open to any suggestions! Thanks in advance.

Upvotes: 1

Views: 3268

Answers (2)

C8H10N4O2

Reputation: 18995

If you want a dplyr solution, is mutate not the simplest one?

If you just want to get rid of the parentheses and everything in them:

sub("\\s*\\(.*\\)$","","Ashfield (A)") # returns "Ashfield"

If you want to keep as a separate variable the local govt type that's in the ()'s:

sub("^.*\\((.*)\\)$","\\1","Ashfield (A)")   # returns "A"

Thus

sydneyMapData %>% 
     mutate(local_govt_type = sub("^.*\\((.*)\\)$","\\1",LGA_NAME11),
            LGA_NAME11 = sub("\\s*\\(.*\\)$","", LGA_NAME11) ) -> sydneyMapData
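
One thing to watch with the second pattern: sub() returns its input unchanged when nothing matches, so a name without a trailing "(...)" would end up as its own "type". Below is a minimal base R sketch of a guard against that, assuming a hypothetical toy data frame (`toy`) and a made-up name "Other" with no suffix; the same expressions should drop straight into mutate().

```r
# Toy data (hypothetical); the real column is sydneyMapData$LGA_NAME11.
toy <- data.frame(LGA_NAME11 = c("Ashfield (A)", "Auburn (C)", "Other"),
                  stringsAsFactors = FALSE)

# TRUE where the name ends in a parenthesised suffix.
has_type <- grepl("\\(.*\\)$", toy$LGA_NAME11)

# sub() leaves non-matching strings as they are, so only extract the
# type where a suffix exists and use NA everywhere else.
toy$local_govt_type <- ifelse(has_type,
                              sub("^.*\\((.*)\\)$", "\\1", toy$LGA_NAME11),
                              NA_character_)

# Strip the optional whitespace plus "(...)" from the end of each name.
toy$LGA_NAME11 <- sub("\\s*\\(.*\\)$", "", toy$LGA_NAME11)

print(toy)
```

If every name in your data is known to carry a suffix, the guard is unnecessary and the two plain sub() calls are enough.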

Upvotes: 1

akrun

Reputation: 886938

You could use sub

v1 <- c("Ashfield (A)", "Auburn (C)", "Bankstown (C)")
sub(' \\([^)]+\\).*$', '', v1)
#[1] "Ashfield"  "Auburn"    "Bankstown"

Using your original dataset

dim(sydneyMapData)
#[1] 142459     13
system.time(sydneyMapData$LGA_NAME11 <- sub(' \\([^)]+\\).*$', '', 
             sydneyMapData$LGA_NAME11))
#  user  system elapsed 
# 0.087   0.000   0.088 
head(sydneyMapData,2)
#   LGA_NAME11 id     long       lat order  hole piece group STATE_CODE
#1 1   Ashfield  2 151.1212 -33.89556 85104 FALSE     1   2.1          1
#2 2   Ashfield  2 151.1211 -33.89556 85105 FALSE     1   2.1          1
#  LGA_CODE11  Factor1 Factor2
#1      10150 10-14.99 200-500
#2      10150 10-14.99 200-500

Using extract from tidyr

library(tidyr)
system.time(extract(sydneyMapData, LGA_NAME11, 
          into='LGA_NAME11', '([^\\( ]+) \\(.*\\)'))
#   user  system elapsed 
#  1.631   0.001   1.636 

Or

library(stringi)
system.time(stri_extract(sydneyMapData[,2], regex='^[^\\( ]+'))
#   user  system elapsed 
#  0.051   0.000   0.047 
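
A caveat on the patterns built around `[^\\( ]+`: they stop at the first space, so they are only safe if no name contains one. The sketch below compares the two behaviours in base R, assuming a hypothetical multi-word entry ("North Sydney (A)", which may or may not occur in the actual data); regmatches()/regexpr() stand in for stri_extract() here so that the example needs no packages.

```r
# Hypothetical names; "North Sydney (A)" stands in for any name with a space.
v2 <- c("Ashfield (A)", "North Sydney (A)")

# sub() approach: removes only the trailing " (X)" part,
# so spaces inside the name survive.
stripped <- sub(" \\([^)]+\\).*$", "", v2)

# Base R stand-in for stri_extract(v2, regex = '^[^\\( ]+'):
# keeps the leading run of characters up to the first space or "(".
first_token <- regmatches(v2, regexpr("^[^( ]+", v2))

print(stripped)
print(first_token)
```

If multi-word names do occur, the sub() version is probably the safer choice of the two.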

Update

Based on the data provided, the code below worked:

sydneyMapData$LGA_NAME11[c(3,8)] <- 'Other'
res <- extract(sydneyMapData, LGA_NAME11, 
                       into='LGA_NAME11', '([^\\( ]+)')
head(res$LGA_NAME11)
#[1] "Ashfield" "Ashfield" "Other"    "Ashfield" "Ashfield" "Ashfield"

data

sydneyMapData <- read.csv('mapData.csv', header=TRUE, 
             check.names=FALSE, stringsAsFactors=FALSE)

Upvotes: 4
