Sharif Amlani

Reputation: 1278

How to split a string after the nth character in R

I am working with the following data:

District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")

I want to split each string after the second character and put the two parts into separate columns.

The result should look like this:

state  district
AR        01
AZ        03
AZ        05
AZ        08
CA        01
CA        05
CA        11
CA        16
CA        18
CA        21

Is there a simple way to do this? Thanks so much for your help.

Upvotes: 5

Views: 9507

Answers (6)

J_F

Reputation: 10352

With the tidyverse this is very easy using the function separate from tidyr:

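library(tidyverse)
District %>% 
  # as.tibble() is deprecated; as_tibble() puts the vector in a column named "value"
  as_tibble() %>% 
  # split at the zero-width position right after two capital letters
  separate(value, c("state", "district"), sep = "(?<=[A-Z]{2})")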

# A tibble: 10 × 2
   state district
   <chr> <chr>   
 1 AR    01      
 2 AZ    03      
 3 AZ    05      
 4 AZ    08      
 5 CA    01      
 6 CA    05      
 7 CA    11      
 8 CA    16      
 9 CA    18      
10 CA    21      

Upvotes: 1

zx8754

Reputation: 56054

Treat it as a fixed-width file and import it:

# read fixed width file
read.fwf(textConnection(District), widths = c(2, 2), colClasses = "character")
#    V1 V2
# 1  AR 01
# 2  AZ 03
# 3  AZ 05
# 4  AZ 08
# 5  CA 01
# 6  CA 05
# 7  CA 11
# 8  CA 16
# 9  CA 18
# 10 CA 21

Upvotes: 0

Uwe

Reputation: 42544

The OP has written

I'm more familiar with strsplit(). But since there is nothing to split on, its not applicable in this case

Au contraire! There is something to split on and it's called lookbehind:

strsplit(District, "(?<=[A-Z]{2})", perl = TRUE) 

The lookbehind works like "inserting an invisible break" after 2 capital letters and splits the strings there.

The result is a list of character vectors:

[[1]]
[1] "AR" "01"

[[2]]
[1] "AZ" "03"

[[3]]
[1] "AZ" "05"

[[4]]
[1] "AZ" "08"

[[5]]
[1] "CA" "01"

[[6]]
[1] "CA" "05"

[[7]]
[1] "CA" "11"

[[8]]
[1] "CA" "16"

[[9]]
[1] "CA" "18"

[[10]]
[1] "CA" "21"

which can be turned into a matrix, e.g., by

do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))
      [,1] [,2]
 [1,] "AR" "01"
 [2,] "AZ" "03"
 [3,] "AZ" "05"
 [4,] "AZ" "08"
 [5,] "CA" "01"
 [6,] "CA" "05"
 [7,] "CA" "11"
 [8,] "CA" "16"
 [9,] "CA" "18"
[10,] "CA" "21"
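If the two named columns from the question are needed rather than a bare matrix, the matrix can be converted to a data frame. A minimal sketch in base R; the variable names parts and result are only illustrative:

```r
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01",
              "CA05", "CA11", "CA16", "CA18", "CA21")

# Split after the two leading capital letters, then stack the pieces row-wise
parts <- do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))

# Convert the character matrix to a data frame and name the columns
result <- as.data.frame(parts, stringsAsFactors = FALSE)
names(result) <- c("state", "district")
```

Note that this relies on every string yielding exactly two pieces; rbind would recycle values if some strings split differently.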

Upvotes: 5

Ronak Shah

Reputation: 388862

We can use str_match to capture first two characters and the remaining string in separate columns.

stringr::str_match(District, "(..)(.*)")[, -1]

#      [,1] [,2]
# [1,] "AR" "01"
# [2,] "AZ" "03"
# [3,] "AZ" "05"
# [4,] "AZ" "08"
# [5,] "CA" "01"
# [6,] "CA" "05"
# [7,] "CA" "11"
# [8,] "CA" "16"
# [9,] "CA" "18"
#[10,] "CA" "21"

Upvotes: 1

Onyambu

Reputation: 79208

You could use strcapture from base R:

strcapture("(\\w{2})(\\w{2})", District,
           data.frame(state = character(), District = character()))
   state District
1     AR       01
2     AZ       03
3     AZ       05
4     AZ       08
5     CA       01
6     CA       05
7     CA       11
8     CA       16
9     CA       18
10    CA       21

where \\w{2} matches exactly two word characters (letters, digits, or underscores)
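Because \\w also accepts digits and underscores, a stricter pattern may be preferable when the input is not guaranteed to be clean. A sketch, assuming every valid code is two capital letters followed by two digits; the "XX1" entry is a made-up value added only to show what happens on a mismatch:

```r
District <- c("AR01", "AZ03", "CA21", "XX1")

# Anchored pattern: exactly two capital letters, then exactly two digits
res <- strcapture("^([A-Z]{2})([0-9]{2})$", District,
                  proto = data.frame(state = character(),
                                     district = character(),
                                     stringsAsFactors = FALSE))
```

Entries that do not match the pattern come back as NA in both columns instead of being split at the wrong place.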

Upvotes: 5

Mike

Reputation: 4370

You can use substr if you always want to split after the second character.

District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
# state: characters 1 through 2
state <- substr(District, 1, 2)
# district: characters 3 through 4
district <- substr(District, 3, 4)
# combine into a data frame if needed
st_dt <- data.frame(state = state, district = district, stringsAsFactors = FALSE)
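The same idea generalizes to the "nth character" in the question title. A minimal sketch; split_after is a hypothetical helper name, not a base R function, and it uses substring so the right-hand part can be of any length:

```r
# Hypothetical helper: split each string after its n-th character
split_after <- function(x, n) {
  data.frame(
    left  = substr(x, 1, n),       # characters 1 through n
    right = substring(x, n + 1),   # everything from character n + 1 onward
    stringsAsFactors = FALSE
  )
}

District <- c("AR01", "AZ03", "AZ05")
st_dt <- setNames(split_after(District, 2), c("state", "district"))
```

Strings shorter than n end up entirely in the left column, with an empty string on the right.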

Upvotes: 9
