Carlos Xavier Bonilla

Reputation: 667

Extracting every nth character from delimited array using regex

I have a column that contains multiple industry codes per record, separated by commas, with each code between 2 and 6 digits long. A record in my data frame looks something like:

naics <- c("5413, 541410, 11, 23611, 23, 611")

I want to create a new vector holding the first n digits of every code that has at least n digits. For example, here I'm extracting the four-digit prefixes:

library(stringr)
naics.four.digit <- unlist(str_extract_all(naics, "[0-9]{4}+"))
naics.four.digit
[1] "5413" "5414" "5414" "5416" "6117"

As you can see above, I used str_extract_all from stringr, and it works well for four digits. However, it breaks down once I try to extract the three-digit and two-digit codes.

naics.three.digit <- unlist(str_extract_all(naics, "[0-9]{3}+"))
naics.three.digit
[1] "541" "541" "410" "236" "611"

The actual desired output here would be:

"541" "541" "236" "611"

Similarly, for the two-digit output, it should be:

"54" "54" "11" "23" "23" "61"

I assume the str_extract_all method breaks down here because each substring comes in varying lengths. Is there a workaround for this? Any help or guidance is appreciated.

Upvotes: 2

Views: 179

Answers (1)

akrun

Reputation: 886998

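We can use a word boundary (\\b) followed by three digits (\\d{3}) as the pattern in str_extract_all. The boundary anchors each match at the start of a code, so only the first three digits are taken (the trailing "410" of "541410" is no longer matched), and codes with fewer than three digits are skipped.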

library(stringr)
str_extract_all(naics, "\\b\\d{3}")[[1]]
#[1] "541" "541" "236" "611"
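
The same idea should carry over to the two-digit and four-digit cases by changing only the repeat count. Below is a minimal sketch of that generalization. Note that extract_prefix is a hypothetical helper name, not something from stringr, and that the sketch swaps in base R (gregexpr with perl = TRUE plus regmatches) so it runs without extra packages. It assumes the codes contain only digits and are separated by commas and spaces, as in the sample record.

```r
# Hypothetical helper (the name is an assumption, not part of any package):
# returns the first n digits of every code that has at least n digits.
extract_prefix <- function(x, n) {
  # Build a pattern such as "\\b[0-9]{3}": a word boundary, then n digits.
  pattern <- sprintf("\\b[0-9]{%d}", n)
  # gregexpr finds all non-overlapping matches; regmatches pulls them out.
  # The result is a list with one character vector per element of x.
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

naics <- c("5413, 541410, 11, 23611, 23, 611")

extract_prefix(naics, 2)[[1]]
# [1] "54" "54" "11" "23" "23" "61"

extract_prefix(naics, 3)[[1]]
# [1] "541" "541" "236" "611"
```

Because the result is a list with one element per record, the helper can be applied to a whole column and each record keeps its own vector of prefixes. With stringr loaded, str_extract_all(x, pattern) could be used inside the helper instead of the regmatches line.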

Upvotes: 4
