Reputation: 667
I have a column that contains multiple industry codes per record, separated by commas, each of varying length (from 2 to 6 digits). A record in my data frame looks something like:
naics <- c("5413, 541410, 11, 23611, 23, 611")
I want to be able to create a new array based on the number of characters in each unit. For example, here I'm extracting only four-digit numeric characters:
naics.four.digit <- unlist(str_extract_all(naics, "[0-9]{4}+"))
naics.four.digit
[1] "5413" "5414" "5414" "5416" "6117"
As you can see above, I used str_extract_all, and the method works well. However, it breaks down once I try to extract the three-digit and two-digit codes.
naics.three.digit <- unlist(str_extract_all(naics, "[0-9]{3}+"))
naics.three.digit
[1] "541" "541" "410" "236" "611"
The actual desired output here would be:
"541" "541" "236" "611"
Similarly, for the two-digit output, it should be:
"54" "54" "11" "23" "23" "61"
I assume the str_extract_all method breaks down here because each substring comes in varying lengths. Is there a workaround for this? Any help or guidance is appreciated.
Upvotes: 2
Views: 179
Reputation: 886998
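We can use a word boundary (\\b) followed by 3 digits (\\d{3}) as the pattern in str_extract_all. The boundary anchors each match to the start of a code, so only the leading 3 digits are taken (the "410" inside 541410 is no longer picked up), and codes with fewer than 3 digits are skipped.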
library(stringr)
str_extract_all(naics, "\\b\\d{3}")[[1]]
#[1] "541" "541" "236" "611"
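The same pattern should carry over to the two-digit and four-digit cases by changing the count. Below is a minimal sketch that builds the pattern from the number of digits. The helper name extract_prefix is made up for illustration, and it assumes the codes are separated by non-digit, non-word characters (commas and spaces, as in the sample record).

```r
library(stringr)

naics <- c("5413, 541410, 11, 23611, 23, 611")

# Hypothetical helper (not part of stringr): return the first n digits of
# every code that has at least n digits.
extract_prefix <- function(x, n) {
  # \\b anchors the match at the start of a code, \\d{n} takes n digits
  pattern <- paste0("\\b\\d{", n, "}")
  unlist(str_extract_all(x, pattern))
}

extract_prefix(naics, 3)
#[1] "541" "541" "236" "611"

extract_prefix(naics, 2)
#[1] "54" "54" "11" "23" "23" "61"
```

If you only want codes that are exactly n digits long, rather than the leading n digits of longer codes, adding a closing \\b to the pattern (for example "\\b\\d{3}\\b") should restrict the matches accordingly.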
Upvotes: 4