Asha Collie

Reputation: 13

Regular Expression: numbers with specific amount of digits in specific order in R

I have a business.id column in a data frame called total_pop. It contains only numbers with between 1 and 4 digits. I'm trying to extract the rows whose business.id has exactly 4 digits AND ALSO begins with "13".

Sample Data:

sex   age    business.id
-------------------------
1     23     13
1     36     465
2     42     1309
1     19     1375
2     38     137

Desired Result:

sex   age    business.id
-------------------------
2     42     1309
1     19     1375

I've tried: grep("{4}^[1][3]",total_pop$business.id,value=T) but it returns numbers starting with 13 regardless of how many digits they have, so it also returns 136 and 13.

Upvotes: 1

Views: 508

Answers (3)

crestor

Reputation: 1486

library(tidyverse)

df <- tibble::tribble(
  ~sex, ~age, ~business.id,
  1L,  23L,          13L,
  1L,  36L,         465L,
  2L,  42L,        1309L,
  1L,  19L,        1375L,
  2L,  38L,         137L
)
# Keep ids that are exactly "13" followed by two more digits
df %>%
  filter(str_detect(business.id, "^13\\d{2}$"))
#> # A tibble: 2 x 3
#>     sex   age business.id
#>   <int> <int>       <int>
#> 1     2    42        1309
#> 2     1    19        1375
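
Whether the pattern is anchored with ^ and $ makes no difference on the sample data, because no id there has more than four digits. A minimal sketch of where it would matter, assuming a hypothetical five-digit id (21345, not in the question's data):

```r
# Hypothetical ids: the sample values plus a five-digit one that is not in the question
ids <- c(13L, 465L, 1309L, 1375L, 137L, 21345L)

# Unanchored: TRUE whenever "13" plus two digits occurs anywhere in the id
unanchored <- grepl("13\\d{2}", ids)

# Anchored: TRUE only when the whole id is "13" followed by exactly two digits
anchored <- grepl("^13\\d{2}$", ids)

ids[unanchored]  # also picks up 21345, via the "1345" inside it
ids[anchored]    # only 1309 and 1375
```

If the column can never hold more than four digits, both forms select the same rows; the anchored form just does not rely on that.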

Upvotes: 0

G. Grothendieck

Reputation: 270378

1) nchar counts the number of characters and substr extracts the first two characters.

subset(total_pop, nchar(business.id) == 4 & substr(business.id, 1, 2) == "13")
##   sex age business.id
## 3   2  42        1309
## 4   1  19        1375
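
To see how the two conditions combine, here is a small sketch that prints the intermediate values for the sample ids. The variable names (ids, chars, n_digits, prefix, keep) are only for illustration, and the conversion to character is written out explicitly:

```r
ids <- c(13L, 465L, 1309L, 1375L, 137L)  # business.id values from the question
chars <- as.character(ids)

n_digits <- nchar(chars)       # number of digits in each id: 2 3 4 4 3
prefix <- substr(chars, 1, 2)  # first two characters of each id

# A row is kept only when both conditions hold
keep <- n_digits == 4 & prefix == "13"
ids[keep]
```

Note that 13 and 137 also have the prefix "13"; it is the length check that excludes them.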

2) We can use a regular expression to grep out the values of interest. ^ matches the start of the business.id, .. match any two characters and $ matches the end.

subset(total_pop, grepl("^13..$", business.id))
##   sex age business.id
## 3   2  42        1309
## 4   1  19        1375

Note

The input in reproducible form:

total_pop <- structure(list(sex = c(1L, 1L, 2L, 1L, 2L), age = c(23L, 36L, 
42L, 19L, 38L), business.id = c(13L, 465L, 1309L, 1375L, 137L
)), class = "data.frame", row.names = c(NA, -5L))

Upvotes: 1

Tim Biegeleisen

Reputation: 522762

I would handle this numerically:

df[df$business.id >= 1000 & floor(df$business.id / 100) == 13, ]

  sex age business.id
3   2  42        1309
4   1  19        1375
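
The same numeric idea can be sketched with R's integer-division operator %/%. This is a variant for illustration, assuming the ids are non-negative whole numbers; an id lies between 1300 and 1399 exactly when integer division by 100 gives 13, so this form needs no separate four-digit check:

```r
ids <- c(13L, 465L, 1309L, 1375L, 137L)  # business.id values from the question

# Drop the last two digits: 13 -> 0, 465 -> 4, 1309 -> 13, 1375 -> 13, 137 -> 1
hundreds <- ids %/% 100L

keep <- hundreds == 13L
ids[keep]
```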

If you wanted to handle this using business.id as a string, then we could use grepl:

df[grepl("^13\\d{2}$", df$business.id), ]

Upvotes: 1
