Reputation: 896
I'm trying to filter a character vector created from pdf_ocr_text using multiple regular expressions. Specifically, I want to select elements that start with either (1) a digit or (2) two spaces followed by a digit. I also want to keep the leading spaces in the string. Here's a reproducible example.
df <- c(" 065074 10/1/91 10/1/96 8 10 5 ",
"060227 10/1/93 10/1/93 9 5 5 ",
" 060178 10/1/95 10/1/98 8 10 5 ", "060294 10/1/91 10/1/98 8 10 5 ",
"060212 10/1/91 10/1/93 8 10 5 ", " 060228 10/1/92 10/1/92 9 5 5 ",
" 060257 10/1/92 10/1/92 9 5 5 ",
"060348 10/1/91 10/1/93 8 10 5 ", " 080379 10/1/91 10/1/96 6 20 5 ",
" 060239 10/1/91 10/1/98 8 10 5 ", " 060012 10/1/92 10/1/92 9 5 5 ",
" 060360 10/1/96 10/1/96 9 5 5 ", " 060035 10/1/95 10/1/95 9 5 5 ",
" 060243 10/1/92 10/1/93 8 10 5 ", " 060262 10/1/92 ; 10/1/94 7 15 5 ",
" = = ", " 40097 2 4 40097 _"
)
I've tried the following but it doesn't seem to work. However, if I use only one of the two conditions, it works.
library(stringr) # provides str_detect() and re-exports %>%
df[df %>% str_detect(., "^\\s{2}\\d | ^\\d")] # This fails
df[df %>% str_detect(., "^\\d")] # With only one condition, it works
[1] "060227 10/1/93 10/1/93 9 5 5 " "060294 10/1/91 10/1/98 8 10 5 "
[3] "060212 10/1/91 10/1/93 8 10 5 " "060348 10/1/91 10/1/93 8 10 5 "
How can I combine two regular expressions in a single pattern?
Upvotes: 0
Views: 211
Reputation: 521249
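Try using grep here with the pattern ^(\\s{2})?\\d and value = TRUE, so that the matching elements are returned rather than their indices:
grep('^(\\s{2})?\\d', df, value = TRUE)
Here is an explanation of the regex pattern:
^ from the start of the string
(\s{2})? match the group of 2 whitespace characters, zero or one times (read: match two spaces, or no spaces)
\d match a single digit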
Upvotes: 1
Reputation: 26515
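Using your existing approach, drop the spaces surrounding the pipe character. Whitespace inside a pattern is matched literally, so the spaces become part of each alternative: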
df[df %>% str_detect("^\\s{2}\\d|^\\d")]
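To see why the spaces matter, here is a small check on a hypothetical vector x (made up for illustration, not the data from the question). With the padded pattern, the first alternative requires a space right after the first digit, and the second requires a space before the start of the string, which cannot happen.

```r
library(stringr)

# Hypothetical test vector: two leading spaces, none, one, and a non-digit line
x <- c("  12 a", "34 b", " 56 c", "  = =")

# Spaces around | are part of the pattern, so no element of x matches
padded <- str_detect(x, "^\\s{2}\\d | ^\\d")

# Without the spaces, each alternative is anchored at the start as intended
tight <- str_detect(x, "^\\s{2}\\d|^\\d")

x[padded]  # character(0)
x[tight]   # "  12 a" "34 b"
```

Note that the padded pattern is not guaranteed to fail on every input: a string such as "  1 a", where the first digit is followed by a space, would satisfy its first alternative.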
Upvotes: 2
Reputation: 388982
Using grep:
grep('^\\s{2}\\d|^\\d', df, value = TRUE)
# [1] " 065074 10/1/91 10/1/96 8 10 5 "
# [2] "060227 10/1/93 10/1/93 9 5 5 "
# [3] " 060178 10/1/95 10/1/98 8 10 5 "
# [4] "060294 10/1/91 10/1/98 8 10 5 "
# [5] "060212 10/1/91 10/1/93 8 10 5 "
# [6] " 060228 10/1/92 10/1/92 9 5 5 "
# [7] " 060257 10/1/92 10/1/92 9 5 5 "
# [8] "060348 10/1/91 10/1/93 8 10 5 "
# [9] " 080379 10/1/91 10/1/96 6 20 5 "
#[10] " 060239 10/1/91 10/1/98 8 10 5 "
#[11] " 060012 10/1/92 10/1/92 9 5 5 "
#[12] " 060360 10/1/96 10/1/96 9 5 5 "
#[13] " 060243 10/1/92 10/1/93 8 10 5 "
#[14] " 060262 10/1/92 ; 10/1/94 7 15 5 "
Or, if you prefer stringr, you can use str_subset with the same pattern:
stringr::str_subset(df, '^\\s{2}\\d|^\\d')
You can also combine the two patterns by making the two-character whitespace prefix optional:
grep('^(\\s{2})?\\d', df, value = TRUE)
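The parentheses in that last pattern matter. As a rough check on a hypothetical vector x (not the question's data), the ungrouped form \\s{2}? is read as a lazy "exactly two" quantifier rather than an optional group, so it only keeps strings that start with two whitespace characters. The second call uses perl = TRUE, an assumption made here so that the behaviour follows PCRE rules.

```r
# Hypothetical test vector: two leading spaces, none, and one
x <- c("  12 a", "34 b", " 56 c")

# Grouped: the two-character whitespace prefix is optional as a unit
grouped <- grep("^(\\s{2})?\\d", x, value = TRUE)

# Ungrouped: {2}? is a lazy quantifier and still demands exactly two characters
ungrouped <- grep("^\\s{2}?\\d", x, value = TRUE, perl = TRUE)

grouped    # "  12 a" "34 b"
ungrouped  # "  12 a"
```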
Upvotes: 1