qnp1521

Reputation: 896

R stringr with multiple regexes

I'm trying to filter a character vector created from pdf_ocr_text using multiple regular expressions. Specifically, I want to select elements that start with either (1) a digit or (2) two spaces followed by a digit. I also want to keep the leading spaces in the strings. Here's a reproducible example.

df <- c("  065074                         10/1/91   10/1/96 8 10 5  ", 
"060227                          10/1/93   10/1/93 9 5 5  ", 
"  060178                  10/1/95   10/1/98 8 10 5  ", "060294                      10/1/91   10/1/98 8 10 5  ", 
"060212                 10/1/91   10/1/93 8 10 5   ", "  060228                   10/1/92   10/1/92 9 5 5  ", 
"  060257                        10/1/92   10/1/92 9 5 5   ", 
"060348                     10/1/91   10/1/93 8 10 5  ", "  080379                    10/1/91   10/1/96 6 20 5   ", 
"  060239                 10/1/91   10/1/98 8 10 5  ", "  060012                      10/1/92   10/1/92 9 5 5  ", 
"  060360                    10/1/96   10/1/96 9 5 5  ", "   060035                     10/1/95   10/1/95 9 5 5  ", 
"  060243                     10/1/92   10/1/93 8 10 5  ", "  060262                   10/1/92 ; 10/1/94 7 15 5  ", 
"            =          =          ", "                                    40097       2      4 40097 _"
)

I've tried the following, but the combined pattern doesn't work. If I use only one of the two conditions, it works.

library(stringr)  # provides str_detect() and re-exports %>%

df[df %>% str_detect(., "^\\s{2}\\d | ^\\d")] # This fails
df[df %>% str_detect(., "^\\d")] # With only one condition, it works
[1] "060227                          10/1/93   10/1/93 9 5 5  " "060294                      10/1/91   10/1/98 8 10 5  "   
[3] "060212                 10/1/91   10/1/93 8 10 5   "        "060348                     10/1/91   10/1/93 8 10 5  "    

How can I use two regex expressions as a pattern?

Upvotes: 0

Views: 211

Answers (3)

Tim Biegeleisen

Reputation: 521249

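Try using grep here with the pattern ^(\\s{2})?\\d:

grep('^(\\s{2})?\\d', df, value = TRUE)

Here is an explanation of the regex pattern:

^          from the start of the string
(\s{2})?   match two whitespace characters, zero or one times (read: match two spaces, or no spaces)
\d         match a single digit

The parentheses matter: without them, \s{2}? is a lazy quantifier that still requires exactly two whitespace characters. Passing value = TRUE returns the matching elements rather than their indices.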
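
As a quick check of how a `?` placed directly after `{2}` differs from a `?` placed after a group, here is a minimal sketch. The vector `x` is a made-up sample, not the asker's data, and `perl = TRUE` is an assumption chosen so that the behavior shown is PCRE's.

```r
# Hypothetical mini-sample: one string per leading-whitespace case.
x <- c("060227 a", "  065074 b", "   060035 c", "  = d")

# `{2}?` is the lazy form of `{2}`: it still needs exactly two
# whitespace characters before the digit.
lazy <- grepl("^\\s{2}?\\d", x, perl = TRUE)

# Grouping the pair and then adding `?` makes the whole pair optional.
optional <- grepl("^(\\s{2})?\\d", x, perl = TRUE)

print(data.frame(x, lazy, optional))
```

Only the grouped form accepts the string with no leading spaces, which is the behavior the question asks for.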

Upvotes: 1

jared_mamrot

Reputation: 26515

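Using your existing approach, drop the spaces surrounding the pipe character. Whitespace inside a regex is matched literally, so each alternative was also requiring a space: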

df[df %>% str_detect("^\\s{2}\\d|^\\d")]
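
To see why the spaces around the pipe matter, here is a small sketch in base R. The sample vector `x` is hypothetical, and `perl = TRUE` is an assumption used so the demo does not depend on stringr being installed.

```r
# Hypothetical sample; the last string has a space right after the digit.
x <- c("060227 a", "  065074 b", "  0 x")

# With spaces around `|`, the two alternatives are "^\\s{2}\\d " (note the
# trailing literal space) and " ^\\d" (a literal space BEFORE the start
# anchor, which can never match).
with_spaces <- grepl("^\\s{2}\\d | ^\\d", x, perl = TRUE)

# Without the spaces, the alternatives are the intended ones.
without_spaces <- grepl("^\\s{2}\\d|^\\d", x, perl = TRUE)

print(data.frame(x, with_spaces, without_spaces))
```

The spaced pattern only matches the last string, where two spaces and a digit happen to be followed by a space.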

Upvotes: 2

Ronak Shah

Reputation: 388982

Using grep:

grep('^\\s{2}\\d|^\\d', df, value = TRUE)

# [1] "  065074                         10/1/91   10/1/96 8 10 5  "
# [2] "060227                          10/1/93   10/1/93 9 5 5  "  
# [3] "  060178                  10/1/95   10/1/98 8 10 5  "       
# [4] "060294                      10/1/91   10/1/98 8 10 5  "     
# [5] "060212                 10/1/91   10/1/93 8 10 5   "         
# [6] "  060228                   10/1/92   10/1/92 9 5 5  "       
# [7] "  060257                        10/1/92   10/1/92 9 5 5   " 
# [8] "060348                     10/1/91   10/1/93 8 10 5  "      
# [9] "  080379                    10/1/91   10/1/96 6 20 5   "    
#[10] "  060239                 10/1/91   10/1/98 8 10 5  "        
#[11] "  060012                      10/1/92   10/1/92 9 5 5  "    
#[12] "  060360                    10/1/96   10/1/96 9 5 5  "      
#[13] "  060243                     10/1/92   10/1/93 8 10 5  "    
#[14] "  060262                   10/1/92 ; 10/1/94 7 15 5  "      

Or, if you prefer stringr, you can use str_subset with the same pattern:

stringr::str_subset(df, '^\\s{2}\\d|^\\d')

You can also combine the two patterns into one by making the two leading whitespace characters optional:

grep('^(\\s{2})?\\d', df, value = TRUE)
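
A short sanity check that the alternation and the optional-group pattern select the same elements might look like the sketch below. The vector `x` is a shortened, made-up stand-in for the question's data, including one string with three leading spaces and one with no digits at the start.

```r
# Hypothetical, shortened stand-in for the question's vector.
x <- c("  065074  10/1/91", "060227  10/1/93",
       "   060035  10/1/95", "      =     =  ")

# Two alternatives joined with `|` ...
alternation <- grep("^\\s{2}\\d|^\\d", x, value = TRUE)

# ... versus a single pattern with an optional two-character prefix.
optional_group <- grep("^(\\s{2})?\\d", x, value = TRUE)

# Both should keep the same elements, leading spaces included.
stopifnot(identical(alternation, optional_group))
print(alternation)
```

Neither pattern keeps the string with three leading spaces, which matches the 14-element output shown above.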

Upvotes: 1
