Bala Deshpande

Reputation: 165

lookbehind in str_extract with R

I have the following text file:

[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)
[01/29/14 16:42:57, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)
[01/29/14 16:43:00, 10.100.120.120, unknown]: spatial_monitor: Kurt entered Conference Room (Computer desk contains Person role)
[01/29/14 16:43:02, 10.100.120.120, unknown]: spatial_monitor: Kurt left Conference Room (Computer desk contains Person role)
[01/29/14 16:43:03, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)
[01/29/14 16:43:08, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)
[01/29/14 16:46:07, 10.100.120.120, unknown]: spatial_monitor: Fred entered Conference Room (Zone Role contains Person role)
[01/29/14 16:46:08, 10.100.120.120, unknown]: spatial_monitor: Fred left Conference Room (Zone Role contains Person role)

I am trying to use str_extract in R (from the stringr library) to extract the names of locations ("Conference Room" in the example above). The logic is to pull the portion of the string that follows the words "entered" or "left". To this end, I have the following regular expression:

(?<=entered\s)[A-Z][a-z]+\s[A-Z][a-z]+

This works fine in Notepad++; however, when I embed it in R, I get the following error:

> tt <- "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)"
> str_extract(tt, '(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+')
Error in regexpr("(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+", "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)",  : 
  invalid regular expression '(?<=entered\s)[A-Z][a-z]+\s[A-Z][a-z]+', reason 'Invalid regexp'

Other answers tell me that lookahead and lookbehind only work with Perl-style regular expressions. So the question is: how do I enable Perl with str_extract? Or is there a better way of doing this? Thanks in advance.

Upvotes: 1

Views: 2211

Answers (2)

lukeA

Reputation: 54237

library(stringr)
tt <- "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)"
str_extract(tt, perl('(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+'))
# [1] "Conference Room"

Update: As of stringr 1.3.0 (2018-02-19), perl() has been removed. You can now simply call str_extract(tt, '(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+') directly.
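
The question also mentions the word "left", which the pattern above does not cover. A minimal sketch of one way to handle both verbs across several lines, assuming a current stringr release and using str_match with a capture group instead of a lookbehind (the names log_lines, m and locations are made up for illustration):

```r
library(stringr)

# Two lines copied from the log in the question: one "entered", one "left"
log_lines <- c(
  "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)",
  "[01/29/14 16:42:57, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)"
)

# (?:entered|left) matches either verb without capturing it;
# the parenthesised part after \\s captures the two-word location
m <- str_match(log_lines, "(?:entered|left)\\s([A-Z][a-z]+\\s[A-Z][a-z]+)")

# Column 1 of the result is the full match, column 2 is the first capture group
locations <- m[, 2]
locations
# [1] "Conference Room" "Conference Room"
```

Lines that contain neither verb would yield NA in locations, which keeps the result aligned with the input.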

Upvotes: 4

Sven Hohenstein

Reputation: 81693

Your regex is valid. Base R's regex functions accept it if you specify perl = TRUE. For example, you can use the sub function for your task:

sub('.*(?<=entered\\s)([A-Z][a-z]+\\s[A-Z][a-z]+).*', '\\1', tt, perl = TRUE)
# [1] "Conference Room"

Alternatively, without perl:

sub('.*entered\\s([A-Z][a-z]+\\s[A-Z][a-z]+).*', '\\1', tt)
# [1] "Conference Room"
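
If you prefer to extract the match rather than substitute around it, base R's regexpr and regmatches can do that for a whole vector of lines. The following is only a sketch under the assumption that each line contains at most one "entered" or "left" event; log_lines, pattern, hits and locations are illustrative names:

```r
# Two lines copied from the log in the question: one "entered", one "left"
log_lines <- c(
  "[01/29/14 16:43:00, 10.100.120.120, unknown]: spatial_monitor: Kurt entered Conference Room (Computer desk contains Person role)",
  "[01/29/14 16:43:02, 10.100.120.120, unknown]: spatial_monitor: Kurt left Conference Room (Computer desk contains Person role)"
)

# Lookbehind with two alternatives, one per verb; this needs perl = TRUE
pattern <- "(?<=entered\\s|left\\s)[A-Z][a-z]+\\s[A-Z][a-z]+"

# regexpr finds the first match in each element; regmatches pulls out the text
hits <- regexpr(pattern, log_lines, perl = TRUE)
locations <- regmatches(log_lines, hits)
locations
# [1] "Conference Room" "Conference Room"
```

Note that regmatches drops elements with no match, so the result can be shorter than the input if some lines contain neither verb.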

Upvotes: 3
