user3707934

Reputation: 13

regex in R to extract value between two strings

I have lines that look like this

 01:04:43.064 [12439] <2> xyz
 01:04:43.067 [12439] <2> a lmn
 01:04:43.068 [12439] <4> j klm
 x_times_wait to <3000>
 01:04:43.068 [12439] <4> j klm
 enter_object <5000> main k

I want a regex that extracts only the value after the angle brackets, and only for lines that start with a timestamp.

This is what I have tried, assuming that these lines are in a data frame called nn:

 library(stringr)  # provides str_split_fixed()
 split <- str_split_fixed(nn[,1], ">", 2)
 split2 <- data.frame(split[,2])

The problem is that split2 gives

   xyz
   a lmn
   j klm

   j klm
   main k

How can I make sure that the empty line and main k are not returned?

Upvotes: 0

Views: 558

Answers (4)

Jim

Reputation: 4767

Using rex may make this type of task a little simpler.

string <- "01:04:43.064 [12439] <2> xyz
01:04:43.067 [12439] <2> a lmn
01:04:43.068 [12439] <4> j klm
x_times_wait to <3000>
01:04:43.068 [12439] <4> j klm
enter_object <5000> main k"

library(rex)

timestamp <- rex(n(digit, 2), ":", n(digit, 2), ":", n(digit, 2), ".", n(digit, 3))

re <- rex(timestamp, space,
          "[", digits, "]", space,
          "<", digits, ">", space,
          capture(anything))

re_matches(string, re, global = TRUE)

#> [[1]]
#>       1
#> 1   xyz
#> 2 a lmn
#> 3 j klm
#> 4 j klm

Upvotes: 0

Rich Scriven

Reputation: 99331

If a timestamp is defined as 1 or more digits followed by a :, followed by 1 or more digits and another : and then 1 or more digits, then perhaps this method would work for you.

x <- c("01:04:43.064 [12439] <2> xyz", "01:04:43.067 [12439] <2> a lmn",   
       "01:04:43.068 [12439] <4> j klm", "x_times_wait to <3000>",  
       "01:04:43.068 [12439] <4> j klm", "enter_object <5000> main k")

sub(".*> ", "", x[grepl("\\d+:\\d+:\\d+", x)])
# [1] "xyz"   "a lmn" "j klm" "j klm"

This first drops the elements that do not contain a timestamp, then strips everything up to and including the last "> " from the remaining elements.
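The same filter-then-strip idea can be applied directly to a data frame column, which is how the question stores the lines. The following is a minimal sketch rather than a definitive solution: the data frame `nn` is rebuilt here as a hypothetical stand-in, the column name `line` is invented for illustration, and the timestamp is assumed to have the exact layout HH:MM:SS.mmm at the very start of the line with no leading whitespace.

```r
# Hypothetical stand-in for the asker's data frame `nn`
nn <- data.frame(
  line = c("01:04:43.064 [12439] <2> xyz",
           "01:04:43.067 [12439] <2> a lmn",
           "01:04:43.068 [12439] <4> j klm",
           "x_times_wait to <3000>",
           "01:04:43.068 [12439] <4> j klm",
           "enter_object <5000> main k"),
  stringsAsFactors = FALSE
)

# Assumed timestamp layout: HH:MM:SS.mmm anchored at the start of the line
is_stamped <- grepl("^\\d{2}:\\d{2}:\\d{2}\\.\\d{3}", nn[, 1])

# On the stamped lines only, drop everything up to and including
# the first ">" plus any whitespace that follows it
values <- sub("^[^>]*>\\s*", "", nn[is_stamped, 1])
values
```

Anchoring the pattern with `^` keeps a line from qualifying merely because it contains something timestamp-like further along.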

Upvotes: 2

Sven Hohenstein

Reputation: 81693

Here's an approach in base R:

The regex:

^(\\d{2}:){2}\\d{2}\\.\\d{3}.*>\\s*\\K.+

You can use it with gregexpr:

unlist(regmatches(vec, gregexpr("^(\\d{2}:){2}\\d{2}\\.\\d{3}.*>\\s*\\K.+", 
                                vec, perl = TRUE)))
# [1] "xyz"   "a lmn" "j klm" "j klm"

where vec is the vector containing your strings.
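For a self-contained run, `vec` could be defined from the sample lines in the question. The sketch below is a variation on the call above, not a replacement for it: it uses `regexpr` instead of `gregexpr`, on the assumption that each string contains at most one value to extract. The `\\K` in the pattern discards everything matched so far, so only the text after the `>` is reported as the match.

```r
# Sample lines from the question, used here as an assumed value for `vec`
vec <- c("01:04:43.064 [12439] <2> xyz",
         "01:04:43.067 [12439] <2> a lmn",
         "01:04:43.068 [12439] <4> j klm",
         "x_times_wait to <3000>",
         "01:04:43.068 [12439] <4> j klm",
         "enter_object <5000> main k")

pattern <- "^(\\d{2}:){2}\\d{2}\\.\\d{3}.*>\\s*\\K.+"

# regexpr() finds at most one match per element; \K needs perl = TRUE
m <- regexpr(pattern, vec, perl = TRUE)

# regmatches() silently drops the elements that did not match
result <- regmatches(vec, m)
result
```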

Upvotes: 0

vks

Reputation: 67968

\d+(?::\d+){2}\.\d+\s+\[[^\]]+\]\s+<\d+>(.+)$

Instead of splitting, try matching and grab capture group 1. See the demo:

https://regex101.com/r/vN3sH3/16

or

Split on the lookbehind (?<=<\d>) and take the second piece, as with your split2.
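The match-and-capture suggestion is given as a bare regex, so here is one way it might look in R. This is a sketch under a few assumptions: the backslashes are doubled for an R string literal, `perl = TRUE` is used because of the `(?:...)` group, and the vector `x` is rebuilt from the sample lines in the question. Group 1 begins right after the `>`, so it carries a leading space that `trimws` removes.

```r
# Sample lines from the question
x <- c("01:04:43.064 [12439] <2> xyz",
       "01:04:43.067 [12439] <2> a lmn",
       "01:04:43.068 [12439] <4> j klm",
       "x_times_wait to <3000>",
       "01:04:43.068 [12439] <4> j klm",
       "enter_object <5000> main k")

# The regex from this answer, escaped for an R string
pattern <- "\\d+(?::\\d+){2}\\.\\d+\\s+\\[[^\\]]+\\]\\s+<\\d+>(.+)$"

# regexec() returns the full match followed by each capture group;
# elements that do not match come back as character(0)
m <- regmatches(x, regexec(pattern, x, perl = TRUE))
hits <- m[lengths(m) > 0]

# Element 1 is the whole match, element 2 is capture group 1
values <- trimws(vapply(hits, `[`, character(1), 2))
values
```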

Upvotes: 3
