Mark Miller
Mark Miller

Reputation: 13113

extract irregular numeric data from strings

I have data like below. I wish to extract the first and last year from each string here called my.string. Some strings only contain one year and some strings contain no years. No strings contain more than two years. I have provided the desired result in the object named desired.result below the example data set. I am using R.

When a string contains two years those years are contained within a portion of the string that looks like this ga49.51 or ea22.24

When a string contains only one year that year is contained in a portion of the string that looks like this: time11

I know a bit about regex, but this problem seems too irregular and complex for me to figure out. I am not even sure where to begin. Thank you for any advice.

EDIT

Perhaps delete the numbers before the first colon (:) and the remaining numbers are what I want.

my.data <- read.table(text = '

                my.string  cov1 cov2

           42:Alpha:ga6.8  -0.1  2.2
          43:Alpha:ga9.11  -2.5  0.6
         44:Alpha:ga30.32  -1.3  0.5
         45:Alpha:ga49.51  -2.5  0.6
   50:Alpha:time1:ga.time  -1.7  0.9
   51:Alpha:time2:ga.time  -1.5  0.8
   52:Alpha:time3:ga.time  -1.0  1.0
             2:Beta:ea2.9  -1.7  0.6
           3:Beta:ea17.19  -5.0  0.8
           4:Beta:ea22.24  -6.4  1.0
                8:Beta:as   0.2  0.6
                9:Beta:sd   1.7  0.4
     12:Beta:time1:ea.tim  -2.6  1.8
     13:Beta:time10:ea.ti  -3.6  1.1
     14:Beta:time11:ea.ti  -3.1  0.7

', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")

desired.result <- read.table(text = '

                my.string  cov1 cov2  time1  time2

           42:Alpha:ga6.8  -0.1  2.2      6      8
          43:Alpha:ga9.11  -2.5  0.6      9     11
         44:Alpha:ga30.32  -1.3  0.5     30     32
         45:Alpha:ga49.51  -2.5  0.6     49     51
   50:Alpha:time1:ga.time  -1.7  0.9      1     NA
   51:Alpha:time2:ga.time  -1.5  0.8      2     NA
   52:Alpha:time3:ga.time  -1.0  1.0      3     NA
             2:Beta:ea2.9  -1.7  0.6      2      9
           3:Beta:ea17.19  -5.0  0.8     17     19
           4:Beta:ea22.24  -6.4  1.0     22     24
                8:Beta:as   0.2  0.6     NA     NA
                9:Beta:sd   1.7  0.4     NA     NA
     12:Beta:time1:ea.tim  -2.6  1.8      1     NA
     13:Beta:time10:ea.ti  -3.6  1.1     10     NA
     14:Beta:time11:ea.ti  -3.1  0.7     11     NA

', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")

Upvotes: 1

Views: 89

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626870

I suggest using stringr library to extract the data you need since it handles NA values better, and also allows using a constrained-width lookbehind:

> library(stringr)
> my.data$time1 <- str_extract(my.data$my.string, "(?<=time)\\d+|(?<=\\b[ge]a)\\d+")
> my.data$time2 <- str_extract(my.data$my.string, "(?<=\\b[ge]a\\d{1,100}\\.)\\d+")
> my.data
                my.string cov1 cov2 time1 time2
1          42:Alpha:ga6.8 -0.1  2.2     6     8
2         43:Alpha:ga9.11 -2.5  0.6     9    11
3        44:Alpha:ga30.32 -1.3  0.5    30    32
4        45:Alpha:ga49.51 -2.5  0.6    49    51
5  50:Alpha:time1:ga.time -1.7  0.9     1  <NA>
6  51:Alpha:time2:ga.time -1.5  0.8     2  <NA>
7  52:Alpha:time3:ga.time -1.0  1.0     3  <NA>
8            2:Beta:ea2.9 -1.7  0.6     2     9
9          3:Beta:ea17.19 -5.0  0.8    17    19
10         4:Beta:ea22.24 -6.4  1.0    22    24
11              8:Beta:as  0.2  0.6  <NA>  <NA>
12              9:Beta:sd  1.7  0.4  <NA>  <NA>
13   12:Beta:time1:ea.tim -2.6  1.8     1  <NA>
14   13:Beta:time10:ea.ti -3.6  1.1    10  <NA>
15   14:Beta:time11:ea.ti -3.1  0.7    11  <NA>

The first regex matches:

  • (?<=time)\\d+ - 1+ digits that have time before them
  • | - or
  • (?<=\\b[ge]a)\\d+ - 1+ digits that have ge or ea` as a whole word in front

The second regex matches:

  • (?<=\\b[ge]a\\d{1,100}\\.) - check if the current position is preceded with ge or ea as a whole word followed with 1 to 100 digits (I believe that should be enough for your scenario, 100-digit chunks are hardly expected here, you may even decrease the value), and then a .
  • \\d+ - 1+ digits

Upvotes: 2

Brian Stephens
Brian Stephens

Reputation: 5261

Here's a regex that will extract either of the two types, and output them to different columns at the end of the lines:

Search: .*(?:time(\d+)|(?:[ge]a)(\d+)\.(\d+)).*

Replace: $0\t$1\t$2\t$3

Breakdown:

  • .*(?: ... ).* ensures that the whole line is matched, and uses a non-capturing group for the main alternation
  • time(\d+): this is the first half of the alternation, capturing any digits after a "time"
  • (?:[ge]a)(\d+)\.(\d+): the second half of the alternation matches "ga" or "ea" followed by two sets of digits, each in its own capture group
  • Replacement: $0 puts the whole line back. Each of the other capture groups are added, with tabs in-between.

See regex101 example

Upvotes: 1

Related Questions