Reputation: 13113
I have data like below. I wish to extract the first and last year from each string here called my.string
. Some strings only contain one year and some strings contain no years. No strings contain more than two years. I have provided the desired result in the object named desired.result
below the example data set. I am using R
.
When a string contains two years those years are contained within a portion of the string that looks like this ga49.51
or ea22.24
When a string contains only one year that year is contained in a portion of the string that looks like this: time11
I know a bit about regex
, but this problem seems too irregular and complex for me to figure out. I am not even sure where to begin. Thank you for any advice.
EDIT
Perhaps delete the numbers before the first colon (:) and the remaining numbers are what I want.
my.data <- read.table(text = '
my.string cov1 cov2
42:Alpha:ga6.8 -0.1 2.2
43:Alpha:ga9.11 -2.5 0.6
44:Alpha:ga30.32 -1.3 0.5
45:Alpha:ga49.51 -2.5 0.6
50:Alpha:time1:ga.time -1.7 0.9
51:Alpha:time2:ga.time -1.5 0.8
52:Alpha:time3:ga.time -1.0 1.0
2:Beta:ea2.9 -1.7 0.6
3:Beta:ea17.19 -5.0 0.8
4:Beta:ea22.24 -6.4 1.0
8:Beta:as 0.2 0.6
9:Beta:sd 1.7 0.4
12:Beta:time1:ea.tim -2.6 1.8
13:Beta:time10:ea.ti -3.6 1.1
14:Beta:time11:ea.ti -3.1 0.7
', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
desired.result <- read.table(text = '
my.string cov1 cov2 time1 time2
42:Alpha:ga6.8 -0.1 2.2 6 8
43:Alpha:ga9.11 -2.5 0.6 9 11
44:Alpha:ga30.32 -1.3 0.5 30 32
45:Alpha:ga49.51 -2.5 0.6 49 51
50:Alpha:time1:ga.time -1.7 0.9 1 NA
51:Alpha:time2:ga.time -1.5 0.8 2 NA
52:Alpha:time3:ga.time -1.0 1.0 3 NA
2:Beta:ea2.9 -1.7 0.6 2 9
3:Beta:ea17.19 -5.0 0.8 17 19
4:Beta:ea22.24 -6.4 1.0 22 24
8:Beta:as 0.2 0.6 NA NA
9:Beta:sd 1.7 0.4 NA NA
12:Beta:time1:ea.tim -2.6 1.8 1 NA
13:Beta:time10:ea.ti -3.6 1.1 10 NA
14:Beta:time11:ea.ti -3.1 0.7 11 NA
', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
Upvotes: 1
Views: 89
Reputation: 626870
I suggest using stringr library to extract the data you need since it handles NA values better, and also allows using a constrained-width lookbehind:
> library(stringr)
> my.data$time1 <- str_extract(my.data$my.string, "(?<=time)\\d+|(?<=\\b[ge]a)\\d+")
> my.data$time2 <- str_extract(my.data$my.string, "(?<=\\b[ge]a\\d{1,100}\\.)\\d+")
> my.data
my.string cov1 cov2 time1 time2
1 42:Alpha:ga6.8 -0.1 2.2 6 8
2 43:Alpha:ga9.11 -2.5 0.6 9 11
3 44:Alpha:ga30.32 -1.3 0.5 30 32
4 45:Alpha:ga49.51 -2.5 0.6 49 51
5 50:Alpha:time1:ga.time -1.7 0.9 1 <NA>
6 51:Alpha:time2:ga.time -1.5 0.8 2 <NA>
7 52:Alpha:time3:ga.time -1.0 1.0 3 <NA>
8 2:Beta:ea2.9 -1.7 0.6 2 9
9 3:Beta:ea17.19 -5.0 0.8 17 19
10 4:Beta:ea22.24 -6.4 1.0 22 24
11 8:Beta:as 0.2 0.6 <NA> <NA>
12 9:Beta:sd 1.7 0.4 <NA> <NA>
13 12:Beta:time1:ea.tim -2.6 1.8 1 <NA>
14 13:Beta:time10:ea.ti -3.6 1.1 10 <NA>
15 14:Beta:time11:ea.ti -3.1 0.7 11 <NA>
The first regex matches:
(?<=time)\\d+
- 1+ digits that have time
before them|
- or(?<=\\b[ge]a)\\d+
- 1+ digits that have ge
or ea` as a whole word in frontThe second regex matches:
(?<=\\b[ge]a\\d{1,100}\\.)
- check if the current position is preceded with ge
or ea
as a whole word followed with 1 to 100 digits (I believe that should be enough for your scenario, 100-digit chunks are hardly expected here, you may even decrease the value), and then a .
\\d+
- 1+ digitsUpvotes: 2
Reputation: 5261
Here's a regex that will extract either of the two types, and output them to different columns at the end of the lines:
Search: .*(?:time(\d+)|(?:[ge]a)(\d+)\.(\d+)).*
Replace: $0\t$1\t$2\t$3
Breakdown:
.*(?:
... ).*
ensures that the whole line is matched, and uses a non-capturing group for the main alternationtime(\d+)
: this is the first half of the alternation, capturing any digits after a "time"(?:[ge]a)(\d+)\.(\d+)
: the second half of the alternation matches "ga" or "ea" followed by two sets of digits, each in its own capture group$0
puts the whole line back. Each of the other capture groups are added, with tabs in-between.Upvotes: 1