Reputation: 580
I am trying to extract latitude and longitude from a raw dataset. The information I am interested in always follows the same pattern, namely:
(,)(0-9)([.])(0-9) space (0-9)([.])(0-9)(,)
When I do the following, I am able to remove exactly the information I want to keep. Is there a way to do the inverse and actually keep the information I am removing at the moment using gsub?
data$l1<-gsub('(,)([0-9]+)([.])([0-9]+)[ ]([0-9]+)([.])([0-9]+)(,)',
'\\2\\3\\4\\5\\6\\7',
data$V1)
The dataset looks something like this:
V1
60346241,[37.55 55.22 5km],katekin,55.745011917 37.604520766,2013-12-04 11:59:07
603423423,[37.55 55.22 5km],#hello,#yes,miguel,53.23452 38.7379422,2013-12-04 11:49:09
So, in this example, I would like to generate a new variable V2 that would be:
V2
55.745011917 37.604520766
53.23452 38.7379422
Upvotes: 3
Views: 113
Reputation: 17611
I would use gregexpr and regmatches:
regmatches(d$V1, gregexpr("(?<=,)\\d{1,3}\\.\\d+\\s\\d{1,3}\\.\\d+", d$V1, perl = TRUE))
#[[1]]
#[1] "55.745011917 37.604520766"
#
#[[2]]
#[1] "53.23452 38.7379422"
Unlisting it and putting it in a new variable is left up to the asker.
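For that last step, a minimal sketch might look like the following. It rebuilds the two sample rows so it runs on its own, and it assumes you want only the first match per row. Returning NA for rows with no match is an assumption added here, not something the question specifies.

```r
# Sample rows from the question, rebuilt so the snippet is self-contained
d <- data.frame(
  V1 = c(
    "60346241,[37.55 55.22 5km],katekin,55.745011917 37.604520766,2013-12-04 11:59:07",
    "603423423,[37.55 55.22 5km],#hello,#yes,miguel,53.23452 38.7379422,2013-12-04 11:49:09"
  ),
  stringsAsFactors = FALSE
)

pattern <- "(?<=,)\\d{1,3}\\.\\d+\\s\\d{1,3}\\.\\d+"

# gregexpr() returns one element per row; regmatches() turns it into a list of matches
m <- regmatches(d$V1, gregexpr(pattern, d$V1, perl = TRUE))

# Keep the first match per row, or NA when a row has no match (assumed behaviour)
d$V2 <- vapply(m, function(x) if (length(x)) x[1] else NA_character_, character(1))

d$V2
```

vapply is used rather than unlist so that V2 keeps one entry per row even if a row yields zero matches or several.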
The approach here is to look for 1 to 3 digits followed by a decimal point (\\d{1,3}\\.), followed by some digits and a space (\\d+\\s), then repeat, except without the trailing space. The whole thing should be preceded by a comma, so you can use a lookbehind for the comma, i.e. (?<=,).
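To see why the lookbehind matters for this data, here is a small comparison on the first sample row. The variable names (x, with_lb, no_lb) are only for illustration.

```r
x <- "60346241,[37.55 55.22 5km],katekin,55.745011917 37.604520766,2013-12-04 11:59:07"

core <- "\\d{1,3}\\.\\d+\\s\\d{1,3}\\.\\d+"

# With the lookbehind: the match must start right after a comma
with_lb <- regmatches(x, gregexpr(paste0("(?<=,)", core), x, perl = TRUE))[[1]]

# Without it: the pair inside the square brackets also matches
no_lb <- regmatches(x, gregexpr(core, x, perl = TRUE))[[1]]

with_lb
# [1] "55.745011917 37.604520766"
no_lb
# [1] "37.55 55.22"               "55.745011917 37.604520766"
```

The bracketed pair is preceded by "[" rather than a comma, so the lookbehind is what excludes it.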
You could use gsub, though with a couple of slight modifications:
gsub("^.+?(?<=,)(\\d{1,3}\\.\\d+\\s\\d{1,3}\\.\\d+).+$", "\\1", d$V1, perl = TRUE)
# [1] "55.745011917 37.604520766" "53.23452 38.7379422"
With the gsub approach, I use a capture group to capture the part I want: (\\d{1,3}\\.\\d+\\s\\d{1,3}\\.\\d+), but I also match everything from the start of the line up to what I want to capture: ^.+?(?<=,), and everything after it until the end of the line: .+$
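One caveat with the replacement approach: when the pattern does not match a row at all, gsub (and sub) return that row unchanged, so the new column would hold the whole original string. A hedged sketch of one way to guard against that is below. The second input row is a made-up example with no coordinates, and filling such rows with NA is an assumption on my part.

```r
x <- c(
  "60346241,[37.55 55.22 5km],katekin,55.745011917 37.604520766,2013-12-04 11:59:07",
  "12345,no coordinates here,2013-12-04 11:00:00"  # hypothetical row without a match
)

pattern <- "^.+?(?<=,)(\\d{1,3}\\.\\d+\\s\\d{1,3}\\.\\d+).+$"

# Flag rows where the whole-line pattern matches
hit <- grepl(pattern, x, perl = TRUE)

# Replace the line by its capture group where it matches, otherwise use NA
v2 <- ifelse(hit, sub(pattern, "\\1", x, perl = TRUE), NA_character_)

v2
```

Since the pattern is anchored with ^ and $, it can match at most once per string, so sub is enough here; gsub would give the same result.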
Data:
d <- read.table(text = "V1
60346241,[37.55 55.22 5km],katekin,55.745011917 37.604520766,2013-12-04 11:59:07
603423423,[37.55 55.22 5km],#hello,#yes,miguel,53.23452 38.7379422,2013-12-04 11:49:09", header = TRUE, comment.char = "", sep = "\t")
Upvotes: 3