Agustín Indaco

Reputation: 580

Extracting pattern from raw string

I am trying to extract latitude and longitude from a raw dataset. The information I am interested in always follows the same pattern, namely:

(,)(0-9)([.])(0-9) space (0-9)([.])(0-9)(,)

When I do the following, I am able to remove exactly the information I want to keep. Is there a way to do the inverse with gsub, that is, keep the part I am currently removing and drop everything else?

data$l1<-gsub('(,)([0-9]+)([.])([0-9]+)[ ]([0-9]+)([.])([0-9]+)(,)', 
              '\\2\\3\\4\\5\\6\\7',
              data$V1)

The dataset looks something like this:

V1
60346241,[37.55 55.22 5km],katekin,55.745011917 37.604520766,2013-12-04 11:59:07
603423423,[37.55 55.22 5km],#hello,#yes,miguel,53.23452 38.7379422,2013-12-04 11:49:09

So, in this example I would like to generate a new variable V2, that would be

V2
55.745011917 37.604520766
53.23452 38.7379422

Upvotes: 3

Views: 113

Answers (1)

Jota

Reputation: 17611

I would use gregexpr and regmatches:

regmatches(d$V1, gregexpr("(?<=,)\\d{1,3}\\.\\d+\\s\\d{1,3}\\.\\d+", d$V1, perl = TRUE))

#[[1]]
#[1] "55.745011917 37.604520766"
#
#[[2]]
#[1] "53.23452 38.7379422"

Unlisting it and putting it in a new variable is left up to the asker.
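
As a minimal sketch of that last step, assuming each row contains at most one coordinate pair: take the first match per row and fall back to NA when a row has no match. The names `pattern` and `m` and the column name `V2` are illustrative choices, and the sample data is rebuilt inline so the snippet runs on its own.

```r
# Sample data mirroring the two rows from the question.
d <- data.frame(
  V1 = c(
    "60346241,[37.55 55.22 5km],katekin,55.745011917 37.604520766,2013-12-04 11:59:07",
    "603423423,[37.55 55.22 5km],#hello,#yes,miguel,53.23452 38.7379422,2013-12-04 11:49:09"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as above: "<digits>.<digits> <digits>.<digits>" preceded by a comma.
pattern <- "(?<=,)\\d{1,3}\\.\\d+\\s\\d{1,3}\\.\\d+"
m <- regmatches(d$V1, gregexpr(pattern, d$V1, perl = TRUE))

# Keep the first match of each row; rows without a match become NA.
d$V2 <- vapply(
  m,
  function(x) if (length(x) > 0) x[1] else NA_character_,
  character(1)
)
```

Using vapply instead of a plain unlist keeps V2 the same length as V1 even when a row yields zero matches or more than one.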

The approach here is to look for 1 to 3 digits followed by a decimal point (\\d{1,3}\\.), followed by some digits and a whitespace character (\\d+\\s), then the same again without the trailing whitespace. The whole match must be preceded by a comma, so you can use a lookbehind for the comma, i.e. (?<=,), which requires perl = TRUE.


You could use gsub, though with a couple of slight modifications:

gsub("^.+?(?<=,)(\\d{1,3}\\.\\d+\\s\\d{1,3}\\.\\d+).+$", "\\1", d$V1, perl = TRUE)
# [1] "55.745011917 37.604520766" "53.23452 38.7379422"

With the gsub approach, I use a capture group for the part I want: (\\d{1,3}\\.\\d+\\s\\d{1,3}\\.\\d+). I also match everything from the start of the line up to it with ^.+?(?<=,) and everything after it to the end of the line with .+$, so the whole line is replaced by the captured group (\\1).
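
One caveat with the replacement approach: when a line does not match, the input string is returned unchanged instead of being dropped. A hedged sketch of one way to guard against that, assuming NA is the desired result for such lines, is to test with grepl first. The vector `x`, the second sample line, and the names `pattern` and `coords` are made up for illustration.

```r
# First element is a row from the question; the second is a hypothetical
# line with no coordinates, added to show the non-matching case.
x <- c(
  "60346241,[37.55 55.22 5km],katekin,55.745011917 37.604520766,2013-12-04 11:59:07",
  "no coordinates here"
)

pattern <- "^.+?(?<=,)(\\d{1,3}\\.\\d+\\s\\d{1,3}\\.\\d+).+$"

# sub() is enough here because the anchored pattern matches at most once per line.
# Lines that do not match are set to NA instead of being passed through as-is.
coords <- ifelse(
  grepl(pattern, x, perl = TRUE),
  sub(pattern, "\\1", x, perl = TRUE),
  NA_character_
)
```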


Data:

d <- read.table(text = "V1
60346241,[37.55 55.22 5km],katekin,55.745011917 37.604520766,2013-12-04 11:59:07
603423423,[37.55 55.22 5km],#hello,#yes,miguel,53.23452 38.7379422,2013-12-04 11:49:09", header = TRUE, comment.char = "", sep = "\t")

Upvotes: 3
