Reputation: 69
How do I remove the middle of a string using regex? I have the following URL: https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml
but I want it to look like this:
https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml
I can get rid of everything after "data/../../". That last long string of numbers isn't needed.
I tried this:
sub(sprintf("^((?:[^/]*;){8}).*"),"", URLxml)
But it doesn't do anything! Help please!
Upvotes: 0
Views: 670
Reputation: 101
a <- 'https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml'
gsub('data/(.+?)/(.+?)/(.+?)/', 'data/\\1/\\2/', a)
So in the URL, the third path segment after data/ is removed:
data/.../.../..(this is removed)../....
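One way to check the pattern is to compare its result with the desired URL from the question. This is only a sketch: `expected` and `result` are names introduced here for illustration, `perl = TRUE` is an optional choice made here rather than part of the call above, and the pattern assumes there are at least three path segments after data/.

```r
a <- "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml"

# Desired URL, copied from the question.
expected <- "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml"

# Keep the first two path segments after "data/" and drop the third.
# perl = TRUE (an assumption here) lets PCRE handle the lazy quantifier .+?
result <- sub("data/(.+?)/(.+?)/(.+?)/", "data/\\1/\\2/", a, perl = TRUE)

# TRUE when the rewritten URL matches the desired one.
identical(result, expected)
```

Writing each segment as [^/]+ instead of .+? would express the same "one path segment" idea without relying on lazy matching.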
Upvotes: 0
Reputation: 627119
To remove the last but one subpart of the path, you may use
x <- "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml"
sub("^(.*/).*/(.*)", "\\1\\2", x)
## [1] "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml"
See the online R demo and here is a regex demo.
Details:
^ - start of a string
(.*/) - Group 1 (referred to with the \1 backreference from the replacement string): any 0+ chars up to the last but one /
.*/ - any 0+ chars up to the last /
(.*) - Group 2 (referred to with the \2 backreference from the replacement string): any 0+ chars up to the end.
Upvotes: 1
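The same "drop the last but one path segment" idea can be cross-checked without a regex. The sketch below is an assumption-laden alternative, not part of the answer: it splits the URL on / and removes the second-to-last piece, and it assumes the URL has at least two pieces. The names `via_regex`, `parts`, `n`, and `via_split` are introduced here for illustration.

```r
x <- "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml"

# Regex version: group 1 is everything up to the last but one "/",
# the unmatched middle ".*/" is dropped, group 2 is the file name.
via_regex <- sub("^(.*/).*/(.*)", "\\1\\2", x)

# Split version: break the URL into pieces on "/" (fixed = TRUE means no regex),
# drop the second-to-last piece, and glue the rest back together.
parts <- strsplit(x, "/", fixed = TRUE)[[1]]
n <- length(parts)
via_split <- paste(parts[-(n - 1)], collapse = "/")

# TRUE when both approaches produce the same URL.
identical(via_regex, via_split)
```

Both `sub()` and the split approach can be applied to a whole vector of URLs, although the split version would then need `vapply()` or a similar loop over the list that `strsplit()` returns.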