aa710

Reputation: 69

REGEX: Remove middle of string after certain number of "/"

How do I remove the middle of a string using regex? I have the following URL: https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml

but I want it to look like this:

https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml

The path segment that follows "data/../../" can be dropped; that last long string of numbers isn't needed.

I tried this

    sub(sprintf("^((?:[^/]*;){8}).*"),"", URLxml)

But it doesn't do anything! Help please!

Upvotes: 0

Views: 670

Answers (2)

t c

Reputation: 101

    a <- 'https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml'
    gsub('data/(.+?)/(.+?)/(.+?)/', 'data/\\1/\\2/', a)

So in the URL, the third path segment after data/ is removed and the two before it are kept:

    data/.../.../..(this is removed)../ ....
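
As a rough sketch of the same idea applied to a whole vector of URLs, the pattern can be wrapped in a small helper. The function name drop_third_after_data is made up for illustration, and [^/]+ is swapped in for the lazy .+? so that each group is limited to exactly one path segment; it assumes "data/" appears only once in each URL.

```r
# Hypothetical helper (name is an assumption, not from the answer above):
# drop the third path segment that follows "data/".
drop_third_after_data <- function(urls) {
  # [^/]+ matches exactly one path segment.
  # Groups 1 and 2 are kept via \\1 and \\2; the third segment is discarded.
  sub("data/([^/]+)/([^/]+)/[^/]+/", "data/\\1/\\2/", urls)
}

urls <- c(
  "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml"
)

result <- drop_third_after_data(urls)
print(result)
```

Because sub() is vectorized over its input, the same call works on a single string or on a whole column of URLs.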

Upvotes: 0

Wiktor Stribiżew

Reputation: 627119

To remove the last but one subpart of the path, you may use:

    x <- "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml"
    sub("^(.*/).*/(.*)", "\\1\\2", x)
    ## [1] "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml"

See the online R demo and here is a regex demo.

Details:

  • ^ - start of a string
  • (.*/) - Group 1 (referred to with \1 from the replacement string) any 0+ chars up to the last but one /
  • .*/ - any 0+ chars up to the last /
  • (.*) - Group 2 (referred to with \2 backreference from the replacement string) any 0+ chars up to the end.
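
To see what each group captures, one option is to pull the submatches out with regexec() and regmatches() from base R. This is only an illustrative sketch; the variable names pattern, parts, group1 and group2 are introduced here and are not part of the answer.

```r
x <- "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml"
pattern <- "^(.*/).*/(.*)"

# regexec() returns the positions of the full match and of each group;
# regmatches() turns them into strings: element 1 is the full match,
# elements 2 and 3 are Group 1 and Group 2.
parts <- regmatches(x, regexec(pattern, x))[[1]]
group1 <- parts[2]  # everything up to and including the last but one "/"
group2 <- parts[3]  # everything after the last "/"

print(group1)
print(group2)

# Gluing the two groups back together mirrors the "\\1\\2" replacement.
rebuilt <- paste0(group1, group2)
print(rebuilt)
```

The part matched by the unparenthesized .*/ is the segment that falls between the two groups, which is why it disappears from the result.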

Upvotes: 1
