Faizan Shaikh Sarkar

Reputation: 313

Regex to find substring between 2 Strings excluding a specific String

I have checked the existing questions on Stack Overflow but couldn't find an answer that fits my case, so I need your help.

I have multiple strings that contain URLs in different formats, for example:

1:

<p><a href='https://abcd.com/sites/WG-ProductManagementTeam/FunctionalSpecs/Forms/AllItems.aspx?id=/sites/WG-ProductManagementTeam/FunctionalSpecs/DevDOC/Enhancements to PA Peer Checklist/PA Peer Checklist (V2.3) -v10.0.pdf&amp;parent=/sites/WG-ProductManagementTeam/FunctionalSpecs/DevDOC/Enhancements to PA Peer Checklist&amp;p=true&amp;ga=1'>WG-Product Management Team - PA Peer Checklist (V2.3) -v10.0.pdf - All Documents (sharepoint.com)</a></p>

2:

https://abcd.com/sites/WG-ProductManagementTeam/FunctionalSpecs/Forms/AllItems.aspx?id=%2Fsites%2FWG%2DProductManagementTeam%2FFunctionalSpecs%2FDevDOC%2FEnhancements%20to%20PA%20Peer%20Checklist%2FPA%20Peer%20Checklist%20%28V2%2E3%29%20%2Dv10%2E0%2Epdf&parent=%2Fsites%2FWG%2DProductManagementTeam%2FFunctionalSpecs%2FDevDOC%2FEnhancements%20to%20PA%20Peer%20Checklist&p=true&ga=1

3:

https://abcd.com/:b:/r/sites/WG-ProductManagementTeam/FunctionalSpecs/DevDOC/Enhancements%20to%20PA%20Peer%20Checklist/PA%20Peer%20Checklist%20(v2.0)%20-%20v3.0.pdf?csf=1&web=1&e=txs2Yq

I want to extract this part of the URL: /DevDOC/....../.pdf

As you can see, the three URL strings above are all different, and I have not been able to find an efficient way to handle them all.

The solution needs to work for every URL string regardless of its format, extracting the same path from each one.

Right now I am using the regex "./FunctionalSpecs(?!.\1)(.*?)(.pdf)". It works for URLs 2 and 3 above, but for URL 1 it returns:

/DevDOC/Enhancements to PA Peer Checklist&p=true&ga=1'>WG-Product Management Team - PA Peer Checklist (V2.3) -v10.0.pdf

which is incorrect. I wanted this:

/DevDOC/Enhancements to PA Peer Checklist/PA Peer Checklist (V2.3) -v10.0.pdf

Please help me resolve this. It seems easy, but I have not been able to do it in an efficient way.

Also, I am doing this in Java.

Any help is highly appreciated. Thank you.

Upvotes: 0

Views: 115

Answers (2)

Tiwari

Reputation: 50

You can use decodeURIComponent to decode your URL and then extract the value as shown below. This is JavaScript; in Java the corresponding decoding call is URLDecoder.decode.

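// Decode percent-escapes first so every URL format looks alike.
var url = decodeURIComponent("your encoded url string");
// The lazy quantifier (*?) stops at the first ".pdf" rather than the last one.
console.log(url.match(/DevDOC[\s\S]*?\.pdf/i));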
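Since the question asks for Java, here is a minimal sketch of the same decode-then-match idea. The class name PdfPathExtractor and the method extract are made up for illustration, and the sketch assumes Java 10 or later (for URLDecoder.decode with a Charset) and that the wanted path always starts at "/DevDOC/" and ends at the first ".pdf" after it.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class PdfPathExtractor {

    // Lazy match: from "/DevDOC/" up to the nearest ".pdf" that follows it.
    private static final Pattern PDF_PATH =
            Pattern.compile("/DevDOC/.*?\\.pdf", Pattern.CASE_INSENSITIVE);

    public static Optional<String> extract(String raw) {
        // Turn %2F, %20, %2E and friends into plain characters first,
        // so encoded and unencoded inputs are matched the same way.
        String decoded = URLDecoder.decode(raw, StandardCharsets.UTF_8);
        Matcher m = PDF_PATH.matcher(decoded);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }
}
```

Two caveats with this sketch: URLDecoder.decode also turns "+" into a space, and it throws IllegalArgumentException when a "%" is not followed by two hex digits, so inputs like that would need separate handling.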

Upvotes: 0

Shai Vashdy

Reputation: 1

You can either decode and then use:

 `/DevDOC/[^\.]+\.pdf`

Or without decoding you might want to use:

DevDOC[^\.]+pdf

I'm relying here on the period in .pdf: the regex keeps going until the first period it encounters. If that doesn't work, you might want to use [^"]+ instead.
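To make the "until the first period" behaviour concrete, here is a small Java sketch of the decoded-input pattern. The class name NegatedClassDemo and the path "/DevDOC/Specs/Checklist.pdf" are made up for illustration. It suggests the pattern is fine when the only period is the one in ".pdf", but finds nothing when the file name contains a version number such as "V2.3".

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
public class NegatedClassDemo {

    // [^.]+ consumes everything up to the first period; "\\.pdf" must follow at once.
    private static final Pattern NO_INNER_PERIOD =
            Pattern.compile("/DevDOC/[^.]+\\.pdf");

    public static boolean matches(String decodedPath) {
        return NO_INNER_PERIOD.matcher(decodedPath).find();
    }

    public static void main(String[] args) {
        // Only period is the extension: the pattern finds a match.
        System.out.println(matches("/DevDOC/Specs/Checklist.pdf"));
        // A period inside "V2.3" stops [^.]+ early, so no match is found.
        System.out.println(matches(
                "/DevDOC/Enhancements to PA Peer Checklist/PA Peer Checklist (V2.3) -v10.0.pdf"));
    }
}
```

For file names like the ones in the question, a lazy match up to ".pdf" on the decoded string may be the safer choice.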

Upvotes: 0
