Reputation: 1
I am attempting to match links using regex in a large file of parsed PDFs. Because of the PDF parsing, there are random spaces throughout the text, including inside what would otherwise be valid URLs.
This is my test set of links:
https://github.com/dufourya/FASTl.git
asdwd
https://github.com/james-cole/brai-download-nage dwda
https://github.com
https://github.com/allgebrist/algodyn/tree/master/R
github.com/james-col
https://github.com/james-cole/brai-download-nage dwda.git
https://github.com/james-cole/brai-downl oad-nage test
https://github.com/jamesc-ole/braidown/ awwdaw/loadna-ge test
gi th ub. com/james-cole/br ai -d ow nl/ wd/wd oad-nage
https://github.com/james-cole/brai-downl/ wd/wd oad-nage test
I have been fiddling with regex for quite some time now, and my best attempts are the following:
(?:g ?i ?t ?@ ?|h ?t ?t ?p ?s ?: ?\/ ?\/ ?)?(g ?i ?t ?h ?u ?b ?. ?c ?o ?m)(\/)*((([a-zA-Z0-9\-\_: ])*(?=(\/))+)|[:\/\-\_]|[a-zA-Z0-9])* ?[a-zA-Z0-9\-]*((.git)|(.io))?
(?:g ?i ?t ?@ ?|h ?t ?t ?p ?s ?: ?\/ ?\/ ?)?(g ?i ?t ?h ?u ?b ?. ?c ?o ?m)(([^ ]*) ?([^ ]*(?=( ?\/ ?))))
I match the protocol and github.com part of the links manually, allowing an optional space between every character. For the rest, I am trying to match any text followed by a forward slash (using a positive lookahead), any number of times, then the text after the last slash, one more space, and the text that follows. The end result should be every link plus the word after it, so that I can then go through the matches manually and decide which trailing words are actually part of the link.
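For reference, the approach described above could be sketched in Python roughly as follows. This is only one possible reading of the idea, not a tested solution for the full file: the helper names (`spaced`, `find_candidates`) are made up for illustration, and it assumes at most one link per line, since the path part greedily runs to the last slash on the line.

```python
import re

def spaced(literal):
    """Regex fragment matching `literal` with an optional space between characters."""
    return " ?".join(re.escape(ch) for ch in literal)

# Optional protocol (git@ or https://), tolerant of stray spaces.
PROTOCOL = rf"(?:(?:{spaced('git@')}|{spaced('https://')}) ?)?"
# The host, again with optional spaces between characters.
HOST = spaced("github.com")
# Everything up to the last slash on the same line (spaces allowed).
PATH = r"(?:[^\n/]*/)*"
# The text after the last slash, plus at most one more space-separated word.
TAIL = r"[^\s/]*(?: [^\s/]+)?"

CANDIDATE = re.compile(PROTOCOL + HOST + PATH + TAIL)

def find_candidates(text):
    """Return every candidate link, including one trailing word for manual review."""
    return [m.group(0) for m in CANDIDATE.finditer(text)]

if __name__ == "__main__":
    sample = (
        "asdwd\n"
        "https://github.com/james-cole/brai-downl oad-nage test\n"
        "gi th ub. com/james-cole/br ai -d ow nl/ wd/wd oad-nage\n"
    )
    for candidate in find_candidates(sample):
        print(candidate)
```

Building the spaced-out prefix from the literal string avoids typing ` ?` by hand between every character, and `re.escape` takes care of the dot in github.com, which is unescaped in the patterns above and so matches any character there.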
Upvotes: 0
Views: 364
Reputation: 1345
If you can trust that your input is like your test set, then you can match one of the expected leading characters, h or g, or one or more leading spaces, just in case, followed by one or more characters that aren't \n (the "newline"):
^(g|h| +)[^\n]+
Try the pattern on https://regex101.com/r/Jaw2hd/1
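As a rough sketch of how the same pattern might be used from Python (an assumption, since the question does not name a language): `re.MULTILINE` is needed so that ^ anchors at the start of every line rather than only at the start of the whole text. The function name `candidate_lines` is made up for illustration.

```python
import re

# The pattern from above; MULTILINE makes ^ match at the start of each line.
LINE_PATTERN = re.compile(r"^(g|h| +)[^\n]+", re.MULTILINE)

def candidate_lines(text):
    """Return every line that starts with g, h, or one or more spaces."""
    return [m.group(0) for m in LINE_PATTERN.finditer(text)]

if __name__ == "__main__":
    sample = (
        "https://github.com/dufourya/FASTl.git\n"
        "asdwd\n"
        "github.com/james-col\n"
    )
    for line in candidate_lines(sample):
        print(line)
```

Keep in mind that this selects whole lines by their first character only, so any unrelated line that happens to start with g, h, or a space would be picked up as well.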
Upvotes: 1