Matching a GitHub URL with spaces in it

I am trying to match links with a regex in a large file of parsed PDFs. Because of the PDF parsing, random spaces are scattered throughout the text, including inside what would otherwise be valid URLs.

This is my test set of links:

https://github.com/dufourya/FASTl.git
asdwd

https://github.com/james-cole/brai-download-nage dwda


https://github.com


https://github.com/allgebrist/algodyn/tree/master/R

github.com/james-col


https://github.com/james-cole/brai-download-nage dwda.git
https://github.com/james-cole/brai-downl oad-nage test

https://github.com/jamesc-ole/braidown/ awwdaw/loadna-ge test

gi th ub. com/james-cole/br ai -d ow nl/ wd/wd oad-nage

https://github.com/james-cole/brai-downl/ wd/wd oad-nage test

I have been fiddling with regex for quite some time now, and my best attempts are the following:

(?:g ?i ?t ?@ ?|h ?t ?t ?p ?s ?: ?\/ ?\/ ?)?(g ?i ?t ?h ?u ?b ?. ?c ?o ?m)(\/)*((([a-zA-Z0-9\-\_: ])*(?=(\/))+)|[:\/\-\_]|[a-zA-Z0-9])* ?[a-zA-Z0-9\-]*((.git)|(.io))?

(?:g ?i ?t ?@ ?|h ?t ?t ?p ?s ?: ?\/ ?\/ ?)?(g ?i ?t ?h ?u ?b ?. ?c ?o ?m)(([^ ]*) ?([^ ]*(?=( ?\/ ?))))

I match the protocol and the github.com part of the links character by character, allowing an optional space after each character. For the rest, I try to match any text followed by a forward slash (using a positive lookahead), any number of times, then the text after the last slash, one more space, and the word that follows. The end result should be every link plus the word after it, so that I can then go through the matches manually and decide which trailing words are actually part of the links.
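For illustration, the "optional space after each character" idea can be generated rather than typed by hand. This is only a minimal Python sketch of that one step; the helper name `spaced` and the constant `HOST` are hypothetical, and unlike the patterns above it escapes the dot so that it matches a literal "." only.

```python
import re

def spaced(literal):
    """Build a regex for `literal` that allows one optional space
    between consecutive characters (hypothetical helper)."""
    return " ?".join(re.escape(ch) for ch in literal)

# Matches "github.com" even when single spaces were injected into it.
HOST = re.compile(spaced("github.com"))

for text in ["https://github.com", "gi th ub. com/james-cole/br ai"]:
    match = HOST.search(text)
    print(repr(text), "->", match.group(0) if match else None)
```

The same helper could be applied to "https://" to build the protocol part; the path part still needs its own rule.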

Upvotes: 0

Views: 364

Answers (1)

rhinosforhire

Reputation: 1345

If you can trust that your input looks like your test set, you can match every line that starts with one of the expected leading characters, h or g, or with one or more leading spaces, just in case, followed by any number of characters that aren't \n (the newline). The ^ has to match at the start of each line, so the multiline flag must be enabled.

^(g|h| +)[^\n]+

Try the pattern on https://regex101.com/r/Jaw2hd/1
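As a rough sketch of how this pattern could be applied outside regex101, here it is in Python on a few lines from the question's test set. The names `LINE_PATTERN`, `candidate_lines` and `sample` are made up for the example, and `re.MULTILINE` is what lets ^ match at the start of every line.

```python
import re

# The pattern from this answer; MULTILINE makes ^ anchor at each line start.
LINE_PATTERN = re.compile(r"^(g|h| +)[^\n]+", re.MULTILINE)

def candidate_lines(text):
    """Return every line that starts with 'g', 'h', or a space."""
    return [m.group(0) for m in LINE_PATTERN.finditer(text)]

# A few lines taken from the question's test set.
sample = "\n".join([
    "https://github.com/dufourya/FASTl.git",
    "asdwd",
    "",
    "https://github.com",
    "github.com/james-col",
    "gi th ub. com/james-cole/br ai -d ow nl/ wd/wd oad-nage",
])

for line in candidate_lines(sample):
    print(line)
```

This keeps the four link lines and skips "asdwd" and the empty line. Any other line that happens to start with g or h would be kept as well, which is why it only works if the input really looks like the test set.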

Upvotes: 1
