Matching a GitHub URL with spaces in it

Question

I am trying to use a regex to match links in a large file of text extracted from PDFs. The PDF parsing has left random spaces throughout the text, including inside what would otherwise be valid URLs.

This is my test set of links:

https://github.com/dufourya/FASTl.git
asdwd

https://github.com/james-cole/brai-download-nage dwda


https://github.com


https://github.com/allgebrist/algodyn/tree/master/R

github.com/james-col


https://github.com/james-cole/brai-download-nage dwda.git
https://github.com/james-cole/brai-downl oad-nage test

https://github.com/jamesc-ole/braidown/ awwdaw/loadna-ge test

gi th ub. com/james-cole/br ai -d ow nl/ wd/wd oad-nage

https://github.com/james-cole/brai-downl/ wd/wd oad-nage test

I have been fiddling with regex for quite some time now, and my best attempts are the following:

(?:g ?i ?t ?@ ?|h ?t ?t ?p ?s ?: ?\/ ?\/ ?)?(g ?i ?t ?h ?u ?b ?. ?c ?o ?m)(\/)*((([a-zA-Z0-9\-\_: ])*(?=(\/))+)|[:\/\-\_]|[a-zA-Z0-9])* ?[a-zA-Z0-9\-]*((.git)|(.io))?

(?:g ?i ?t ?@ ?|h ?t ?t ?p ?s ?: ?\/ ?\/ ?)?(g ?i ?t ?h ?u ?b ?. ?c ?o ?m)(([^ ]*) ?([^ ]*(?=( ?\/ ?))))

I match the protocol and the github.com part of each link character by character, allowing an optional space after each character. For the path, I am trying to match any text that is followed by a forward slash (using a positive lookahead), any number of times, then the text after the last slash, then one more space and the word that follows it. The end result should be every link plus the word after it, so that I can go through the matches manually and decide which trailing words actually belong to the links.
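To make the intent concrete, here is a minimal Python sketch of that idea. It is not a working solution from the attempts above: the names `spaced`, `LINK_RE` and `find_links` are made up for illustration, only the `https://` prefix is handled (not `git@`), and the lookahead is replaced by a plainer "everything up to each slash on the same line" rule.

```python
import re


def spaced(literal: str) -> str:
    """Build a pattern for `literal` that tolerates one space between characters."""
    return " ?".join(re.escape(ch) for ch in literal)


# Hypothetical pattern, assembled from three parts:
#   1. an optional, space-tolerant "https://"
#   2. a space-tolerant "github.com"
#   3. an optional path: every chunk that ends in "/" on the same line
#      (spaces allowed), then the text after the last slash, then at most
#      one extra space-separated word kept for manual review
LINK_RE = re.compile(
    "(?:" + spaced("https://") + " ?)?"
    + spaced("github.com")
    + r"(?: ?/(?:[^/\n]*/)*[^\s/]*(?: [^\s/]+)?)?"
)


def find_links(text: str) -> list[str]:
    """Return every candidate link, each possibly with one trailing word."""
    return [m.group(0) for m in LINK_RE.finditer(text)]


if __name__ == "__main__":
    sample = (
        "https://github.com/dufourya/FASTl.git\n"
        "asdwd\n"
        "https://github.com/james-cole/brai-downl oad-nage test\n"
        "gi th ub. com/james-cole/br ai -d ow nl/ wd/wd oad-nage\n"
    )
    for link in find_links(sample):
        print(repr(link))
```

One known weakness of this sketch: because a chunk may contain spaces as long as a slash follows on the same line, a later, unrelated slash on that line would be pulled into the match. That may be acceptable when each match is reviewed by hand anyway.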
