Reputation: 143
I am trying to parse URLs in Ruby and return the ones that contain a given word as a path segment, that is, after a "/" following the .com, .org, etc.
For example, if I am trying to capture "questions", I want to match https://stackoverflow.com/questions and also https://stackoverflow.com/blah/questions, but not https://stackoverflow.com/queStioNs.
Currently my expression matches https://stackoverflow.com/questions but not URLs where "questions" comes after another "/", or two "/"s, etc.
The end of my regular expression is \bquestions\b.
I tried ([a-zA-Z]+\W{1}+\bjob\b|\bjob\b), but this only matches URLs with /questions and /blah/questions, not /blah/bleh/questions.
What am I doing wrong and how do I match what I need?
Upvotes: 0
Views: 282
Reputation: 6918
I don't know whether there is a simpler way, but here is my solution:
# Scheme, host, then zero or more lowercase path segments; a repeated group keeps only its last match.
regexp = %r{\Ahttps?://\w+\.(?:com|org|edu)(/[a-z]+)*\z}
match = "https://stackoverflow.com/blah/questions".match(regexp)
# match is nil when the URL does not fit the pattern, so guard before indexing.
match[1].delete("/") if match
It returns 'questions'.
Update, as per your comments below:
Use [\S]*(\/questions){1}$
Hope it helps :)
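As a rough sketch of how the updated pattern could be used to filter a list of URLs: the array `urls` and the variable `questions_at_end` are names I made up for illustration, and the fourth URL is sample data built from the path mentioned in the question. Ruby regexes are case-sensitive unless the i flag is given, so queStioNs is rejected.

```ruby
# Illustrative sample data; the variable names are assumptions, not from the question.
urls = [
  'https://stackoverflow.com/questions',
  'https://stackoverflow.com/blah/questions',
  'https://stackoverflow.com/queStioNs',
  'https://stackoverflow.com/blah/bleh/questions'
]

# Simplified form of the pattern above: "/questions" must be the last path segment.
# \z anchors at the very end of the string; no /i flag, so matching stays case-sensitive.
questions_at_end = %r{/questions\z}

# Keep only the URLs whose final segment is exactly "questions".
matching = urls.select { |url| url.match?(questions_at_end) }
puts matching
```

The number of segments before /questions does not matter here, because the pattern only looks at the end of the string.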
Upvotes: 0
Reputation: 4056
You don't actually need a regex for this, you can instead use the URI module:
require 'uri'
urls = ['https://stackoverflow.com/blah/questions', 'https://stackoverflow.com/queStioNs']
urls.each do |url|
  the_path = URI(url).path
  puts the_path if the_path.include?('questions')
end
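Note that String#include? is a substring check, so a path such as /myquestions would also pass. If the word has to be a whole path segment, one possible variation is to split the path on "/" first. This is only a sketch: `path_has_segment?` is a hypothetical helper name, and the third URL is made-up sample data.

```ruby
require 'uri'

# Hypothetical helper (not part of the answer above): true when `word`
# is exactly one of the "/"-separated segments of the URL's path.
def path_has_segment?(url, word)
  URI(url).path.split('/').include?(word)
end

urls = [
  'https://stackoverflow.com/blah/questions',
  'https://stackoverflow.com/queStioNs',
  'https://stackoverflow.com/myquestions' # made-up URL to show the substring case
]

# Array#include? compares whole strings, so the check is exact and case-sensitive.
urls.each { |url| puts url if path_has_segment?(url, 'questions') }
```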
Upvotes: 4