Reputation: 95
I'm new to regex and trying to figure something out for use in Scala.
I'm trying to identify URLs within a very long string. I've looked around a lot and the best I've found is:
val regex = """https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?""".r
That leaves a little something to be desired, however, leaving things like "">Images" at the end of the match. I'm trying to figure out what the heck my regex means so I can dissect it and make it stop when it hits a non-word character after the . in .com/.org/.edu/.whatever.
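To make the problem concrete, here is a minimal sketch that runs the pattern above on an HTML link. The sample markup and the names TrailingMarkupDemo, html, and found are made up for illustration; only the pattern itself comes from this post.

```scala
object TrailingMarkupDemo {
  // The pattern from the question, unchanged.
  val regex = """https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?""".r

  // Hypothetical sample input: an HTML link whose URL has a query string.
  val html = """<a href="http://images.example.com/search?q=cats">Images of cats</a>"""

  // \S+ keeps consuming until the first whitespace character, so the quote,
  // the angle bracket, and the word Images all end up inside the match.
  val found: Option[String] = regex.findFirstIn(html)

  def main(args: Array[String]): Unit = println(found)
}
```

With this input, found holds http://images.example.com/search?q=cats">Images, which is the kind of trailing junk described above.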
I was hoping someone wouldn't mind explaining what individual elements are within this pre-formed regex so that I may figure out what's going on and learn more about regex. I've gone through a tutorial or two and found out some things, but what I've asked for I think would be invaluable to me right now.
I get that:
? after https means the s is optional
? after other elements means they're optional
\w seems to mean word characters
\d seems to mean digits
. covers most characters unless escaped
I don't get how : works, or +.
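One way to dissect the pattern is to write it in the regex engine's comment mode, where each piece gets its own line and a note. The sketch below assumes the standard Java regex engine that Scala's .r uses. The (?x) flag makes it ignore whitespace and treat # as the start of a comment. The object name, the sample text, and the value names are invented for illustration.

```scala
object AnnotatedRegexDemo {
  // The same pattern as the one-line version, only laid out with comments.
  val annotated = """(?x)
    https?        # the text 'http' followed by an optional 's'
    ://           # ':' and '/' are not special here, so this is a literal '://'
    ([-\w\.]+)+   # host: '+' means one or more of '-', word characters, or '.'
    (:\d+)?       # optional port: a literal ':' followed by one or more digits
    (/            # optional path, starting with a literal '/'
      ([\w/_\.]*  #   '*' means zero or more of these path characters
        (\?\S+)?  #   optional query: a literal '?' then one or more non-whitespace characters
      )?
    )?
  """.r

  // The one-line pattern from the question, for comparison.
  val original = """https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?""".r

  // Hypothetical sample text containing a port and a query string.
  val text = "see http://localhost:8080/a/b.html?x=1 for details"

  val fromAnnotated: Option[String] = annotated.findFirstIn(text)
  val fromOriginal: Option[String] = original.findFirstIn(text)

  def main(args: Array[String]): Unit = {
    println(fromAnnotated)
    println(fromOriginal)
  }
}
```

Both values come out as http://localhost:8080/a/b.html?x=1. So : on its own is just a literal colon, and + means "one or more of whatever comes directly before it".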
Anyways I was hoping someone could mentor me for a question rather than shove me to yet another tutorial by helping explain individual elements as they come up. I'd appreciate it.
regexlib was helpful and got me:
val regex = """https?://\w+\.\w+\.\w+[\w/_\.\?=&:]+""".r
every bit of which I understand!
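A quick sketch of how this second pattern behaves, with made-up inputs (the object and value names are illustrative only). It stops at the quote because " is not in the final character class, but it also insists on three dot-separated parts in the host, so a two-part host appears not to match.

```scala
object RegexlibPatternDemo {
  // The regexlib-derived pattern from the post, unchanged.
  val regex = """https?://\w+\.\w+\.\w+[\w/_\.\?=&:]+""".r

  // Hypothetical input with a three-part host: the match ends before the quote.
  val html = """<a href="http://www.example.com/search?q=cats">Images of cats</a>"""
  val withMarkup: Option[String] = regex.findFirstIn(html)

  // Hypothetical input with a two-part host: the second required '.' is never found.
  val twoPartHost: Option[String] = regex.findFirstIn("http://example.com/index.html")

  def main(args: Array[String]): Unit = {
    println(withMarkup)
    println(twoPartHost)
  }
}
```

Here withMarkup is http://www.example.com/search?q=cats and twoPartHost is None.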
Upvotes: 0
Views: 335
Reputation: 3607
I think your main problem, with ">Images being included, is solved by replacing the part that matches the URL's query string
(\?\S+)
with a character class that does not include ", < or >, which \S does match because it accepts any non-whitespace character:
(\?[\w=$&.\-^@#~+%]+)
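As a rough check, here is the full pattern with that substitution applied, run on a made-up HTML link. The object name, the sample markup, and the value names are assumptions for illustration; the character class is the one suggested above.

```scala
object QueryClassDemo {
  // The question's pattern with (\?\S+) swapped for the narrower query class.
  val regex =
    """https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?[\w=$&.\-^@#~+%]+)?)?)?""".r

  // Hypothetical sample input: an HTML link whose URL has a query string.
  val html = """<a href="http://images.example.com/search?q=cats">Images of cats</a>"""

  // The query part now stops at the first character outside the class,
  // which here is the closing double quote of the href attribute.
  val found: Option[String] = regex.findFirstIn(html)

  def main(args: Array[String]): Unit = println(found)
}
```

With this input, found is http://images.example.com/search?q=cats, without the trailing ">Images. Query strings that use characters outside the class (for example a comma) would be cut short, so the class may need extending for your data.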
Upvotes: 2