Regex for Newbie

Question

I'm new to regex and trying to figure something out for use in Scala.

I'm trying to identify URLs within a very long string. I've looked around a lot and the best I've found is

val regex = """https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?""".r

That leaves a little something to be desired, however, since it leaves things like "">Images" at the end of the match. I'm trying to figure out what the heck my regex means so I can dissect it and make it stop when it hits a non-word character after the . in .com/.org/.edu/.whatever.

I was hoping someone wouldn't mind explaining what individual elements are within this pre-formed regex so that I may figure out what's going on and learn more about regex. I've gone through a tutorial or two and found out some things, but what I've asked for I think would be invaluable to me right now.

I get that:

? after https means the s is optional
? after an element means it's optional
\w seems to mean word characters
\d seems to mean digits
. covers most characters unless escaped

I don't get:

how we're figuring out when to escape
how : works or +
what escape characters are really (I thought it was a backslash, but that doesn't seem to work here?)
how to specify that a requirement can work for a range, so like word char isn't just one char, but 1-X chars

Anyways I was hoping someone could mentor me for a question rather than shove me to yet another tutorial by helping explain individual elements as they come up. I'd appreciate it.

regexlib was helpful and got me:

val regex = """https?://\w+\.\w+\.\w+[\w/_\.\?=&:]+""".r

every bit of which I understand!

Neil Essy · Accepted Answer

I think your main problem, ">Images being included in the match, is solved by replacing the part that matches the query string

(\?\S+)

with something that does not match ", < or >, as \S does:

(\?[\w=$&.\-^@#~+%]+)
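To see the difference, here is a minimal sketch that runs both versions over the same input. The sample string, the object name UrlRegexSketch and the value names are made up for illustration and are not from the question; only the two patterns come from this thread.

```scala
object UrlRegexSketch {
  // The regex from the question: after the '?', \S+ consumes every
  // non-whitespace character, including " < and >.
  val original = """https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?""".r

  // Same regex with the query part narrowed to an explicit character class,
  // so the match stops at the first character outside that class.
  val narrowed = """https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?[\w=$&.\-^@#~+%]+)?)?)?""".r

  // Hypothetical sample input (an assumption, not taken from the question).
  val sample = """<a href="http://example.com/search?q=cats">Images</a> more text"""

  def main(args: Array[String]): Unit = {
    // findAllIn returns an iterator over every matched substring.
    println(original.findAllIn(sample).toList) // match runs on through ">Images</a>
    println(narrowed.findAllIn(sample).toList) // match ends right after q=cats
  }
}
```

The narrowed class only lists characters you expect in a query string, so if your URLs use something else there (a comma, say), you would need to add it to the brackets.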
