s7h

Reputation: 51

Regex to stop parsing after semicolon is encountered

I am using this regex to extract a URL from a semicolon-separated string:

\b(?:https?:|http?:|www\.)\S+\b

It works fine when my input text is in one of these formats:

    "Google;\"https://google.com\""
    //output - https://google.com
    "Yahoo;\"www.yahoo.com\""
    //output - www.yahoo.com

But in this case it returns an incorrect string:

    "https://google.com;\"https://google.com\""
    //output - https://google.com;\"https://google.com

How can I stop the parsing when I encounter the ';'?

Upvotes: 1

Views: 374

Answers (3)

Chris

Reputation: 2304

I would personally just modify the regex to look specifically for URLs and make the http(s):// protocol and the www. prefix optional. Using \S+ can be iffy because it grabs every non-whitespace character, whereas a URL is limited in the characters it can contain.
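To illustrate the point about \S+, here is a minimal sketch in Python (used only for illustration). It assumes the backslashes in the question are string-literal escapes, so the runtime text contains plain double quotes. NARROWED_PATTERN is a hypothetical variant that is not from the question; it simply swaps \S+ for a class that excludes whitespace, ';' and '"'.

```python
import re

# Pattern copied from the question.
QUESTION_PATTERN = re.compile(r'\b(?:https?:|http?:|www\.)\S+\b')

# Hypothetical variant: same prefixes, but the body may not contain
# whitespace, a semicolon, or a double quote.
NARROWED_PATTERN = re.compile(r'\b(?:https?:|www\.)[^\s;"]+\b')

# Assumed runtime value of the failing input (escapes already resolved).
text = 'https://google.com;"https://google.com"'

# \S+ is greedy and happily consumes ';' and '"', so the match overruns.
overrun = QUESTION_PATTERN.search(text).group(0)

# The narrowed class stops at the first ';' or '"'.
narrowed = NARROWED_PATTERN.findall(text)

print(overrun)   # https://google.com;"https://google.com
print(narrowed)  # ['https://google.com', 'https://google.com']
```

The narrowed class is only one way to limit the match; it would also cut off a URL that legitimately contains a semicolon.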

Something like this should work great for your particular needs.

(https?:\/{2})?([w]{3}\.)?\w+\.[a-zA-Z]+

This makes the http protocol (with an optional s) optional; when present, it must be immediately followed by ://. An optional www. prefix comes next. The pattern then grabs as many letters, digits, and underscores as possible up to the ., followed by a final run of letters to end the match. You can exchange the [a-zA-Z] character set for an explicit set of top-level domains if you'd prefer.
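A quick way to check this against the question's inputs is sketched below in Python. The pattern is written as a raw string with the dot after www as \. so that it matches only a literal dot. The sample strings assume the backslashes in the question are string-literal escapes, and find_urls is a hypothetical helper name.

```python
import re

# The answer's pattern, with the dot after "www" matching a literal dot.
URL_PATTERN = re.compile(r'(https?:\/{2})?([w]{3}\.)?\w+\.[a-zA-Z]+')


def find_urls(text):
    """Return every full match of URL_PATTERN in text, left to right."""
    # group(0) is the whole match; the two capture groups are ignored.
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


# Assumed runtime values of the question's inputs (escapes resolved).
samples = [
    'Google;"https://google.com"',
    'Yahoo;"www.yahoo.com"',
    'https://google.com;"https://google.com"',
]

for sample in samples:
    print(find_urls(sample))
```

Because the match ends at the first run of letters after a dot, paths and multi-level hosts are cut short (for example, https://mail.google.com/path yields https://mail.google), so this suits bare domains like the ones in the question.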

Upvotes: 1

The fourth bird

Reputation: 163467

For your example data you might use a positive lookahead (?=) and a positive lookbehind (?<=):

(?<=")(?:https?:|www\.).+?(?=;?\\")

That would match

  • (?<=") Positive lookbehind to assert that what is on the left side is a double quote
  • (?:https?:|www\.) Match either http with an optional s or www.
  • .+? Match any character one or more times, non-greedy
  • (?=;?\\") Positive lookahead which asserts that what follows is an optional ; followed by \"

Upvotes: 1

Callum Watkins

Reputation: 2991

Looking at your examples, I would just match any URL between quotation marks. Something like this:

(?<=")(?:https?:|www\.)[^"]*
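As a rough check, here is the same pattern in a small Python sketch. It assumes the backslashes in the question are string-literal escapes, so the text seen by the regex contains plain double quotes; url_between_quotes is a hypothetical helper name.

```python
import re

# After a double quote, match the prefix and then everything up to the next quote.
QUOTED_URL_PATTERN = re.compile(r'(?<=")(?:https?:|www\.)[^"]*')


def url_between_quotes(text):
    """Return the first quoted URL in text, or None if there is none."""
    match = QUOTED_URL_PATTERN.search(text)
    return match.group(0) if match else None


# Assumed runtime values of the question's inputs (escapes resolved).
samples = [
    'Google;"https://google.com"',
    'Yahoo;"www.yahoo.com"',
    'https://google.com;"https://google.com"',
]

for sample in samples:
    print(url_between_quotes(sample))
```

If the text does contain literal backslashes before the quotes, [^"]* would include the trailing backslash in the match, so the class would need to exclude it as well.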


Or as others have said, split the input string by the semicolon character using string.Split, and check each string sequentially for your desired match.
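The split-then-check idea could look roughly like this. The sketch uses Python's str.split as a stand-in for string.Split; urls_from_fields and URL_FIELD are hypothetical names, and the sample input assumes plain double quotes around the second field.

```python
import re

# A field counts as a URL if it starts with http:, https: or www.
URL_FIELD = re.compile(r'(?:https?:|www\.)\S+')


def urls_from_fields(text):
    """Split on ';' and return the fields that look like URLs."""
    urls = []
    for field in text.split(';'):
        # Drop surrounding whitespace and double quotes before checking.
        candidate = field.strip().strip('"')
        if URL_FIELD.fullmatch(candidate):
            urls.append(candidate)
    return urls


print(urls_from_fields('Google;"https://google.com"'))
print(urls_from_fields('https://google.com;"https://google.com"'))
```

Since the semicolon is consumed by the split, no field can run past it, which removes the need for lookarounds. The trade-off is that a URL containing a semicolon would be split into separate fields.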

Upvotes: 1
