Reputation: 175
I'm new to regex and I am trying to grab urls from a big html-text file. The links are "trapped" in the following types of strings:
,"link_value":"https://www.linkedin.com/company/randomcompanyA"},"event":"link_click&
I want to write a regex that will get me any string starting and ending with " and containing linkedin or instagram, etc. In other words, I want to grab strings/links by defining a substring in that link, so I do not want a general pattern returning all links in a file. So far I've been able to write the following:
(?<=").+?(?=")
But I'm not able to work the 'contains linkedin' part into it. The above pattern would therefore also return link_value, for example.
Any help is appreciated!
Upvotes: 1
Views: 1060
Reputation: 785176
Since you're already using lookarounds, you can make your regex more specific by starting your match with http:// or https://, like this:
(?<=")https?:\/\/[^\/]*?\b(?:linkedin|instagram)\.\S+?(?=")
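To try the pattern outside a regex tester, here is a minimal Python sketch. The sample text is hypothetical, modeled on the string in the question, and the slashes are left unescaped because Python does not need them escaped; the pattern is otherwise the same.

```python
import re

# Same pattern as above; "/" needs no escaping in a Python raw string.
PATTERN = re.compile(r'(?<=")https?://[^/]*?\b(?:linkedin|instagram)\.\S+?(?=")')

# Hypothetical sample text, modeled on the string shown in the question.
sample = (
    ',"link_value":"https://www.linkedin.com/company/randomcompanyA"},'
    '"event":"link_click","other":"https://example.com/page",'
    '"link_value":"https://www.instagram.com/someuser"}'
)

# The pattern has no capturing groups, so findall returns the full matches.
links = PATTERN.findall(sample)
print(links)
```

Only the linkedin and instagram links should come back; link_value, event and the example.com URL are skipped because they do not satisfy the host part of the pattern.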
RegEx Details:
https?:\/\/ matches http:// or https://
[^\/]*? matches 0 or more of any character that is not / (lazy)
\b(?:linkedin|instagram)\. matches any of the given strings in the link, followed by a dot
\S+? matches 1 or more of any character that is not whitespace (lazy)
Upvotes: 1
Reputation: 42
This regex will grab URLs regardless of the quotes around them:
https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)
Tell me if it works
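Since this pattern matches every URL, one way to get only the links the question asks about is to filter the matches afterwards. The following is a rough Python sketch under that assumption; `find_links` and `KEYWORDS` are hypothetical names, not part of the answer, and the keyword check is a plain substring test, so it also accepts a keyword that appears in the path rather than the host.

```python
import re

# General URL pattern from this answer, split over two lines for readability.
URL_PATTERN = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b'
    r'([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'
)

# Hypothetical list of substrings a link must contain.
KEYWORDS = ("linkedin", "instagram")


def find_links(text, keywords=KEYWORDS):
    """Return matched URLs that contain at least one keyword."""
    # The pattern has capturing groups, so use finditer and group(0)
    # to get the whole match instead of a tuple of groups.
    urls = (m.group(0) for m in URL_PATTERN.finditer(text))
    return [u for u in urls if any(k in u for k in keywords)]


# Hypothetical sample text, modeled on the string shown in the question.
sample = (
    '"a":"https://www.linkedin.com/company/randomcompanyA",'
    '"b":"https://example.com/page"'
)
print(find_links(sample))
```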
Upvotes: 0