Elena

Reputation: 31

Regex for finding URLs inside text and parsing them into URI parts

I have a long text that may contain links like schema://example.com/{entity}/{id}.

I need to extract them so that the result looks like:

{entity1} => {id1}
{entity1} => {id2}
{entity2} => {id3}
{entity2} => {id4}

I can extract all the URLs with

\bschema:\/\/(?:(?!&[^;]+;)[^\s"'<>)])+\b

And then parse each one with

schema:\/\/example\.com\/(.*)\/(.*)

But I need a more optimized way. Could you help me, please?

Upvotes: 2

Views: 115

Answers (2)

mickmackusa

Reputation: 47992

As with all regex tasks, you can improve efficiency by using "negated character classes" and minimizing your "capture groups".

Demo Link (Pattern #1: 62 steps) (Pattern #2: 60 steps and a smaller output array)

$string="bskdkbfnz schema://example.com/bob/1. flslnenf. Ddndkdn schema://example.com/john/2";

// This one uses negated character classes with 2 capture groups
var_export(preg_match_all("~\bschema://example\.com/([^/]*)/([^.\s]*)~",$string,$out)?array_combine($out[1],$out[2]):'no matches');

echo "\n";
// This one uses negated character classes with 1 capture group. \K restarts the fullstring match.
var_export(preg_match_all("~\bschema://example\.com/([^/]*)/\K[^.\s]*~",$string,$out)?array_combine($out[1],$out[0]):'no matches');

Output:

array (
  'bob' => '1',
  'john' => '2',
)
array (
  'bob' => '1',
  'john' => '2',
)

If you find that your second targeted substring is matching too far because of a certain character, just add that character to the negated character class.

I can't be 100% confident regarding the variability of your data, but if entity substrings are always lowercase letters, you could use [a-z]. If id substrings are always numbers, you could use \d. This decision requires intimate knowledge of the expected input strings.
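To illustrate that last point, here is a minimal sketch that assumes entities are only lowercase letters and ids are only digits. That is an assumption about your data, not something the question confirms, and the sample string is just the one from above.

```php
<?php
// Sample text reused from the answer above.
// Assumption: entities are lowercase letters only, ids are digits only.
$string = "bskdkbfnz schema://example.com/bob/1. flslnenf. Ddndkdn schema://example.com/john/2";

// [a-z]+ and \d+ replace the negated classes, so trailing punctuation
// such as "." or "," can never end up inside a capture.
$found = preg_match_all('~\bschema://example\.com/([a-z]+)/(\d+)~', $string, $out);

// Fall back to an empty array when nothing matched.
$result = $found ? array_combine($out[1], $out[2]) : [];

var_export($result);
```

The trade-off is that a stricter class silently skips any URL that does not fit it (an uppercase entity, for instance), so only tighten the pattern if you are sure about the input.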

Upvotes: 1

Andreas

Reputation: 23958

Not sure if I understood the complexity of the question but this should do what you need.

I use the pattern to capture the entity and id and then I combine them with array_combine.

// $txt holds the input text; the dot in example.com is escaped so it only matches a literal "."
preg_match_all("~schema://example\.com/(.*?)/(.*?)(\.|\s|$)~", $txt, $matches);

$arr = array_combine($matches[1], $matches[2]);
var_dump($arr);

https://3v4l.org/NGrFQ
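One caveat with array_combine: array keys are unique, so if the same entity occurs more than once (as in the question's expected output) a later id overwrites an earlier one. Below is a rough sketch of one way around that, collecting the ids per entity instead. The sample $txt is made up, and the pattern assumes ids never contain ".", "," or whitespace.

```php
<?php
// Hypothetical input in which the same entity appears with several ids.
$txt = "see schema://example.com/user/1, schema://example.com/user/2 and schema://example.com/post/7.";

// PREG_SET_ORDER returns one array per match: [full match, entity, id].
preg_match_all('~\bschema://example\.com/([^/\s]+)/([^.,\s]+)~', $txt, $matches, PREG_SET_ORDER);

$grouped = [];
foreach ($matches as $m) {
    // Append the id to the entity's list rather than overwriting it.
    $grouped[$m[1]][] = $m[2];
}

var_export($grouped);
```

For this sample, $grouped ends up with 'user' holding '1' and '2', and 'post' holding '7'.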

Upvotes: 1
