Reputation: 31
I have a long text that may contain links like
schema://example.com/{entity}/{id}
I need to extract them so that the result looks like:
{entity1} => {id1}
{entity1} => {id2}
{entity2} => {id3}
{entity2} => {id4}
I can extract all the URLs with
\bschema:\/\/(?:(?!&[^;]+;)[^\s"'<>)])+\b
and then parse each one with
schema:\/\/example\.com\/(.*)\/(.*)
But I need a more optimized way. Could you help me, please?
Upvotes: 2
Views: 115
Reputation: 47992
As with all regex tasks, you can improve efficiency by using "negated character classes" and minimizing your "capture groups".
Demo (regex step counts): Pattern #1 takes 62 steps; Pattern #2 takes 60 steps and produces a smaller output array.
$string="bskdkbfnz schema://example.com/bob/1. flslnenf. Ddndkdn schema://example.com/john/2";
// This one uses negated character classes with 2 capture groups
var_export(preg_match_all("~\bschema://example\.com/([^/]*)/([^.\s]*)~",$string,$out)?array_combine($out[1],$out[2]):'no matches');
echo "\n";
// This one uses negated character classes with 1 capture group. \K restarts the fullstring match.
var_export(preg_match_all("~\bschema://example\.com/([^/]*)/\K[^.\s]*~",$string,$out)?array_combine($out[1],$out[0]):'no matches');
Output:
array (
'bob' => '1',
'john' => '2',
)
array (
'bob' => '1',
'john' => '2',
)
If you find that your second targeted substring is matching too far because of a certain character, just add that character to the negated character class.
I can't be 100% confident regarding the variability of your data, but if entity substrings are always lowercase letters, you could use [a-z]. If id substrings are always numbers, you could use \d. This decision requires intimate knowledge of the expected input strings.
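To make the tightening advice concrete, here is a minimal sketch of the first pattern with stricter character classes. It assumes entities are lowercase letters only and ids are digits only, which may not hold for your real data; the `+` quantifiers and the `$pairs` variable are my additions for illustration.

```php
<?php
// Assumption: entity = lowercase letters only, id = digits only.
$string = "bskdkbfnz schema://example.com/bob/1. flslnenf. Ddndkdn schema://example.com/john/2";

// [a-z]+ and \d+ stop at the first character outside the class,
// so the trailing "." after the id is never consumed.
$pairs = preg_match_all('~\bschema://example\.com/([a-z]+)/(\d+)~', $string, $out)
    ? array_combine($out[1], $out[2])
    : [];

var_export($pairs);
```

With the sample string this should give the same bob/john pairs as the patterns above. Single quotes around the pattern avoid any doubt about how PHP treats backslashes in the string literal.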
Upvotes: 1
Reputation: 23958
Not sure if I understood the complexity of the question but this should do what you need.
I use the pattern to capture the entity and id and then I combine them with array_combine.
preg_match_all("~schema://example\.com/(.*?)/(.*?)(\.|\s|$)~", $txt, $matches);
$arr = array_combine($matches[1], $matches[2]);
var_dump($arr);
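One caveat, in case it matters for your data: array_combine keeps only the last value when a key repeats, and the question shows the same entity appearing with several ids. A possible sketch that collects every id per entity follows. The sample text and the `$grouped` name are hypothetical, and the negated character classes are borrowed from the other answer rather than from the pattern above.

```php
<?php
// Hypothetical sample input: "bob" appears twice to show why grouping matters.
$txt = "a schema://example.com/bob/1. b schema://example.com/bob/2 c schema://example.com/john/3";

$grouped = [];
// PREG_SET_ORDER yields one array per match: [full match, entity, id].
if (preg_match_all('~\bschema://example\.com/([^/\s]+)/([^.\s]+)~', $txt, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Append the id to the list kept for this entity.
        $grouped[$m[1]][] = $m[2];
    }
}

var_export($grouped);
```

Here `$grouped['bob']` ends up holding both '1' and '2' instead of only the last one. If each entity really does occur only once, the plain array_combine version is simpler.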
Upvotes: 1