Prepare regex to work with well-formatted code

Question

I'm working on a data parser that gets data from zbozi.cz, and I have a problem. The parse function turns the data I get from zbozi.cz into valid JSON and decodes it. See https://github.com/Northys/Venom/blob/master/libs/Venom/Strings.php

I'm not familiar with regex, but I tried to write one with the help of my book. I have something like this (I shortened it):

/* */

I need to turn this into valid JSON so that my parse function can decode it. I'm using the pattern /.*\( / with preg_replace to delete everything before { id:...} appears. Unfortunately, in the future they might add more whitespace, tidy the code, or change something else that breaks my script.

All I need is to change the parse function (linked above). The regex pattern on line 23 and the str_replace calls on the following lines need to be replaced with preg_replace calls. Can you please help me?

This is the code my script works with: https://github.com/Northys/Venom/blob/master/crawled/1.html (just press Ctrl+F and search for Zbozi.Common.Result)

My script doesn't work with https://github.com/Northys/Venom/blob/master/crawled/0.html (line 305)

I need to change the regex so that it works with both files.

Casimir et Hippolyte · Accepted Answer

You can try this:

$subject = <<<'LOD'
/*  */
LOD;

$replacements = array(
    '~/\* \s*+ \Q '[',
    '~(?<=}) \s*+ , \s*+ null \s*+ $; \s*+ /\* \s*+ ]]> \s*+ \*/~x'                        => ']',
    '~(?> \\\\{2} )*+ \K \'~x'                                                             => '"',
    '~" [^"]*+ " (*SKIP) (*FAIL) | \s*+ (\w++) \s*+ : \s*+~x'                               => ' "$1":'
);

foreach ($replacements as $pattern => $replacement) {
    $subject = preg_replace($pattern, $replacement, $subject);
}

var_dump($subject);

Pattern details:

The first two patterns trim what you don't need before and after the (future) JSON object. The last two patterns deal with quotes.

In all patterns:

For readability, I use the x modifier (extended mode), so whitespace in the pattern is ignored. Likewise, the \Q.....\E syntax is used to write literal substrings (special characters lose their meaning inside it).
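As a small illustration of these two points, here is a sketch. The sample text and variable names are made up for the example and are not taken from zbozi.cz; only Zbozi.Common.Result comes from the question.

```php
<?php
// Hypothetical sample text; the real page content is not reproduced here.
$text = 'prefix /*   <![CDATA[   */ payload';

// Without the x modifier every space in the pattern is significant,
// and each [ has to be escaped by hand.
$compact = '~/\*\s*+<!\[CDATA\[\s*+\*/~';

// With the x modifier the spaces between tokens are ignored, and \Q...\E
// quotes the literal part, so [ needs no escaping.
$readable = '~/\* \s*+ \Q<![CDATA[\E \s*+ \*/~x';

$a = preg_replace($compact, '', $text);
$b = preg_replace($readable, '', $text);
var_dump($a === $b); // bool(true): both patterns remove the same substring

// Inside \Q...\E a dot is a literal dot, not "any character".
$loose  = preg_match('~Zbozi.Common.Result~', 'ZboziXCommonXResult');
$strict = preg_match('~\QZbozi.Common.Result\E~', 'ZboziXCommonXResult');
var_dump($loose, $strict); // int(1), int(0)
```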

All quantifiers are possessive (++ or *+) instead of plain quantifiers (+ or *). This isn't essential for the result (except in the third pattern), but it tells the regex engine that there's no need to record backtracking positions. You can find more about this here.
The same goes for the atomic groups (?>.....), which replace the non-capturing groups (?:.....).
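To see what "no backtracking positions" means in practice, here is a minimal sketch on a deliberately artificial input where the difference changes the result (in the patterns above it normally does not):

```php
<?php
// Greedy: \d+ first takes "12345", then gives back "5" so the final 5 can match.
$greedy = preg_match('~^\d+5$~', '12345');

// Possessive: \d++ takes "12345" and never gives anything back, so the
// final 5 has nothing left to match and the whole pattern fails.
$possessive = preg_match('~^\d++5$~', '12345');

// An atomic group around \d+ behaves like the possessive quantifier here.
$atomic = preg_match('~^(?>\d+)5$~', '12345');

var_dump($greedy, $possessive, $atomic); // int(1), int(0), int(0)
```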

First pattern:

Nothing special here: the literal asterisk must be escaped, and the \Q...\E syntax avoids having to escape opening square brackets and dots.

Second pattern:

A lookbehind (?<=}) checks that there is a closing curly bracket before the match. (It is only a check: the subpattern inside (?<=...) is not part of the match.)
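A short sketch of the difference between checking for the bracket and consuming it. The input string and the shortened pattern are hypothetical, chosen only to show the effect:

```php
<?php
// Hypothetical tail of the scraped text; only the shape matters here.
$tail = '{ id: 1 } , null';

// With the lookbehind, } is only checked, so it survives the replacement.
$kept = preg_replace('~(?<=}) \s*+ , \s*+ null~x', ']', $tail);

// Without the lookbehind, } is part of the match and gets replaced too.
$lost = preg_replace('~} \s*+ , \s*+ null~x', ']', $tail);

var_dump($kept); // string(10) "{ id: 1 }]"
var_dump($lost); // string(9) "{ id: 1 ]"
```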

Third pattern:

This pattern finds single quotes that are not escaped. To do that, you must verify that the single quote is preceded by an even number of backslashes, or by none. Indeed, \\' is two backslashes and a quote, while \\\' is two backslashes and an escaped quote (i.e. a literal quote).

\K removes the beginning of the match (the backslash check) from the match result. Only the single quote remains.
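The idea can be tried on a small made-up string. This is a sketch, not the exact pattern from the array above: the negative lookbehind (?<!\\) is an addition made for this example, on the assumption that a match should never begin in the middle of a run of backslashes. Nowdoc strings are used so that the backslashes appear exactly as the regex engine sees them.

```php
<?php
// Made-up input: a plain quote, an escaped quote, and a quote that follows
// an escaped backslash.
$subject = <<<'TXT'
a'b  c\'d  e\\'f
TXT;

/*
 * (?<!\\)        assumption added for this sketch: do not start inside
 *                a run of backslashes
 * ( \\{2} )*+    any number of backslash pairs, never given back
 *                (written as an atomic group in the pattern below)
 * \K             forget what was matched so far, keep only the quote
 */
$pattern = <<<'RE'
~(?<!\\) (?> \\{2} )*+ \K '~x
RE;

$result = preg_replace($pattern, '"', $subject);

var_dump($result);
```

With this input, only the first and the last quote are replaced; the \' in the middle is left alone because it is preceded by an odd number of backslashes.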

Fourth pattern:

This finds all words followed by a colon that are not inside double quotes (so the http: of a quoted URL is left alone).

You must first match all the content inside double quotes, "[^"]*+", in order to exclude it from the match result.
To do that, you can't use the \K trick, because you are in one branch of an alternation: .......\K|........ (If this first branch succeeded, preg_replace() would insert the replacement after each substring inside double quotes!)
The only way is to let the regex engine consume these double-quoted contents and then fail. For this trick, you can use two backtracking control verbs: (*SKIP) and (*FAIL).
(*SKIP) tells the regex engine that, if the pattern fails later, the text matched up to this point must be skipped: the next attempt starts after it rather than at the next character.
(*FAIL) forces the pattern to fail.

With that, all the content inside double quotes is skipped. The other branch of the alternation then finds only the words followed by a colon outside double quotes.
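Here is the fourth pattern applied on its own to a small made-up input, as a sketch of the behaviour described above. The sample object and the example.com URL are hypothetical:

```php
<?php
// Hypothetical input: two bare keys, plus colons inside a quoted URL.
$subject = '{ url: "http://example.com/a:b", id: 5 }';

// Left branch: consume a whole double-quoted string, then force a failure;
// (*SKIP) makes the next attempt start after the closing quote.
// Right branch: a bare word followed by a colon, which becomes a quoted key.
$pattern = '~" [^"]*+ " (*SKIP) (*FAIL) | \s*+ (\w++) \s*+ : \s*+~x';

$result = preg_replace($pattern, ' "$1":', $subject);
var_dump($result); // { "url":"http://example.com/a:b", "id":5 }

// The result is now valid JSON and can be decoded.
$decoded = json_decode($result, true);
var_dump($decoded['url'], $decoded['id']);
```

The colons in http: and a:b are untouched because they sit inside the quoted string that the left branch skips.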
