Reputation: 39
I would like to use preg_match in PHP to parse the site language out of an HTML document.
My preg_match:
$sitelang = preg_match('!<html lang="(.*?)">!i', $result, $matches) ? $matches[1] : 'Site Language not detected';
This works when the tag has only the lang attribute, without any class or id. For example: Input:
<html lang="de">
Output:
de
But it fails when the tag has other attributes, like this: Input:
<html lang="en" class="desktop-view not-mobile-device text-size-normal anon">
Output:
en" class="desktop-view not-mobile-device text-size-normal anon
I need just the language code (en, de, en-En, de-DE).
Thanks for your advice or code.
UPDATE:
Another example, where the lang attribute is not the first attribute:
<html data-n-head-ssr lang="en">
Output:
Site Language not detected
Upvotes: 1
Views: 256
Reputation: 18490
When parsing arbitrary HTML, prefer an HTML parser such as DOMDocument.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$lang = $dom->getElementsByTagName('html')[0]->getAttribute('lang');
See the PHP demo at tio.run (the @ is used to suppress errors if anything goes wrong).
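As a minimal sketch of the same idea wrapped in a function, assuming you would rather have libxml collect its parse warnings than silence them with @: the function name detectSiteLang and the null fallback are illustrative, not part of the snippet above.

```php
<?php
// Hypothetical helper: returns the lang attribute of <html>, or null if it is absent.
function detectSiteLang(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() does not accept an empty string
    }

    // Let libxml collect parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $dom->getElementsByTagName('html')->item(0);
    if (!($root instanceof DOMElement) || !$root->hasAttribute('lang')) {
        return null;
    }

    return trim($root->getAttribute('lang'));
}

// Usage with the tag from the question's update.
var_dump(detectSiteLang('<html data-n-head-ssr lang="en"><body></body></html>'));
```

Returning null instead of an empty string lets the caller tell "no lang attribute" apart from an empty one and pick its own fallback message.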
If you insist on using regex, here is a somewhat broader pattern that matches more cases:
$pattern = '~<html\b[^><]+?\blang\s*=\s*["\']\s*\K[^"\']+~i';
$lang = preg_match($pattern, $html, $out) ? $out[0] : "";
\K resets the beginning of the reported match, so we don't need a capture group.
See regex demo at regex101 (explanation on right side) or a PHP demo at tio.run
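To check the pattern against the question, here is a small sketch that runs it over the three <html> tags quoted there. The variable names $samples and $results are just for illustration.

```php
<?php
// Pattern from the answer; \K drops everything matched so far from $out[0].
$pattern = '~<html\b[^><]+?\blang\s*=\s*["\']\s*\K[^"\']+~i';

// The three <html> tags from the question.
$samples = [
    '<html lang="de">',
    '<html lang="en" class="desktop-view not-mobile-device text-size-normal anon">',
    '<html data-n-head-ssr lang="en">',
];

$results = [];
foreach ($samples as $html) {
    // An empty string stands for "no lang attribute found".
    $results[] = preg_match($pattern, $html, $out) ? $out[0] : '';
}

print_r($results);
```

One caveat: \b also sees a word boundary after a hyphen, so an attribute such as data-lang placed before lang would be matched first. If that matters for your input, the DOMDocument approach is the safer choice.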
FYI: your pattern <html lang="(.*?)"> lazily matches anything from <html lang=" up to the first ">, so any attributes that follow lang end up inside the capture.
Upvotes: 1
Reputation: 329
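You can use this pattern to detect the language correctly:
preg_match('!<html[^>]*\s+lang="([^"]+)"!i', $result, $matches)
It also works for your last sample, where lang is not the first attribute. The [^>]* keeps the match inside the <html ...> tag.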
Upvotes: 1
Reputation: 53573
Standard disclaimer of using regex to parse HTML aside, there are two things you likely want. First, get rid of the closing bracket in your pattern. Once you have the close quote, the rest of the line doesn't matter. Second, make sure what's inside the quotes doesn't itself contain quotes.
Current, open quote, then anything, then close quote:
preg_match('!<html lang="(.*?)">!i', $result, $matches)
This means that if you have lang="foo" class="bar"> you get foo" class="bar as the match. The .*? is lazy, but the dot also matches quotes, and the pattern requires "> right after the capture, so the match keeps extending until it reaches the quote that is directly followed by >.
New, inside the quotes, one or more of anything but a quote:
preg_match('!<html lang="([^"]+)"!i', $result, $matches)
If you want to be more resilient, change the hard space to one or more whitespace chars:
preg_match('!<html\s+lang="([^"]+)"!i', $result, $matches)
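To make the difference concrete, the sketch below runs the question's pattern and the negated-character-class pattern against a shortened version of the question's input. The helper name firstGroup is illustrative.

```php
<?php
// Hypothetical helper: returns capture group 1, or null when the pattern does not match.
function firstGroup(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

$lazyDot  = '!<html lang="(.*?)">!i';     // pattern from the question
$noQuotes = '!<html\s+lang="([^"]+)"!i';  // anything but a quote inside the quotes

$tag = '<html lang="en" class="desktop-view anon">';

// Captures: en" class="desktop-view anon
var_dump(firstGroup($lazyDot, $tag));

// Captures: en
var_dump(firstGroup($noQuotes, $tag));

// Both patterns expect lang directly after <html, so the tag from the
// question's update does not match and the helper returns null.
var_dump(firstGroup($noQuotes, '<html data-n-head-ssr lang="en">'));
```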
Upvotes: 1