thevoipman
thevoipman

Reputation: 1833

Regex Html Tricky

I have this regex line but it's not working perhaps due to newlines? My goal is to extract the passengers name and phone number.

Here is a snippet of the data i have... it's in a loop of 100 of the below:

<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Ms Wendy Walker-hunter
    </p>

    <p>
        <b>Mobile Number:</b><br />
        161525961468
    </p>

I'm currently just trying to get passengers name first...

$re = '/(?<=Name)(.*)(?=Mobile)/s';
preg_match($re, $str, $matches);

// Print the entire match result
print_r($matches);

Any kind of help I can get on this is greatly appreciated!

Upvotes: 0

Views: 40

Answers (2)

miken32
miken32

Reputation: 42681

Never parse HTML with a regular expression. Here's how you should be doing this sort of thing:

$html = '<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Ms Wendy Walker-hunter
    </p>

    <p>
        <b>Mobile Number:</b><br />
        161525961468
    </p>
</div>
<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Mr John Walker
    </p>

    <p>
        <b>Mobile Number:</b><br />
        16153682486
    </p>
</div>
';
libxml_use_internal_errors(true);
$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//div[@class='booking-section']/p[1]/text()[normalize-space()]");
foreach ($results as $node) {
    echo trim($node->textContent) . "\n";
}

This uses an XPath query to get the nodes you're looking for:

//div[@class='booking-section']/p[1]/text()[normalize-space()]

This tells it to select bare text nodes from the first <p> element inside a <div> with class attribute of "booking-section."


According to the documentation:

this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions may be used to handle these errors.

I've enabled libxml's internal error handling for this example, to suppress any warnings about the HTML, though of course you should not be outputting warnings to users anyway.

Upvotes: 1

Vasil Anagnostos
Vasil Anagnostos

Reputation: 39

This should work if snippets are always formatted as the example, it relies on the new lines:

$t = '
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468
  </p>
</div>';

preg_match('/Passenger Name:[^\r?\n]+\r?\n([^\r?\n]+)\r?\n/', $t, $name);

preg_match('/Mobile Number:[^\r?\n]+\r?\n([^\r?\n]+)\r?\n/', $t, $phone);

echo trim($name[1]), ' / ', trim($phone[1]);

Outpus is: Ms Wendy Walker-hunter / 161525961468

Same with preg_match_all:

$t = '
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468
  </p>
</div>
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter 2
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468 2
  </p>
</div>
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter 3
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468 3
  </p>
</div>';

preg_match_all('/Passenger Name:[^\r?\n]+\r?\n([^\r?\n]+)\r?\n/', $t, $name);

preg_match_all('/Mobile Number:[^\r?\n]+\r?\n([^\r?\n]+)\r?\n/', $t, $phone);

echo '<pre>';
print_r($name);
print_r($phone);
die;

Output is something like

Array
(
    [1] => Array
    (
            [0] =>     Ms Wendy Walker-hunter
            [1] =>     Ms Wendy Walker-hunter 2
            [2] =>     Ms Wendy Walker-hunter 3
        )

)
Array
(
    [1] => Array
    (
            [0] =>     161525961468
            [1] =>     161525961468 2
            [2] =>     161525961468 3
        )

)

Upvotes: 0

Related Questions