John

Reputation: 4028

How to extract data from this string

I am not good at writing patterns to extract data. I have a long document, and below is the specific string that I need to extract data from.

<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>

I want to extract the XXXX, YYYYY, and ZZZZZ values.

My first step is to get XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ

$pattern = '/<p><span id="minPrice">^</span></a></span>/';
preg_match($pattern, $data, $matches);
echo ($matches[1]);

But it does not work. How can I extract XXXX, YYYYY, and ZZZZZ?

The document I have is full of badly encoded characters, so I cannot use loadHTML. It just returns an error.

UPDATE 1: So I am able to do

        var_dump(libxml_use_internal_errors(true));
        $DOM = new DOMDocument;
        $DOM->loadHTML($data);
        $items = $DOM->getElementById('minPrice');

And $items is

 DOMElement Object
(
    [tagName] => span
    [schemaTypeInfo] => 
    [nodeName] => span
    [nodeValue] => 最安価格(税込):¥131,649
    [nodeType] => 1
    [parentNode] => (object value omitted)
    [childNodes] => (object value omitted)
    [firstChild] => (object value omitted)
    [lastChild] => (object value omitted)
    [previousSibling] => 
    [nextSibling] => (object value omitted)
    [attributes] => (object value omitted)
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => span
    [baseURI] => 
    [textContent] => 最安価格(税込):¥131,649
)

The html is

<span id="minPrice">
    �ň����i(�ō�)�F
    <a href="http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku" target="_blank">
        <span>&yen;131,649</span>
    </a>
</span>

How can I extract http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku and 131,649?

Upvotes: 3

Views: 120

Answers (4)

As an alternative, you can use the PHP Simple HTML DOM Parser library (download the library first). Code:

// Adjust the path to wherever the library file is stored
require_once __DIR__ . '/simple_html_dom.php';

// Load the markup from a string...
$html = str_get_html('<p><span id="minPrice">最安価格(税込)<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>');
// ...or load it from a file instead:
// $html = file_get_html('example.html');

// Select the first link inside span#minPrice
$link = $html->find('p span#minPrice a', 0);
// Get the URL
$href = $link->href;
// Get the text of the span inside the link
$span_in_link = $link->find('span', 0)->plaintext;
// Remove the <a> element so only the outer span's own text is left
$link->outertext = '';
// Get the remaining text of span#minPrice
$span_id_minPrice = $html->find('p span#minPrice', 0)->plaintext;
// Strip the &yen; entity from the price
$span_in_link = str_replace('&yen;', '', $span_in_link);
// Result
echo $span_id_minPrice.'<br>'; // 最安価格(税込)
echo $href.'<br>';             // YYYYY
echo $span_in_link.'<br>';     // ZZZZZ

If the document contains more than one such element, use this:

$html = str_get_html('
    <p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>
    <p><span id="minPrice">XXXX2<a href="YYYYY2" target="_blank"><span>&yen;ZZZZZ2</span></a></span>
');

// Select every span#minPrice
$all_span = $html->find('p span#minPrice');
$data = array();
foreach ($all_span as $element) {
    $array = array();
    // Select the first link inside the span
    $link = $element->find('a', 0);
    // Get the URL
    $href = $link->href;
    // Get the text of the link (the price)
    $span_in_link = $link->plaintext;
    // Empty the link so only the outer span's own text is left
    $link->innertext = '';
    // Get the remaining text of span#minPrice
    $span_id_minPrice = $element->plaintext;
    // Strip the &yen; entity from the price
    $span_in_link = str_replace('&yen;', '', $span_in_link);

    $array['span#minPrice'] = $span_id_minPrice;
    $array['href'] = $href;
    $array['span_in_link'] = $span_in_link;

    $data[] = $array;
}

echo '<pre>';
print_r($data);

Result:

Array (

[0] => Array
    (
        [span#minPrice] => XXXX 
        [href] => YYYYY
        [span_in_link] => ZZZZZ 
    )

[1] => Array
    (
        [span#minPrice] => XXXX2 
        [href] => YYYYY2
        [span_in_link] => ZZZZZ2 
    )

)

Upvotes: 0

Wiktor Stribiżew

Reputation: 626738

You can use the following code line to enable internal error handling for the DOM parser:

libxml_use_internal_errors(true);

Then, you can access the data you need with this sample code:

$html = <<<DATA
<p><span id="minPrice">最安価格(税込):<a href="http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku" target="_blank"><span>&yen;131,649</span></a></span>
DATA;

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
$spans = $xpath->query('//span[@id="minPrice"]');   // Get all spans with ID=minPrice
$a = array();
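foreach ($spans as $span) {
    foreach ($span->childNodes as $child) {         // Check the child nodes
        if ($child->nodeName == "a") {
            // Collect the URL and the price from the link itself, so the result
            // does not depend on which child node happens to come last
            array_push($a, $child->getAttribute("href"));
            array_push($a, preg_replace('~^.*?(\d+(?:,\d+)*)$~u', '$1', $child->nodeValue));
        }
    }
}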

print_r($a);

Result:

Array
(
    [0] => http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku
    [1] => 131,649
)

I used a regex to extract the number at the end of the string, but you can also split the string on the yen symbol with explode:

$num = explode(html_entity_decode("&yen;"), $child->nodeValue)[1];
array_push($a, $num);
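
The sample above keeps everything on one line, while the markup quoted in the question has line breaks and indentation between the tags, which add whitespace text nodes and surround the price with whitespace. A minimal sketch under that assumption: it selects the link directly with XPath and picks the digit group out of its text. The variable names are illustrative, and only the href and the digits are read, so the encoding of the Japanese label does not affect the result here.

```php
<?php
// Markup as quoted in the question, including the whitespace between tags.
$html = <<<'DATA'
<span id="minPrice">
    最安価格(税込):
    <a href="http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku" target="_blank">
        <span>&yen;131,649</span>
    </a>
</span>
DATA;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$results = array();
// Select the <a> elements that are direct children of span#minPrice
foreach ($xpath->query('//span[@id="minPrice"]/a') as $link) {
    $price = null;
    // Take the first run of digits with optional thousands separators
    if (preg_match('~\d+(?:,\d+)*~', $link->textContent, $m)) {
        $price = $m[0];
    }
    $results[] = array(
        'href'  => $link->getAttribute('href'),
        'price' => $price,
    );
}

print_r($results);
```

Searching for the digits with preg_match, rather than anchoring a pattern to the whole string, keeps the extraction independent of the surrounding whitespace.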


Upvotes: 1

Rahul

Reputation: 726

Use this Regexp -

/<p><span.*id=\"minPrice\">(.*)<a.*href="(.*?)".*>.*<span>.*;(.*?)<\/span>.*/

Result -

  1. XXXX
  2. YYYYY
  3. ZZZZZ
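
For reference, a short sketch of how this pattern could be applied with preg_match to the sample string from the question. The variable names $label, $href and $price are illustrative only.

```php
<?php
// Sample string from the question.
$data = '<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>';

$pattern = '/<p><span.*id=\"minPrice\">(.*)<a.*href="(.*?)".*>.*<span>.*;(.*?)<\/span>.*/';

if (preg_match($pattern, $data, $matches)) {
    // $matches[0] holds the whole match; groups 1 to 3 hold the captured values.
    list(, $label, $href, $price) = $matches;
    echo $label, "\n";  // XXXX
    echo $href, "\n";   // YYYYY
    echo $price, "\n";  // ZZZZZ
}
```

Because `.` does not match newlines by default, this only works when the whole element sits on a single line.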

Upvotes: 0

apokryfos

Reputation: 40653

This can be done with a regular expression. The one that gets that exact match is:

$regex = "/<p><span id=\"minPrice\">(.*?)<a href=\"(.*?)\" target=\"_blank\"><span>&yen;(.*)<\/span><\/a>/";
preg_match($regex, $data, $matches);

However, as mentioned in the comments, regex is not an appropriate tool for this task. This regex will probably fail if the document is long and nests these matchable patterns (i.e. if XXXX is another one of these paragraphs). You should probably see whether you can fix the document to make it proper XHTML and then use a proper XML parser. You can mitigate the problem by running this regex on each line of input (assuming it is split into lines properly), but that is still not ideal.
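
The per-line mitigation could be sketched as follows. The two-line input is hypothetical, and the sketch assumes each line holds at most one matching paragraph.

```php
<?php
$regex = "/<p><span id=\"minPrice\">(.*?)<a href=\"(.*?)\" target=\"_blank\"><span>&yen;(.*)<\/span><\/a>/";

// Hypothetical input: one matching paragraph per line.
$data = '<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>' . "\n"
      . '<p><span id="minPrice">XXXX2<a href="YYYYY2" target="_blank"><span>&yen;ZZZZZ2</span></a></span>';

$rows = array();
// \R matches any line-break sequence (\n, \r\n, \r)
foreach (preg_split('/\R/', $data) as $line) {
    if (preg_match($regex, $line, $m)) {
        $rows[] = array('text' => $m[1], 'href' => $m[2], 'price' => $m[3]);
    }
}

print_r($rows);
```

Matching line by line keeps the greedy `(.*)` group from running across several paragraphs, but it still breaks as soon as one element spans more than one line.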

Upvotes: 0
