Reputation: 41

How to parse the following HTML and get all the text before each "br" tag

I have the following html code:

    <td class="role" style=""><a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Microsoft">Microsoft</a><br />
    <a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Corbis">Corbis</a><br />
    Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">Bill &amp; Melinda   Gates Foundation</a><br />
    <a href="/wiki/Creative_Director" title="Creative Director" class="mw- redirect">Director</a> of <a href="/wiki/Berkshire_Hathaway">Berkshire Hathaway</a><br/>
    <a href="/wiki/CEO" class="mw-redirect" title="CEO">CEO</a> of <a  href="/wiki/Cascade_Investment">Cascade Investment</a></td>

Semantically, the td element above contains five rows separated by "<br/>". I want to get those five lines as:

Chairman of Microsoft

Chairman of Corbis

Co-Chair of the Bill & Melinda Gates Foundation

Director of Berkshire Hathaway

CEO of Cascade Investment

Currently, my solution is to first get all br inside this td, as:

    br_value = td_node.select('.//br')

Then, for each br_item in br_value, I use the following code to get the text:

    for br_item in br_value:
        one_item = br_item.select('.//preceding-sibling::*/text()').extract()

With this, I get lines like:

Chairman Microsoft

Chairman Corbis

Bill & Melinda Gates Foundation

Director Berkshire Hathaway

CEO Cascade Investment

Compared with the text I want, these lines are missing "of" as well as some other text, such as "Co-Chair of the".

The reason is that "preceding-sibling::*" only returns sibling elements. It does not return text that belongs directly to the parent td, such as "of" in this case.
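The same distinction can be seen outside XPath. The following is a minimal sketch, assuming Python's standard-library `xml.etree.ElementTree` rather than the Scrapy selector used above; the shortened `fragment` string and the variable names are made up for illustration. It shows that " of " is not inside any `<a>` element: ElementTree keeps it as the `tail` of the preceding element, so anything that visits only the element children never sees it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the first row of the <td> above.
fragment = (
    '<td><a href="/wiki/Chairman">Chairman</a> of '
    '<a href="/wiki/Microsoft">Microsoft</a><br /></td>'
)
td = ET.fromstring(fragment)

# Text that lives inside the element children of <td>
# (all that an element-only query can reach).
element_texts = [child.text for child in td if child.text]

# Text that sits between the elements, directly under <td>.
# ElementTree exposes it as the "tail" of the element before it.
between_texts = [child.tail for child in td if child.tail]

print(element_texts)
print(between_texts)
```

In XPath terms, the `*` node test matches elements only, whereas `text()` and `node()` also match text nodes.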

Does anyone know how to extract the complete text of each row separated by the br tags?

Thanks

Upvotes: 3

Answers (2)

Vladica Savic

Reputation: 71

I wrote this small function:

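    function getCleanLines($rawContent)
    {
        $cleanLines = array();

        // Grab the inner HTML of the first <td class="role" ...> cell.
        $regEx = '/<td\sclass="role"[^>]*>(?<CONTENT>.*?)<\/td>/ms';
        preg_match_all($regEx, $rawContent, $matches);

        if(isset($matches['CONTENT'][0]))
        {
            $content = $matches['CONTENT'][0];

            // Split the cell into chunks ending at a <br/> tag or at the end of the string.
            $regEx = '/(?<DATA>.*?)(?:<br\s*\/>|\z)/ms';
            preg_match_all($regEx, $content, $matchedLines);

            if(isset($matchedLines['DATA']))
            {
                foreach($matchedLines['DATA'] as $singleLine)
                {
                    // Strip the <a ...> and </a> tags but keep the text between them.
                    $regEx = '#(<a[^>]*>)|(</a>)#';
                    $cleanLine = preg_replace($regEx, '', $singleLine);

                    // Collapse whitespace runs and trim, so that lines after the first
                    // do not start with the newline and indentation that follow <br/>.
                    $cleanLine = trim(preg_replace('/\s+/', ' ', $cleanLine));
                    if($cleanLine !== '')
                    {
                        $cleanLines[] = $cleanLine;
                    }
                }
            }
        }
        return $cleanLines;
    }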

Use it like this:

    $input = 'PUT THE HTML FROM THE QUESTION HERE';
    print_r(getCleanLines($input));

Upvotes: 0

warvariuc

Reputation: 59674

Use this XPath query:

    //div[@id='???']/descendant-or-self::*[not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)]/text()

That is, to select just the text from the current node and all of its descendants, use a query of this kind:

    ./descendant-or-self::*/text()

Or, shorter (thanks to Empo):

    .//text()
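A query such as `.//text()` returns every text node as one flat list, so the rows still have to be grouped at each br. The following is a minimal sketch of that grouping, assuming only Python's standard-library `html.parser` rather than Scrapy's selector API; the names `BrLineParser` and `text_lines_split_on_br` are hypothetical and chosen for illustration. It walks the fragment in document order, appends each piece of text to the current row, and starts a new row whenever a br tag is seen.

```python
from html.parser import HTMLParser


class BrLineParser(HTMLParser):
    """Collect the text of an HTML fragment, starting a new row at every <br>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into "&".
        super().__init__(convert_charrefs=True)
        self.rows = [[]]

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <br /> are also routed here by default.
        if tag == "br":
            self.rows.append([])

    def handle_data(self, data):
        # Both link text and bare text such as " of " arrive here, in order.
        self.rows[-1].append(data)


def text_lines_split_on_br(html):
    """Return the visible text of `html` as a list of rows split on <br>."""
    parser = BrLineParser()
    parser.feed(html)
    parser.close()
    # Join the pieces of each row and collapse runs of whitespace.
    joined = (" ".join("".join(parts).split()) for parts in parser.rows)
    return [line for line in joined if line]


sample = (
    '<td class="role"><a href="/wiki/Chairman">Chairman</a> of '
    '<a href="/wiki/Microsoft">Microsoft</a><br />\n'
    '    Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">'
    'Bill &amp; Melinda   Gates Foundation</a><br/>\n'
    '    <a href="/wiki/CEO">CEO</a> of '
    '<a href="/wiki/Cascade_Investment">Cascade Investment</a></td>'
)

for line in text_lines_split_on_br(sample):
    print(line)
```

Inside a Scrapy callback, the string passed in would be the extracted HTML of the td node. This sketch does not skip script or style content the way the first query above does.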

Upvotes: 2
