Reputation: 41
I have the following html code:
<td class="role" style=""><a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Microsoft">Microsoft</a><br />
<a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Corbis">Corbis</a><br />
Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">Bill & Melinda Gates Foundation</a><br />
<a href="/wiki/Creative_Director" title="Creative Director" class="mw- redirect">Director</a> of <a href="/wiki/Berkshire_Hathaway">Berkshire Hathaway</a><br/>
<a href="/wiki/CEO" class="mw-redirect" title="CEO">CEO</a> of <a href="/wiki/Cascade_Investment">Cascade Investment</a></td>
For the above td element, there are semantically five rows, separated by "<br/>". I want to get the five lines as:
Chairman of Microsoft
Chairman of Corbis
Co-Chair of the Bill & Melinda Gates Foundation
Director of Berkshire Hathaway
CEO of Cascade Investment
Currently, my solution is to first get all br elements inside this td:
    br_value = td_node.select('.//br')
Then, for each item in br_value, I use the following code to get all the text:
    for br_item in br_value:
        one_item = br_item.select('.//preceding-sibling::*/text()').extract()
In this case, I get the lines as:
Chairman Microsoft
Chairman Corbis
Bill & Melinda Gates Foundation
Director Berkshire Hathaway
CEO Cascade Investment
Compared with the text I want, these lines are missing "of" and some other text.
The reason is that "preceding-sibling::*" only returns sibling elements; it can't return the text nodes that belong directly to the parent td, such as "of" in this case.
Does anyone know how to extract the complete text of each line separated by the br tags?
Thanks
Upvotes: 3
Views: 1343
Reputation: 71
I wrote this small PHP function:
    function getCleanLines($rawContent)
    {
        $cleanLines = array();
        // Grab the inner HTML of the first <td class="role"> cell.
        $regEx = '/<td\sclass="role"[^>]*>(?<CONTENT>.*?)<\/td>/ms';
        preg_match_all($regEx, $rawContent, $matches);
        if (isset($matches['CONTENT'][0]))
        {
            $content = $matches['CONTENT'][0];
            // Split the cell content at each <br>, <br/> or <br />.
            $regEx = '/(?<DATA>.*?)(?:<br\s*\/?>|\z)/ms';
            preg_match_all($regEx, $content, $matchedLines);
            if (isset($matchedLines['DATA']))
            {
                foreach ($matchedLines['DATA'] as $singleLine)
                {
                    // Strip the <a ...> and </a> tags but keep their text.
                    $regEx = '#(<a[^>]*>)|(</a>)#';
                    $cleanLine = preg_replace($regEx, '', $singleLine);
                    // Collapse whitespace and drop the leading newline left after each <br>.
                    $cleanLine = trim(preg_replace('/\s+/', ' ', $cleanLine));
                    if ($cleanLine !== '')
                    {
                        $cleanLines[] = $cleanLine;
                    }
                }
            }
        }
        return $cleanLines;
    }
Use it like this:
    $input = 'PUT THE HTML FROM THE QUESTION HERE';
    print_r(getCleanLines($input));
Upvotes: 0
Reputation: 59674
Use this XPath query:
    //div[@id='???']/descendant-or-self::*[not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)]/text()
That is, to select just the text from the current node and all of its descendants, use a query of this kind: ./descendant-or-self::*/text()
Or shorter (thanks to Empo): .//text()
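Note that a query like .//text() returns every text node in document order, including " of ", but it does not by itself group the text by br. As a rough illustration of the grouping step, here is a minimal sketch that swaps in Python's standard-library html.parser instead of Scrapy selectors, so it runs without extra dependencies. The names BrLineParser, extract_lines and SAMPLE are made up for this example, and it assumes a single, non-nested td with class="role".

```python
from html.parser import HTMLParser


class BrLineParser(HTMLParser):
    """Collect text inside <td class="role">, starting a new line at each <br>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_cell = False   # True while we are inside the target td
        self.lines = [[]]      # one list of text fragments per logical line

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "role":
            self.in_cell = True
        elif tag == "br" and self.in_cell:
            # <br>, <br/> and <br /> all end up here; begin a new line.
            self.lines.append([])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Keep all text, whether it sits inside an <a> or directly in the td.
        if self.in_cell:
            self.lines[-1].append(data)


def extract_lines(html):
    parser = BrLineParser()
    parser.feed(html)
    parser.close()
    # Join the fragments of each line and collapse runs of whitespace.
    joined = (" ".join("".join(parts).split()) for parts in parser.lines)
    return [line for line in joined if line]


# The td from the question, used as sample input.
SAMPLE = """<td class="role" style=""><a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Microsoft">Microsoft</a><br />
<a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Corbis">Corbis</a><br />
Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">Bill & Melinda Gates Foundation</a><br />
<a href="/wiki/Creative_Director" title="Creative Director" class="mw- redirect">Director</a> of <a href="/wiki/Berkshire_Hathaway">Berkshire Hathaway</a><br/>
<a href="/wiki/CEO" class="mw-redirect" title="CEO">CEO</a> of <a href="/wiki/Cascade_Investment">Cascade Investment</a></td>"""

if __name__ == "__main__":
    for line in extract_lines(SAMPLE):
        print(line)
```

The same idea should carry over to XPath: walk the td's child nodes (elements and text nodes alike) in order, and start a new line whenever the node is a br.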
Upvotes: 2