Reputation: 61
I am new to PHP. I have an XML file and want to extract its sentences into an array using PHP, breaking each sentence into groups of three words.
The XML below is from that file.
<?xml version="1.0" encoding="utf-8" ?>
<document>
<content>
<segment>
<sentence>
<word>Hi</word>
<word>there</word>
<word>people</word>
<word>I</word>
<word>want</word>
<word>to</word>
<word>introduce</word>
<word>you</word>
<word>to</word>
<word>my</word>
<word>world</word>
</sentence>
<sentence>
<word>Hi</word>
<word>there</word>
<word>people</word>
<word>I</word>
<word>want</word>
<word>to</word>
<word>introduce</word>
<word>you</word>
<word>to</word>
<word>my</word>
<word>world</word>
</sentence>
</segment>
</content>
</document>
The output will be:
Hi there people
I want to
introduce you to
my world
Hi there people
I want to
introduce you to
my world
I have created a function to process the XML transcript.
function loadTranscript($xml) {
$getfile = file_get_contents($xml);
$arr = simplexml_load_string($getfile);
foreach ($arr->content->segment->sentence as $sent) {
$count = str_word_count($sent,1);
$a=array_chunk($count,3);
foreach ($a as $a){
echo implode(' ',$a);
echo PHP_EOL;
}
}
}
But it does not produce that output. Is $sent considered an array? I want to split the sentences at the XML level.
Upvotes: 3
Views: 178
Reputation: 1769
Is $xml a string or a file path? I'm assuming it is a string for this answer.
Use DOMDocument to do it:
function loadTranscript($xml) {
$doc = new DOMDocument();
$doc->loadXML($xml);
$words = $doc->getElementsByTagName('word');
$i = 0;
foreach ($words as $word) {
if ($i >= 3) {
echo "\n";//it works on console. For browsers you should use echo "<br>";
$i = 0;
}
echo $word->nodeValue.' ';
$i++;
}
}
I used an extra $i counter to avoid nesting one foreach inside another, but you can adapt the code to your needs.
As suggested by @CD001 in the comments, here is a new version that handles more than one <sentence> tag.
function loadTranscript($xml) {
$doc = new DOMDocument();
$doc->loadXML($xml);
$sentences = $doc->getElementsByTagName('sentence');
foreach($sentences as $sentence) {
$words = $sentence->getElementsByTagName('word');
$i = 0;
foreach ($words as $word) {
if ($i >= 3) {
echo "\n";
$i = 0;
}
echo $word->nodeValue.' ';
$i++;
}
echo "\n";
}
}
To read the XML from a file, replace $doc->loadXML($xml); with $doc->load('file/path/string.xml');
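Since the question asks for an array rather than printed output, here is a minimal sketch of the same DOMDocument approach that collects the lines instead of echoing them. The function name transcriptLines and the $size parameter are hypothetical, not part of the answer above, and the input is assumed to be a non-empty XML string.

```php
<?php
// Hypothetical helper: returns one string per group of up to $size words,
// restarting the grouping at every <sentence>.
function transcriptLines(string $xml, int $size = 3): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return []; // the XML could not be parsed
    }

    $lines = [];
    foreach ($doc->getElementsByTagName('sentence') as $sentence) {
        $group = [];
        foreach ($sentence->getElementsByTagName('word') as $word) {
            $group[] = trim($word->nodeValue);
            if (count($group) === $size) {
                $lines[] = implode(' ', $group); // a full group is complete
                $group = [];
            }
        }
        if ($group !== []) {
            $lines[] = implode(' ', $group); // leftover words at the end of the sentence
        }
    }
    return $lines;
}
```

Joining with implode also avoids the trailing space that echo $word->nodeValue.' ' leaves at the end of each line.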
Upvotes: 1
Reputation: 97708
I'm not sure why everyone is so scared of SimpleXML, and I think it's definitely the right tool for this job.
$sent is not an array, but an object representing the <sentence> element and all its children; it has some array-like properties, but not ones that array_chunk can work with.
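To see this for yourself, a small inspection sketch may help. The two-word XML string here is made up for illustration; the variable names are likewise just for this example.

```php
<?php
// Made-up, shortened input for illustration only.
$sent = simplexml_load_string('<sentence><word>Hi</word> <word>there</word></sentence>');

$isObject  = $sent instanceof SimpleXMLElement; // true: an object, not an array
$isArray   = is_array($sent);                   // false
$wordCount = count($sent->word);                // number of <word> children

// Casting the <sentence> element to a string yields only its own direct text
// (here just the whitespace between the words), not the text of its children,
// so str_word_count() finds no words in it.
$ownText = trim((string) $sent);
$found   = str_word_count((string) $sent, 1);

var_dump($isObject, $isArray, $wordCount, $ownText, $found);
```

This is also why the original function prints nothing: str_word_count receives the string form of $sent, which contains no words.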
You can actually use array_chunk, but you need to do three things to make your current code work:
1. Convert $sent from object to array with (array)$sent (which will give an array of all children of the <sentence> node) or (array)$sent->word (which will limit it to those called <word>, in case there was a mixture)
2. Pass that array to array_chunk, not $count (which you don't need)
3. Use a different variable name for the loop variable (not foreach( $a as $a ))
So:
$chunks = array_chunk((array)$sent->word, 3);
foreach ($chunks as $a_chunk) {
echo implode(' ', $a_chunk);
echo PHP_EOL;
}
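Putting this together as a complete function, here is a rough sketch that returns the lines as an array. Note two assumptions: the name chunkSentences is hypothetical, and instead of the (array) cast it builds the word list with an explicit loop and (string) conversion, which is a swapped-in variant rather than the exact code above.

```php
<?php
// Hypothetical helper: returns an array of "word word word" strings.
function chunkSentences(string $xml, int $size = 3): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return []; // the XML could not be parsed
    }

    $lines = [];
    // As in the question, this path only visits the first <segment>.
    foreach ($doc->content->segment->sentence as $sent) {
        $words = [];
        foreach ($sent->word as $word) {
            $words[] = (string) $word; // plain strings, safe for array_chunk
        }
        foreach (array_chunk($words, $size) as $chunk) {
            $lines[] = implode(' ', $chunk);
        }
    }
    return $lines;
}
```

To print the result in the format from the question, something like echo implode(PHP_EOL, chunkSentences($xmlString)), PHP_EOL; should do, where $xmlString holds the file contents.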
Alternatively, you can do without array_chunk easily enough by just displaying a newline every third word:
$counter = 0;
foreach ( $words as $word ) {
$counter++;
echo $word;
if ( $counter % 3 == 0 ) {
echo PHP_EOL;
} else {
echo ' ';
}
}
Then all you need to do is nest that loop inside your existing one:
foreach ($arr->content->segment->sentence as $sent) {
$counter = 0;
foreach ( $sent->word as $word ) {
$counter++;
echo $word;
if ( $counter % 3 == 0 ) {
echo PHP_EOL;
} else {
echo ' ';
}
}
echo PHP_EOL;
}
Up to you which you think is cleaner, but it's good to understand both so you can adapt them to future needs.
Upvotes: 2
Reputation: 107587
Consider XSLT, the special-purpose, W3C-conformant language (sibling to XPath) designed to transform XML documents. XSLT can transform XML to text formats. With this approach, no foreach loop or if logic is needed in PHP. PHP can run XSLT 1.0 scripts with its built-in php-xsl extension, which may need to be enabled in the .ini file. And the beauty of XSLT is that a stylesheet is itself a well-formed XML file, so it can be loaded like the source XML, from a file or an embedded string.
Specifically, the XSLT uses a template shaped like the Identity Transform, but one that walks the document without copying any nodes. Then, for each <word> node, a second template outputs the word and checks whether the current position is a multiple of three to add a line break. It also adds a line break after the last <word>. Notice too that the <xsl:output> method at the top is text.
XSLT (save as .xsl)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="text"/>
<xsl:strip-space elements="*"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:apply-templates select="@*|node()"/>
</xsl:template>
<xsl:template match="word">
<xsl:value-of select="concat(., ' ')"/>
<xsl:if test="(position() mod 3) = 0">
<xsl:text>
</xsl:text>
</xsl:if>
<xsl:if test="position() = last()">
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:template>
</xsl:transform>
PHP
// LOAD XML AND XSL
$xml = new DOMDocument();
$xml->load('Input.xml');
$xsl = new DOMDocument;
$xsl->load('XSLTScript.xsl');
// INITIALIZE TRANSFORMER
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl);
// RUN TRANSFORMATION
$newXML = $proc->transformToXML($xml);
// ECHO STRING OUTPUT
echo $newXML;
# Hi there people
# I want to
# introduce you to
# my world
# Hi there people
# I want to
# introduce you to
# my world
Upvotes: 0