kkbum

Reputation: 61

Break sentences in XML using PHP

I am new to PHP. I have an XML file and I want to extract its sentences into an array using PHP, breaking each sentence into parts of three words each.
The XML below is from the file.

<?xml version="1.0" encoding="utf-8" ?>
<document>
    <content>
        <segment>
            <sentence>
                <word>Hi</word>
                <word>there</word>
                <word>people</word>
                <word>I</word>
                <word>want</word>
                <word>to</word>
                <word>introduce</word>
                <word>you</word>
                <word>to</word>
                <word>my</word>
                <word>world</word>
            </sentence>
            <sentence>
                <word>Hi</word>
                <word>there</word>
                <word>people</word>
                <word>I</word>
                <word>want</word>
                <word>to</word>
                <word>introduce</word>
                <word>you</word>
                <word>to</word>
                <word>my</word>
                <word>world</word>
            </sentence>
        </segment>
    </content>
</document>

The expected output is:

Hi there people
I want to 
introduce you to
my world
Hi there people
I want to 
introduce you to
my world

I have created a function to process the XML transcript.

function loadTranscript($xml) {
    $getfile = file_get_contents($xml);
    $arr = simplexml_load_string($getfile); 
    foreach ($arr->content->segment->sentence as $sent) {
        $count = str_word_count($sent,1);
        $a=array_chunk($count,3);
        foreach ($a as $a){
            echo implode(' ',$a);
            echo PHP_EOL;   
        }
    }
}

But it did not produce the expected output. Is $sent considered an array? I want to break the sentences at the XML level.

Upvotes: 3

Views: 178

Answers (3)

James

Reputation: 1769

Is $xml a string or a file path? I'm assuming it is a string for this answer.

Use DOMDocument to make it happen:

function loadTranscript($xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $words = $doc->getElementsByTagName('word');
    $i = 0;
    foreach ($words as $word) {
        if ($i >= 3) {
            echo "\n"; // works on the console; for browsers use echo "<br>";
            $i = 0;
        }
        echo $word->nodeValue.' ';
        $i++;
    }
}

I used an extra $i counter to avoid nesting one foreach inside another, but you can adapt the code to your needs.

As suggested by @CD001 in the comments, the following version handles more than one <sentence> tag.

function loadTranscript($xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $sentences = $doc->getElementsByTagName('sentence');
    foreach ($sentences as $sentence) {
        $words = $sentence->getElementsByTagName('word');
        $i = 0;
        foreach ($words as $word) {
            if ($i >= 3) {
                echo "\n";
                $i = 0;
            }
            echo $word->nodeValue.' ';
            $i++;
        }
        echo "\n";
    }
}

To read the XML from a file, replace $doc->loadXML($xml); with $doc->load('file/path/string.xml');
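
Since the question asks for the pieces in an array, the same DOMDocument traversal can collect the lines instead of echoing them. This is only a minimal sketch, assuming the XML arrives as a string; the function name transcriptLines and the $size parameter are made up for illustration and are not part of the code above.

```php
<?php
// Hypothetical helper (name and $size parameter are illustrative assumptions).
// Returns one string per group of $size words, sentence by sentence.
function transcriptLines(string $xml, int $size = 3): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $lines = [];
    foreach ($doc->getElementsByTagName('sentence') as $sentence) {
        // Gather the text of every <word> in this sentence.
        $words = [];
        foreach ($sentence->getElementsByTagName('word') as $word) {
            $words[] = trim($word->nodeValue);
        }
        // Group the words and join each group with single spaces,
        // which also avoids the trailing space of the echo version.
        foreach (array_chunk($words, $size) as $chunk) {
            $lines[] = implode(' ', $chunk);
        }
    }
    return $lines;
}

// Shortened sample input in the same shape as the question's XML.
$xml = '<document><content><segment>'
     . '<sentence><word>Hi</word><word>there</word><word>people</word><word>I</word></sentence>'
     . '<sentence><word>want</word><word>to</word></sentence>'
     . '</segment></content></document>';

$lines = transcriptLines($xml);
// $lines is ['Hi there people', 'I', 'want to']
echo implode(PHP_EOL, $lines), PHP_EOL;
```

Because each sentence is chunked separately, a short final group never spills into the next sentence.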

Upvotes: 1

IMSoP

Reputation: 97708

I'm not sure why everyone is so scared of SimpleXML, and I think it's definitely the right tool for this job.

$sent is not an array, but an object representing the <sentence> element and all its children; it has some array-like properties, but not ones that array_chunk can work with.

You can actually use array_chunk, but you need to do three things to make your current code work:

  • cast $sent from object to array with (array)$sent (which will give an array of all children of the <sentence> node) or (array)$sent->word (which will limit it to those called <word>, in case there was a mixture)
  • pass in that array to array_chunk, not $count (which you don't need)
  • don't use the same variable twice with conflicting meanings (foreach( $a as $a ))

So:

$chunks = array_chunk((array)$sent->word, 3);
foreach ($chunks as $a_chunk) {
    echo implode(' ', $a_chunk);
    echo PHP_EOL;   
}
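
For comparison, here is a sketch that builds the word list with explicit string casts rather than an (array) cast, and returns the chunks instead of printing them. The function name sentenceChunks is hypothetical. As far as I can tell, the original str_word_count($sent, 1) call found no words because casting the <sentence> element to a string yields only its own text (whitespace here), not the text of its <word> children.

```php
<?php
// Hypothetical helper: returns the words of one <sentence> grouped $size at a time.
function sentenceChunks(SimpleXMLElement $sentence, int $size = 3): array
{
    $words = [];
    foreach ($sentence->word as $word) {
        // Each $word is a SimpleXMLElement; cast it to get the plain text.
        $words[] = (string) $word;
    }
    return array_chunk($words, $size);
}

// Shortened sample input in the same shape as the question's XML.
$doc = simplexml_load_string(
    '<document><content><segment>'
    . '<sentence><word>Hi</word><word>there</word><word>people</word><word>I</word></sentence>'
    . '<sentence><word>want</word><word>to</word></sentence>'
    . '</segment></content></document>'
);

foreach ($doc->content->segment->sentence as $sent) {
    foreach (sentenceChunks($sent) as $chunk) {
        echo implode(' ', $chunk), PHP_EOL;
    }
}
```

Returning nested arrays keeps the grouping separate from the output, so the same helper can feed echo, a template, or further processing.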

Alternatively, you can do without array_chunk easily enough by just displaying a newline every third word:

$counter = 0;
foreach ( $words as $word ) {
    $counter++;
    echo $word;
    if ( $counter % 3 == 0 ) {
         echo PHP_EOL;
    } else {
         echo ' ';
    }
}

Then all you need to do is nest that loop inside your existing one:

foreach ($arr->content->segment->sentence as $sent) {
    $counter = 0;
    foreach ( $sent->word as $word ) {
        $counter++;
        echo $word;
        if ( $counter % 3 == 0 ) {
             echo PHP_EOL;
        } else {
             echo ' ';
        }
    }
    echo PHP_EOL;
}

Up to you which you think is cleaner, but it's good to understand both so you can adapt them to future needs.

Upvotes: 2

Parfait

Reputation: 107587

Consider XSLT, the special-purpose, W3C-conformant language (sibling to XPath) designed to transform XML documents. XSLT can also transform XML into text formats. With this approach, no foreach loop or if logic is needed on the PHP side. PHP can run XSLT 1.0 scripts with its built-in php-xsl extension, which may need to be enabled in the php.ini file. And the beauty of XSLT is that a stylesheet is itself a well-formed XML file, so it can be parsed like the source XML, from a file or from an embedded string.

Specifically, the XSLT uses a template modeled on the Identity Transform to walk the document without copying any nodes to the output. Then, for each <word> node, a second template prints the word and checks whether the current position is a multiple of three, in which case it adds a line break. It also adds a line break after the last <word> of each sentence. Notice too that the <xsl:output> method at the top is text.

XSLT (save as .xsl)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="text"/>
<xsl:strip-space elements="*"/>

  <!-- Identity Transform -->
  <xsl:template match="@*|node()">    
    <xsl:apply-templates select="@*|node()"/>    
  </xsl:template>  

  <xsl:template match="word">    
    <xsl:value-of select="concat(., ' ')"/>
    <xsl:if test="(position() mod 3) = 0">
      <xsl:text>&#xa;</xsl:text>
    </xsl:if>
    <xsl:if test="position() = last()">
      <xsl:text>&#xa;</xsl:text>
    </xsl:if>
  </xsl:template>

</xsl:transform>

PHP

// LOAD XML AND XSL
$xml = new DOMDocument();
$xml->load('Input.xml');

$xsl = new DOMDocument;
$xsl->load('XSLTScript.xsl');

// INITIALIZE TRANSFORMER
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); 

// RUN TRANSFORMATION
$newXML = $proc->transformToXML($xml);

// ECHO STRING OUTPUT
echo $newXML;

# Hi there people
# I want to
# introduce you to
# my world
# Hi there people
# I want to
# introduce you to
# my world
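
The stylesheet can also live in an embedded string rather than a separate file. The following is a sketch under assumptions: the function name chunkWordsWithXslt is hypothetical, the xsl extension must be enabled, and the word template is a slight variation on the one above that emits either a space or a line break after each word, so lines carry no trailing space.

```php
<?php
// Hypothetical helper; needs the xsl extension (XSLTProcessor) to be enabled.
function chunkWordsWithXslt(string $xmlString): string
{
    // Stylesheet kept in a nowdoc so PHP does not interpolate anything in it.
    $xslString = <<<'XSL'
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>

  <!-- Walk the tree without copying any nodes -->
  <xsl:template match="@*|node()">
    <xsl:apply-templates select="@*|node()"/>
  </xsl:template>

  <!-- Print each word, then a line break after every third or last word -->
  <xsl:template match="word">
    <xsl:value-of select="."/>
    <xsl:choose>
      <xsl:when test="position() mod 3 = 0 or position() = last()">
        <xsl:text>&#xa;</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text> </xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:transform>
XSL;

    $xml = new DOMDocument();
    $xml->loadXML($xmlString);

    $xsl = new DOMDocument();
    $xsl->loadXML($xslString);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);

    // With method="text" the result is plain text, not XML markup.
    return (string) $proc->transformToXml($xml);
}
```

A call such as echo chunkWordsWithXslt(file_get_contents('Input.xml')); would then print the chunked lines for the file used above.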

Upvotes: 0
