Jimmy Sbordone Jr.

Reputation: 23

PHP & XML: How do I compare the text content of two XML Elements?

I am trying to write a script that goes through three existing XML documents and compiles a fourth XML document containing all of the morphemes (linguist-speak for parts of words) in the existing three. The new morpheme database should not contain any duplicates, but my script keeps adding them. I will post the relevant snippet immediately below, and the entire chunk of relevant code at the bottom.

The check for duplicates is as follows: ((string)$source == (string)$storySource), where $source and $storySource are both SimpleXMLElement objects like this: <m>text</m>. Can anyone tell me where I went wrong?

Best, Jimmy

Here is the entire loop for going through one of the XML files.

$storycorpus = new SimpleXMLElement($file,null,true);
$storyEntries = $storycorpus->xpath("//morpheme");
foreach($storyEntries as $entry){
    // check to see if in morpheme database. we will match the Pomo and the English, hence, if either is not a match,
    // we will add a new morpheme
    $storySource = $entry->m;
    $storyGloss = $entry->g;
    // set a variable equal to false
    $foundInDB = false; 

    //we will loop through the database looking for a match.    
    foreach($morphemeEntries as $existingMorpheme){
        $source = $existingMorpheme->source;
        $gloss = $existingMorpheme->gloss;

        // if we find a match, we will set our variable to be true and break out of the morpheme DB loop
        if(((string)$source == (string)$storySource) && ((string)$gloss == (string)$storyGloss)){
            $foundInDB = true; // problem: this line isn't firing
            break;
        }
    }
    // after the morphemeDB loop, we will check to see if the var is true. 
    if($foundInDB == true){
        // if it is true, we don't need to enter anything and can 
        // go to the next entry
        continue;
    } else{
        // if we didn't find a match, create a new morpheme
        $newMorphemeEntry = $morphemeDB->addChild("morpheme");
        $newMorphemeEntry->addChild("source", $storySource);
        $newMorphemeEntry->addChild("gloss", $storyGloss);
        $newMorphemeEntry->addChild("root", $storySource);
        $newMorphemeEntry->addChild("hypernym", $storySource);
        $newMorphemeEntry->addChild("link", "S");
        if(substr($storySource, 0, 1) == "-"){
            $newMorphemeEntry->addChild("affix", "suffix");
        } elseif(substr($storySource, -1, 1) == "-"){
            $newMorphemeEntry->addChild("affix", "prefix");
        } else{
            $newMorphemeEntry->addChild("affix", "root");
        }
    }
}

Okay, so I rewrote the block using DOMDocument instead of SimpleXML, and I am still not having any luck preventing duplicates. Here is the new code:

    // check to see if in morpheme database. we will match the Pomo and the English, hence, if either is not a match,
    // we will add a new morpheme
    $phraseSource = $entry->nodeValue;
    $phraseGlossId = $entry->getAttribute("id");
    $phraseGloss = $xpath2->query("//g[@id =\"$phraseGlossId\"]")->item(0)->nodeValue;
    // set a variable equal to false
    $foundInDB = false; 

    //we will loop through the database looking for a match.    
    foreach($morphemeEntries as $existingMorpheme){
        $source = $existingMorpheme->getElementsByTagName("source")->item(0)->nodeValue;
        $gloss = $existingMorpheme->getElementsByTagName("gloss")->item(0)->nodeValue;
        // if we find a match, we will set our variable to be true and break out of the morpheme DB loop
        if(($source == $phraseSource) && ($gloss == $phraseGloss)){
            $foundInDB = true; // problem: this line isn't firing
            break;
        }
    }
    // after the morphemeDB loop, we will check to see if the var is true. 
    if($foundInDB == true){
        // if it is true, we don't need to enter anything and can 
        // go to the next entry
        continue;
    } else{
        // if we didn't find a match, create a new morpheme
        $newMorphemeEntry = $morphemeXmlDoc->createElement("morpheme");

        $newMorphemeSource = $morphemeXmlDoc->createElement("source");
        $newMorphemeSource->nodeValue = $phraseSource;
        $newMorphemeEntry->appendChild($newMorphemeSource);

        $newMorphemeGloss = $morphemeXmlDoc->createElement("gloss");
        $newMorphemeGloss->nodeValue = $phraseGloss;
        $newMorphemeEntry->appendChild($newMorphemeGloss);

        $newMorphemeRoot = $morphemeXmlDoc->createElement("root");
        $newMorphemeRoot->nodeValue = $phraseSource;
        $newMorphemeEntry->appendChild($newMorphemeRoot);

        $newMorphemeHypernym = $morphemeXmlDoc->createElement("hypernym");
        $newMorphemeHypernym->nodeValue = $phraseSource;
        $newMorphemeEntry->appendChild($newMorphemeHypernym);

        $newMorphemeLink = $morphemeXmlDoc->createElement("link");
        $newMorphemeLink->nodeValue = "P";
        $newMorphemeEntry->appendChild($newMorphemeLink);

        $newMorphemeAffix = $morphemeXmlDoc->createElement("affix");
        $newMorphemeAffix->nodeValue = $phraseGloss;

        if(substr($phraseSource, 0, 1) == "-"){
            $newMorphemeAffix->nodeValue = "suffix";
        } elseif(substr($phraseSource, -1, 1) == "-"){
            $newMorphemeAffix->nodeValue = "prefix";
        } else{
            $newMorphemeAffix->nodeValue = "root";
        }
        $newMorphemeEntry->appendChild($newMorphemeAffix);

        $morphemeRootNode->appendChild($newMorphemeEntry);
    }
}

Here is what the script is searching through to create the new XML sheet:

<phrasicon>
<phrase id="4">
    <ref1>ES</ref1>
    <source>t̪o: xa jo: k'ala:</source>
    <morpheme>
      <m id="4.1">t̪o:</m>
      <m id="4.2">xa</m>
      <m id="4.3">jo:</m>
      <m id="4.4">k'ala:</m>
    </morpheme>
    <gloss lang="en">
      <g id="4.1">me</g>
      <g id="4.2">water</g>
      <g id="4.3">for</g>
      <g id="4.4">die</g>
    </gloss>
    <translation lang="en">I'm dying for water.</translation>
    <media1 mimeType="audio/wav" url="im_dying_for_water.wav"/>
    <ref2/>
    <media2 mimeType="" url=""/>
    <ref3/>
    <media3 mimeType="" url=""/>
  </phrase>
</phrasicon>

Here is what the new morpheme XML sheet ought to look like

<?xml version="1.0" encoding="UTF-8"?>
<morphemedatabase>
<morpheme>
  <source>t̪o:</source>
  <gloss>me</gloss>
  <root>t̪o:</root>
  <hypernym>t̪o:</hypernym>
  <link>P</link>
  <affix>root</affix>
</morpheme>
</morphemedatabase>

Upvotes: 0

Views: 821

Answers (2)

ThW

Reputation: 19512

I suspect that $morphemeEntries is a fixed list of SimpleXMLElement objects that is not updated when you add nodes. I suggest running the check against the $morphemeDB object instead. You can also replace the loop with an XPath expression:

$storySource = $entry->m;
$storyGloss = $entry->g;

$foundInDB = count(
  $morphemeDB->xpath(
    sprintf('.//morpheme[source="%s" and gloss="%s"]', $storySource, $storyGloss)
  )
) > 0; 
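
The stale-list behaviour described above can be reproduced in isolation. The following is only a minimal sketch: the inline XML and the two count variables are made up for illustration, and it assumes $morphemeEntries was obtained from xpath() before the loop, which the question does not show.

```php
<?php
// Hypothetical miniature morpheme database, for illustration only.
$morphemeDB = new SimpleXMLElement(
    '<morphemedatabase>'
    . '<morpheme><source>xa</source><gloss>water</gloss></morpheme>'
    . '</morphemedatabase>'
);

// xpath() returns a plain PHP array built at the time of the call.
$morphemeEntries = $morphemeDB->xpath('//morpheme');

// Add a second entry after the list was fetched.
$new = $morphemeDB->addChild('morpheme');
$new->addChild('source', 'jo:');
$new->addChild('gloss', 'for');

// The array fetched earlier does not know about the new node;
// a fresh query against the document does.
$staleCount = count($morphemeEntries);
$freshCount = count($morphemeDB->xpath('//morpheme'));

echo "stale: $staleCount, fresh: $freshCount\n";
```

If the list is fetched once before the loop, any morpheme added during the loop is invisible to later duplicate checks, which would explain duplicates within a single run.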

In DOM the same is possible with DOMXPath::evaluate():

$phraseSource = $xpathSource->evaluate('string(m)', $entry);
$phraseGloss = $xpathSource->evaluate('string(g)', $entry);

$foundInDB = $xpathTarget->evaluate(
  sprintf(
    'count(//morpheme[source="%s" and gloss="%s"]) > 0', 
    $phraseSource, 
    $phraseGloss
  )
);
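
As a self-contained sketch of the DOM variant: the inline document and the helper name morphemeExists() are assumptions made for this example and are not part of the code above. A boolean XPath expression passed to evaluate() comes back as a PHP bool.

```php
<?php
// Hypothetical target document standing in for the morpheme database.
$target = new DOMDocument();
$target->loadXML(
    '<morphemedatabase>'
    . '<morpheme><source>xa</source><gloss>water</gloss></morpheme>'
    . '</morphemedatabase>'
);
$xpathTarget = new DOMXPath($target);

// Hypothetical helper: true if a morpheme with this source and gloss exists.
// Assumes neither value contains a double quote, because XPath 1.0 string
// literals have no escape mechanism.
function morphemeExists(DOMXPath $xpath, string $source, string $gloss): bool
{
    $expression = sprintf(
        'count(//morpheme[source="%s" and gloss="%s"]) > 0',
        $source,
        $gloss
    );

    // evaluate() returns a typed result; this expression yields a boolean.
    return $xpath->evaluate($expression);
}

var_dump(morphemeExists($xpathTarget, 'xa', 'water'));
var_dump(morphemeExists($xpathTarget, "k'ala:", 'die'));
```

Because the query runs against the live document each time, entries appended earlier in the same run are found as well.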

In the DOM implementation you can nest createElement() into appendChild(), but you should add the content as text nodes (for proper escaping):

$newMorphemeEntry = $morphemeRootNode->appendChild(
  $morphemeXmlDoc->createElement("morpheme")
);
$newMorphemeEntry
  ->appendChild($morphemeXmlDoc->createElement("source"))
  ->appendChild($morphemeXmlDoc->createTextNode($phraseSource));
$newMorphemeEntry
  ->appendChild($morphemeXmlDoc->createElement("gloss"))
  ->appendChild($morphemeXmlDoc->createTextNode($phraseGloss));
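
To illustrate the escaping point, here is a small sketch with a made-up gloss value containing an ampersand. The value and the variable names are hypothetical. The text node is escaped on serialization and reads back unchanged.

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('morphemedatabase'));
$entry = $root->appendChild($doc->createElement('morpheme'));

// Hypothetical value with a character that is special in XML.
$value = 'salt & pepper';

// appendChild() returns the appended node, so the calls can be chained.
$gloss = $entry->appendChild($doc->createElement('gloss'));
$gloss->appendChild($doc->createTextNode($value));

// Serialize only the root element (no XML declaration).
$xml = $doc->saveXML($root);
echo $xml, "\n";

// Reading the text back gives the original, unescaped string.
$roundTrip = $gloss->textContent;
```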

Upvotes: 1

simonecosci

Reputation: 1214

Don't cast to (string) before comparing. Call the ->asXML() method on each element instead. Replace this:

if(((string)$source == (string)$storySource) && ((string)$gloss == (string)$storyGloss))

with this:

if(($source->asXML() == $storySource->asXML()) && ($gloss->asXML() == $storyGloss->asXML()))

or, to compare the contained string (excluding the tags):

if(($source->__toString() == $storySource->__toString()) && ($gloss->__toString() == $storyGloss->__toString()))

The problem is that a SimpleXMLElement is not a "classic" PHP object. SimpleXML is built as a "live" API linked to an internal representation of an XML document.

The manual page on Comparing Objects says "Two object instances are equal if they have the same attributes and values, and are instances of the same class."

In print_r() or var_dump() output, a SimpleXMLElement appears to have properties representing its child nodes and attributes. However, the actual implementation contains only a pointer into a memory structure created when the XML was parsed, which will be different even if you parse the same string twice. Thus simply comparing two SimpleXMLElement objects with == will never return true.
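
To make the difference between the string forms concrete, here is a minimal sketch. The tiny inline document is made up for illustration, with element names borrowed from the question. Note that asXML() includes the element's own tags, so two elements with the same text but different names serialize differently.

```php
<?php
// Hypothetical document holding the same text under two element names.
$doc = new SimpleXMLElement('<r><m>xa</m><source>xa</source></r>');

$m = $doc->m;
$source = $doc->source;

// Text content only: the (string) cast and __toString() give the same result.
$textEqual = ((string) $m === (string) $source);
$castMatchesToString = ((string) $m === $m->__toString());

// Serialized markup: includes the tags, so <m> and <source> differ.
$markupEqual = ($m->asXML() === $source->asXML());

var_dump($textEqual, $castMatchesToString, $markupEqual);
echo $m->asXML(), "\n";
echo $source->asXML(), "\n";
```

For the duplicate check in the question, where the new file uses <m> and the database uses <source>, comparing the text content is therefore the form that can match.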

Upvotes: 0
