Reputation: 720

Searching for a list of words in XML file in Python?

I have this XML file that contains more than 2000 phrases, below is a small sample.

<TEXT>

<PHRASE>
<V>played</V>
<N>John</N>
<PREP>with</PREP>
<en x='PERS'>Adam</en>
<PREP>in</PREP>
<en x='LOC'> ASL school</en>
</PHRASE>

<PHRASE>
<V y='0'>went</V>
<en x='PERS'>Mark</en>
<PREP>to</PREP>
<en x='ORG'>United Nations</en>
<PREP>for</PREP>
<PREP>a</PREP>
<N>visit</N>
</PHRASE>

<PHRASE>
<PREP>in</PREP>
<en x='DATE'>1987</en>
<en x='PERS'>Nick</en>
<V>founded</V>
<en x='ORG'>XYZ company</en>
</PHRASE>

<PHRASE>
<en x='ORG'>Google's</en>
<en x='PERS'>Frank</en>
<V>went</V>
<N>yesterday</N>
<PREP>to</PREP>
<en x='LOC'>San Fransisco</en>
</PHRASE>
</TEXT>

And I have a list of patterns:

 finalPatterns=['went \n to \n','created\n  the\n', 'founded\n a\n', 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']

What I want is to take each pattern in finalPatterns (for example, went to) and search for it in each phrase of the text. If a phrase contains both went AND to, then it should print the phrase's two <en> tags. [Note: if the en tags are not PERS and ORG, nothing is printed.]

When it searches for:

-"went" & "to" --> this is the output: Frank -San Fransisco
-"founded" & "in" --> output: Nick-XYZ Company

This is what I tried, but it didn't work: nothing was printed.

for phrase in root.findall('./PHRASE'):
 ens = {en.get('x'): en.text for en in phrase.findall('en')}
 if 'ORG' in ens and 'PERS' in ens:
   if all(word in phrase for word in finalPatterns):
      x="".join(phrase.itertext())   #print whats in between [since I would also like to print the whole sentence]
      print("ORG is: {}, PERS is: {} /".format(ens["ORG"],ens["PERS"]))

Upvotes: 0

Answers (2)

Parfait

Reputation: 107652

Consider XSLT, the special-purpose language for transforming XML documents, to handle your search: it can rewrite the original XML according to matched values.

The XSLT below is embedded in Python so that the finalPatterns list dynamically builds the filter that removes unmatched elements. Python then transforms the original document (using the lxml module), and you can use that output for your end-use needs.

Python Script

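import os
import lxml.etree as ET

# Assumption: the XML file sits next to this script; adjust cd as needed
cd = os.path.dirname(os.path.abspath(__file__))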

finalPatterns=['went \n to \n','created\n  the\n', 'founded\n a\n', 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']

# BUILDING XSLT FILTER STRING
contains = ''
for p in finalPatterns:
    contains += "("
    for i in p.split('\n '):
        contains += "contains(., '{}') and \n".format(i.replace('\n', '').strip(' '))    
    contains += ")"
    contains = contains.replace(' and \n)', ') or ')

xslstr = '''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
            <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
            <xsl:strip-space elements="*"/>

              <!-- Identity Transform -->
              <xsl:template match="@*|node()">
                <xsl:copy>
                  <xsl:apply-templates select="@*|node()"/>
                </xsl:copy>
              </xsl:template>

               <!-- Rewrites Matching Phrase elements -->
               <xsl:template match="PHRASE">
                <xsl:copy>      
                  <wholetext>
                    <xsl:call-template name="join">
                      <xsl:with-param name="valueList" select="*"/>
                      <xsl:with-param name="separator" select="' '"/>
                    </xsl:call-template>
                  </wholetext>

                  <xsl:choose>
                      <xsl:when test="contains(., 'went') and contains(., 'to')">
                        <match>went to</match>
                      </xsl:when>
                      <xsl:when test="contains(., 'founded') and contains(., 'in')">
                        <match>founded in</match>
                      </xsl:when>
                      <xsl:when test="contains(., 'created') and contains(., 'the')">
                        <match>created the</match>
                      </xsl:when>
                      <xsl:when test="contains(., 'a') and contains(., 'visit')">
                        <match>a visit</match>
                      </xsl:when>
                  </xsl:choose>
                  <person><xsl:value-of select="en[@x='PERS']"/></person>
                  <organization><xsl:value-of select="en[@x='ORG']"/></organization>
                  <location><xsl:value-of select="en[@x='LOC']"/></location>
                </xsl:copy>
              </xsl:template>

              <!-- Rewrites Unmatched Phrase elements -->
              <xsl:template match="PHRASE[not({0})]"/>

              <!-- Join Text values -->
              <xsl:template name="join">
                <xsl:param name="valueList" select="''"/>
                <xsl:param name="separator" select="','"/>
                <xsl:for-each select="$valueList">
                  <xsl:choose>
                    <xsl:when test="position() = 1">
                      <xsl:value-of select="."/>
                    </xsl:when>
                    <xsl:otherwise>
                      <xsl:value-of select="concat($separator, .) "/>
                    </xsl:otherwise>
                  </xsl:choose>
                </xsl:for-each>
              </xsl:template>

            </xsl:transform>'''.format(contains[:-4])    

dom = ET.parse(os.path.join(cd, 'SearchWords.xml'))
xslt = ET.fromstring(xslstr)

transform = ET.XSLT(xslt)
newdom = transform(dom)

tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True)
print(tree_out.decode("utf-8"))

for phrase in newdom.findall('PHRASE'):    
    print("Text: {} \n ORG is: {}, PERS is: {} /".format(phrase.find('wholetext').text,
                                                         phrase.find('organization').text,
                                                          phrase.find('person').text))

Output

The transformed XML is included below for demonstration. The tree_out string can be saved externally as a new XML file.

<TEXT>
  <PHRASE>
    <wholetext>went Mark to United Nations for a visit</wholetext>
    <person>Mark</person>
    <organization>United Nations</organization>
    <location/>
  </PHRASE>
  <PHRASE>
    <wholetext>in 1987 Nick founded XYZ company</wholetext>
    <person>Nick</person>
    <organization>XYZ company</organization>
    <location/>
  </PHRASE>
  <PHRASE>
    <wholetext>Google's Frank went yesterday to San Fransisco</wholetext>
    <person>Frank</person>
    <organization>Google's</organization>
    <location>San Fransisco</location>
  </PHRASE>
</TEXT>

Text: went Mark to United Nations for a visit 
 ORG is: United Nations, PERS is: Mark /
Text: in 1987 Nick founded XYZ company 
 ORG is: XYZ company, PERS is: Nick /
Text: Google's Frank went yesterday to San Fransisco 
 ORG is: Google's, PERS is: Frank /
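
To make the filter-building step in the script above easier to follow, here is a small standalone sketch of the same idea: each pattern becomes a group of contains() tests joined with "and", and the groups are joined with "or". The helper name build_predicate is hypothetical, and the sketch assumes the pattern words contain no single quotes.

```python
# Patterns as given in the question (stray spaces and newlines included).
final_patterns = ['went \n to \n', 'created\n  the\n', 'founded\n a\n',
                  'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']


def build_predicate(patterns):
    """Return an XPath 1.0 boolean expression (hypothetical helper).

    Words inside one pattern are combined with "and";
    the patterns themselves are combined with "or".
    Assumes no word contains a single quote.
    """
    groups = []
    for pattern in patterns:
        words = pattern.split()  # split() drops the spaces and newlines
        tests = ["contains(., '{}')".format(word) for word in words]
        groups.append("(" + " and ".join(tests) + ")")
    return " or ".join(groups)


predicate = build_predicate(final_patterns)
print(predicate)
```

Keep in mind that XPath's contains() is a substring test, so a short word such as a or to also matches inside longer words.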

List Comprehension

See the list comprehension attempt below using xpath. However, the challenge is that the words in your finalPatterns entries need not appear contiguously in the text. For instance, the pattern went \n to can appear with words in between, as in went \n Mark \n to. If you include only one keyword per element of the list, then the code below can work. Otherwise, consider regex for pattern recognition.

dom = ET.parse(os.path.join(cd, 'Input.xml'))

phraselist = dom.xpath('//PHRASE')    
for phrase in phraselist:    
    if any(word in p for p in phrase.xpath('./*/text()') for word in finalPatterns):
        print(' '.join(phrase.xpath('./*/text()')))
        print('ORG is: {0}, PERS is: {1}'.format(phrase.xpath("./en[@x='ORG']")[0].text, \
                                                 phrase.xpath("./en[@x='PERS']")[0].text))
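
As a rough illustration of the regex suggestion, the sketch below turns each multi-word pattern into a regular expression that allows other words in between, then searches the joined text of each phrase. It uses the standard library's xml.etree.ElementTree rather than lxml, embeds a trimmed copy of the sample XML, and the helper name pattern_to_regex is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of two phrases from the question's sample.
xml_text = """<TEXT>
<PHRASE><V>went</V><en x='PERS'>Mark</en><PREP>to</PREP><en x='ORG'>United Nations</en><PREP>for</PREP><PREP>a</PREP><N>visit</N></PHRASE>
<PHRASE><PREP>in</PREP><en x='DATE'>1987</en><en x='PERS'>Nick</en><V>founded</V><en x='ORG'>XYZ company</en></PHRASE>
</TEXT>"""

final_patterns = ['went \n to \n', 'founded\n in\n', 'a\n visit\n']


def pattern_to_regex(pattern):
    """Build a regex matching the pattern's words in order, gaps allowed."""
    words = [re.escape(word) for word in pattern.split()]
    return re.compile(r"\b" + r"\b.*\b".join(words) + r"\b")


root = ET.fromstring(xml_text)
matches = []
for phrase in root.findall('PHRASE'):
    # Join the text of every child element into one sentence.
    sentence = ' '.join(child.text.strip() for child in phrase if child.text)
    for pattern in final_patterns:
        if pattern_to_regex(pattern).search(sentence):
            matches.append((' '.join(pattern.split()), sentence))

for label, sentence in matches:
    print("{!r} found in: {}".format(label, sentence))
```

Because this regex enforces word order, founded in does not match "in 1987 Nick founded XYZ company". If order should not matter, an all-words membership check is the simpler fit.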

Upvotes: 1

DisappointedByUnaccountableMod

Reputation: 6826

This should do the trick:

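# Run this inside the question's loop: for phrase in root.findall('./PHRASE')
phrasewords = [w.text for w in phrase.findall('V')+phrase.findall('N')+phrase.findall('PREP')]
for words in finalPatterns:
    # split() drops the spaces and newlines, leaving just the words of the pattern
    if all(word in phrasewords for word in words.split()):
        print("found")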
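
For completeness, here is a self-contained sketch of how that check might slot into the question's loop, using only the standard library. It assumes a well-formed version of the sample XML (embedded as a string here), keeps the question's rule that a phrase needs both a PERS and an ORG tag, and uses a hypothetical function name, find_matches.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the question's sample, embedded for the sketch.
xml_text = """<TEXT>
<PHRASE><V>played</V><N>John</N><PREP>with</PREP><en x='PERS'>Adam</en><PREP>in</PREP><en x='LOC'>ASL school</en></PHRASE>
<PHRASE><V y='0'>went</V><en x='PERS'>Mark</en><PREP>to</PREP><en x='ORG'>United Nations</en><PREP>for</PREP><PREP>a</PREP><N>visit</N></PHRASE>
<PHRASE><PREP>in</PREP><en x='DATE'>1987</en><en x='PERS'>Nick</en><V>founded</V><en x='ORG'>XYZ company</en></PHRASE>
<PHRASE><en x='ORG'>Google's</en><en x='PERS'>Frank</en><V>went</V><N>yesterday</N><PREP>to</PREP><en x='LOC'>San Fransisco</en></PHRASE>
</TEXT>"""

final_patterns = ['went \n to \n', 'created\n  the\n', 'founded\n a\n',
                  'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']


def find_matches(root, patterns):
    """Return (pattern, PERS, ORG) for each phrase containing all pattern words."""
    results = []
    for phrase in root.findall('PHRASE'):
        ens = {en.get('x'): en.text.strip() for en in phrase.findall('en')}
        if 'PERS' not in ens or 'ORG' not in ens:
            continue  # the question only wants phrases with both tags
        # Collect the plain words (verbs, nouns, prepositions) of the phrase.
        phrase_words = {w.text.strip()
                        for tag in ('V', 'N', 'PREP')
                        for w in phrase.findall(tag)}
        for pattern in patterns:
            words = pattern.split()  # strips spaces and newlines
            if all(word in phrase_words for word in words):
                results.append((' '.join(words), ens['PERS'], ens['ORG']))
    return results


results = find_matches(ET.fromstring(xml_text), final_patterns)
for pattern, pers, org in results:
    print("{}: PERS is: {}, ORG is: {}".format(pattern, pers, org))
```

This compares whole words rather than substrings, and it ignores word order, so founded in matches the phrase where in comes first.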

Upvotes: 1
