Reputation: 720
I have an XML file that contains more than 2000 phrases; below is a small sample.
<TEXT>
<PHRASE>
<V>played</V>
<N>John</N>
<PREP>with</PREP>
<en x='PERS'>Adam</en>
<PREP>in</PREP>
<en x='LOC'>ASL school</en>
</PHRASE>
<PHRASE>
<V y='0'>went</V>
<en x='PERS'>Mark</en>
<PREP>to</PREP>
<en x='ORG'>United Nations</en>
<PREP>for</PREP>
<PREP>a</PREP>
<N>visit</N>
</PHRASE>
<PHRASE>
<PREP>in</PREP>
<en x='DATE'>1987</en>
<en x='PERS'>Nick</en>
<V>founded</V>
<en x='ORG'>XYZ company</en>
</PHRASE>
<PHRASE>
<en x='ORG'>Google's</en>
<en x='PERS'>Frank</en>
<V>went</V>
<N>yesterday</N>
<PREP>to</PREP>
<en x='LOC'>San Fransisco</en>
</PHRASE>
</TEXT>
And I have a list of patterns:
finalPatterns=['went \n to \n','created\n the\n', 'founded\n a\n', 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']
What I want is to take each pattern in finalPatterns (for example, went to) and search for it in each phrase of the text. If a phrase contains both went AND to, then its two <en> tags should be printed. [Note: if the phrase's en tags are not PERS and ORG, nothing is printed.]
When it searches for:
-"went" & "to" --> this is the output: Frank -San Fransisco
-"founded" & "in" --> output: Nick-XYZ Company
This is what I tried, but it didn't work: nothing was printed.
for phrase in root.findall('./PHRASE'):
    ens = {en.get('x'): en.text for en in phrase.findall('en')}
    if 'ORG' in ens and 'PERS' in ens:
        if all(word in phrase for word in finalPatterns):
            x = "".join(phrase.itertext())  # print what's in between [since I would also like to print the whole sentence]
            print("ORG is: {}, PERS is: {} /".format(ens["ORG"], ens["PERS"]))
Upvotes: 0
Views: 1733
Reputation: 107652
Consider XSLT, the special-purpose language for transforming XML documents, to handle your search: it can rewrite the original XML according to the matched values.
Below, XSLT is embedded in Python to dynamically remove unmatched elements using the finalPatterns list. Python then transforms the original document (using the lxml module), and you can use the output for your end-use needs.
Python Script
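import os
import lxml.etree as ET

# Directory holding the input XML (assumption: the same folder as this script)
cd = os.path.dirname(os.path.abspath(__file__))

finalPatterns=['went \n to \n','created\n the\n', 'founded\n a\n', 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']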
# BUILDING XSLT FILTER STRING
contains = ''
for p in finalPatterns:
    contains += "("
    for i in p.split('\n '):
        contains += "contains(., '{}') and \n".format(i.replace('\n', '').strip(' '))
    contains += ")"
    contains = contains.replace(' and \n)', ') or ')
xslstr = '''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
  <xsl:strip-space elements="*"/>

  <!-- Identity Transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rewrites Matching Phrase elements -->
  <xsl:template match="PHRASE">
    <xsl:copy>
      <wholetext>
        <xsl:call-template name="join">
          <xsl:with-param name="valueList" select="*"/>
          <xsl:with-param name="separator" select="' '"/>
        </xsl:call-template>
      </wholetext>
      <!-- Only the first branch whose test succeeds is applied -->
      <xsl:choose>
        <xsl:when test="contains(., 'went') and contains(., 'to')">
          <match>went to</match>
        </xsl:when>
        <xsl:when test="contains(., 'founded') and contains(., 'in')">
          <match>founded in</match>
        </xsl:when>
        <xsl:when test="contains(., 'created') and contains(., 'the')">
          <match>created the</match>
        </xsl:when>
        <xsl:when test="contains(., 'a') and contains(., 'visit')">
          <match>a visit</match>
        </xsl:when>
      </xsl:choose>
      <person><xsl:value-of select="en[@x='PERS']"/></person>
      <organization><xsl:value-of select="en[@x='ORG']"/></organization>
      <location><xsl:value-of select="en[@x='LOC']"/></location>
    </xsl:copy>
  </xsl:template>

  <!-- Rewrites Unmatched Phrase elements -->
  <xsl:template match="PHRASE[not({0})]"/>

  <!-- Join Text values -->
  <xsl:template name="join">
    <xsl:param name="valueList" select="''"/>
    <xsl:param name="separator" select="','"/>
    <xsl:for-each select="$valueList">
      <xsl:choose>
        <xsl:when test="position() = 1">
          <xsl:value-of select="."/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="concat($separator, .) "/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>

</xsl:transform>'''.format(contains[:-4])
# PARSE THE SOURCE XML AND APPLY THE TRANSFORMATION
dom = ET.parse(os.path.join(cd, 'SearchWords.xml'))
xslt = ET.fromstring(xslstr)
transform = ET.XSLT(xslt)
newdom = transform(dom)

# PRINT THE TRANSFORMED XML
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True)
print(tree_out.decode("utf-8"))

# PRINT THE SENTENCE AND ENTITIES OF EACH REMAINING PHRASE
for phrase in newdom.findall('PHRASE'):
    print("Text: {} \n ORG is: {}, PERS is: {} /".format(phrase.find('wholetext').text,
                                                         phrase.find('organization').text,
                                                         phrase.find('person').text))
Output
The transformed XML is included below for demonstration. The tree_out string can be saved externally as a new XML file.
<TEXT>
<PHRASE>
<wholetext>went Mark to United Nations for a visit</wholetext>
<person>Mark</person>
<organization>United Nations</organization>
<location/>
</PHRASE>
<PHRASE>
<wholetext>in 1987 Nick founded XYZ company</wholetext>
<person>Nick</person>
<organization>XYZ company</organization>
<location/>
</PHRASE>
<PHRASE>
<wholetext>Google's Frank went yesterday to San Fransisco</wholetext>
<person>Frank</person>
<organization>Google's</organization>
<location>San Fransisco</location>
</PHRASE>
</TEXT>
Text: went Mark to United Nations for a visit
ORG is: United Nations, PERS is: Mark /
Text: in 1987 Nick founded XYZ company
ORG is: XYZ company, PERS is: Nick /
Text: Google's Frank went yesterday to San Fransisco
ORG is: Google's, PERS is: Frank /
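To make the filter-building step of the script easier to follow, here is a minimal standalone sketch of the XPath expression it generates. The helper name build_filter is hypothetical (it is not in the script above), and it splits on any whitespace rather than on '\n ', so the result should be equivalent to the script's string only up to whitespace.

```python
final_patterns = ['went \n to \n', 'created\n the\n', 'founded\n a\n',
                  'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']


def build_filter(patterns):
    """Return an XPath 1.0 boolean expression (hypothetical helper).

    Each pattern becomes one parenthesised group of contains() tests
    joined with 'and'; the groups are joined with 'or'.
    """
    groups = []
    for pattern in patterns:
        words = pattern.split()  # split on any whitespace, dropping the newlines
        tests = " and ".join("contains(., '{}')".format(w) for w in words)
        groups.append("(" + tests + ")")
    return " or ".join(groups)


xpath_filter = build_filter(final_patterns)
print(xpath_filter)
```

Note that contains() is a substring test, so a keyword such as a matches any phrase whose text contains the letter a, and a keyword containing an apostrophe would break the quoting in this sketch.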
List Comprehension
See the list comprehension attempt below using xpath. The challenge, however, is that the entries in finalPatterns do not appear as contiguous matches. For instance, the text may contain went \n to with words in between, as in went \n Mark \n to. If you include only one keyword per element of the list, then the code below can work. Otherwise, consider regex for pattern recognition.
dom = ET.parse(os.path.join(cd, 'Input.xml'))
phraselist = dom.xpath('//PHRASE')
for phrase in phraselist:
    if any(word in p for p in phrase.xpath('./*/text()') for word in finalPatterns):
        print(' '.join(phrase.xpath('./*/text()')))
        print('ORG is: {0}, PERS is: {1}'.format(phrase.xpath("./en[@x='ORG']")[0].text,
                                                 phrase.xpath("./en[@x='PERS']")[0].text))
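As one possible reading of the regex suggestion, the sketch below turns each multi-word pattern into a regular expression with one lookahead per keyword, so the keywords must all occur as whole words but may appear in any order with other words in between. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs on its own; the SAMPLE string, the shortened pattern list, and the helper names pattern_to_regex and matching_patterns are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Two phrases from the question's sample (with the broken tags repaired)
SAMPLE = """<TEXT>
<PHRASE><V>went</V><en x='PERS'>Mark</en><PREP>to</PREP><en x='ORG'>United Nations</en><PREP>for</PREP><PREP>a</PREP><N>visit</N></PHRASE>
<PHRASE><PREP>in</PREP><en x='DATE'>1987</en><en x='PERS'>Nick</en><V>founded</V><en x='ORG'>XYZ company</en></PHRASE>
</TEXT>"""

final_patterns = ['went \n to \n', 'founded\n a\n', 'a\n visit\n', 'founded\n in\n']


def pattern_to_regex(pattern):
    """One lookahead per keyword: every keyword must occur as a whole word, in any order."""
    lookaheads = "".join(r"(?=.*\b{}\b)".format(re.escape(w)) for w in pattern.split())
    return re.compile(lookaheads)


def matching_patterns(sentence, patterns):
    """Return the patterns (whitespace-normalised) whose keywords all occur in sentence."""
    return [" ".join(p.split()) for p in patterns if pattern_to_regex(p).match(sentence)]


root = ET.fromstring(SAMPLE)
results = []
for phrase in root.findall('PHRASE'):
    # Rebuild the sentence from the text of each child element
    sentence = " ".join(child.text.strip() for child in phrase)
    results.append((sentence, matching_patterns(sentence, final_patterns)))

for sentence, hits in results:
    print(sentence, '->', hits)
```

Because of the \b word boundaries, the keyword a no longer matches inside words such as company, which the substring test 'a' in text would do.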
Upvotes: 1
Reputation: 6826
This should do the trick:
# Assumed to run inside the question's loop over phrases
for phrase in root.findall('./PHRASE'):
    phrasewords = [w.text for w in phrase.findall('V')+phrase.findall('N')+phrase.findall('PREP')]
    for words in finalPatterns:
        if all(word in phrasewords for word in words.split()):
            print("found")
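For completeness, here is a self-contained sketch of how the check above might be combined with the PERS/ORG filter from the question. It compares keywords against the text of the V, N and PREP elements, whereas the question's code tested word in phrase against the Element itself. The SAMPLE string (the question's sample with its broken tags repaired) and the function name find_matches are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<TEXT>
<PHRASE><V>played</V><N>John</N><PREP>with</PREP><en x='PERS'>Adam</en><PREP>in</PREP><en x='LOC'>ASL school</en></PHRASE>
<PHRASE><V y='0'>went</V><en x='PERS'>Mark</en><PREP>to</PREP><en x='ORG'>United Nations</en><PREP>for</PREP><PREP>a</PREP><N>visit</N></PHRASE>
<PHRASE><PREP>in</PREP><en x='DATE'>1987</en><en x='PERS'>Nick</en><V>founded</V><en x='ORG'>XYZ company</en></PHRASE>
<PHRASE><en x='ORG'>Google's</en><en x='PERS'>Frank</en><V>went</V><N>yesterday</N><PREP>to</PREP><en x='LOC'>San Fransisco</en></PHRASE>
</TEXT>"""

final_patterns = ['went \n to \n', 'created\n the\n', 'founded\n a\n',
                  'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']


def find_matches(root, patterns):
    """Return (pattern, PERS, ORG) for each phrase that has both entities and all keywords."""
    found = []
    for phrase in root.findall('PHRASE'):
        ens = {en.get('x'): en.text.strip() for en in phrase.findall('en')}
        if 'PERS' not in ens or 'ORG' not in ens:
            continue  # the question only wants phrases with both PERS and ORG
        # Words of the phrase outside the <en> tags
        phrasewords = {w.text.strip()
                       for tag in ('V', 'N', 'PREP')
                       for w in phrase.findall(tag)}
        for pattern in patterns:
            if all(word in phrasewords for word in pattern.split()):
                found.append((" ".join(pattern.split()), ens['PERS'], ens['ORG']))
    return found


matches = find_matches(ET.fromstring(SAMPLE), final_patterns)
for pattern, pers, org in matches:
    print("{}: PERS is: {}, ORG is: {}".format(pattern, pers, org))
```

A phrase can match more than one pattern, so the same PERS/ORG pair may be reported several times; deduplicating is left to the caller.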
Upvotes: 1