Reputation: 23
<?xml-stylesheet type="text/css" href="home.css"?>
<Header type="text">
<encodingDesc>
<samplingDesc>Samples taken from page 10-11,20-21,38-39, 54-55, 70-71, 80-81, 98-99, 122-123, 142-143, 148-149, 162-163, 174-175 </samplingDesc>
</encodingDesc>
<sourceDesc>
<mainContent>
<source> Abhinesh
<category>Natural, Physical and Professional Sciences</category>
<subcategory>Textile Technology</subcategory>
<text> Book </text>
<title> cloths </title>
<vol> 1 </vol>
<issue/>
</source>
<textDes>
<type/>
<headline/>
<author> V. Nurjan </author>
<translator/>
<words>3364</words>
</textDes>
</mainContent>
</sourceDesc>
<profileDesc>
<creation>
<date> 21-Dec-2010 </date>
<inputter> Abhinesh </inputter>
</creation>
<langUsage> Telugu </langUsage>
<textClass>
<channel mode="w"> print </channel>
<domain type="public"/>
</textClass>
</profileDesc>
</Header>
I have checked the examples I could find online, but they only give code for simple XML files, not for a file of this type. How can I extract the tagged data from such an XML file?
Upvotes: 0
Views: 691
Reputation: 10514
You could use a simple XSL transformation for this. To extract all the text content into a text file, you could use the following XSL stylesheet:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Drop whitespace-only text nodes from the source document -->
  <xsl:strip-space elements="*" />
  <xsl:output method="text" encoding="UTF-8" />
  <xsl:template match="node()">
    <!-- Emit the node's first text child, if it is non-empty, on its own line -->
    <xsl:if test="boolean(normalize-space(text()))">
      <xsl:value-of select="normalize-space(text())" />
      <xsl:text>&#10;</xsl:text>
    </xsl:if>
    <!-- Recurse into the child nodes -->
    <xsl:apply-templates select="node()"/>
  </xsl:template>
</xsl:stylesheet>
To run this stylesheet you need an XSLT processor such as Saxon, or xsltproc if you use a Unix-like operating system.
You could also test it easily with IE, Firefox or any other browser you want.
Just save the stylesheet in the same folder as your XML source file, for example as test.xsl,
and then change the header of your XML file from
<?xml-stylesheet type="text/css" href="home.css"?>
to
<?xml-stylesheet type="text/xsl" href="test.xsl"?>
The output will then look like this:
Samples taken from page 10-11,20-21,38-39, 54-55, 70-71, 80-81, 98-99, 122-123, 142-143, 148-149, 162-163, 174-175
Abhinesh
Natural, Physical and Professional Sciences
Textile Technology
Book
cloths
1
V. Nurjan
3364
21-Dec-2010
Abhinesh
Telugu
print
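If no XSLT processor is at hand, roughly the same extraction can be sketched in Python with the standard library's xml.etree.ElementTree. This is an alternative technique, not the stylesheet above: it collects the text that precedes the first child of each element, which matches the stylesheet's output for a file shaped like the one in the question. The SAMPLE string is a shortened stand-in for the real file, and the function name extract_texts is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample modelled on the question's file (assumption: the real
# file is well-formed XML with a single <Header> root element).
SAMPLE = """<?xml-stylesheet type="text/css" href="home.css"?>
<Header type="text">
<sourceDesc>
<mainContent>
<source> Abhinesh
<category>Natural, Physical and Professional Sciences</category>
<subcategory>Textile Technology</subcategory>
<issue/>
</source>
</mainContent>
</sourceDesc>
<profileDesc>
<langUsage> Telugu </langUsage>
<textClass>
<channel mode="w"> print </channel>
<domain type="public"/>
</textClass>
</profileDesc>
</Header>"""


def extract_texts(xml_string):
    """Return the non-empty leading text of every element, in document order."""
    root = ET.fromstring(xml_string)
    texts = []
    for element in root.iter():
        # " ".join(split()) collapses whitespace much like XPath normalize-space().
        text = " ".join((element.text or "").split())
        if text:
            texts.append(text)
    return texts


if __name__ == "__main__":
    print("\n".join(extract_texts(SAMPLE)))
```

To run it on a real file, the string could be read from disk first and passed to extract_texts; empty elements such as issue and domain are skipped, just as in the stylesheet's output.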
Upvotes: 1