inspectorG4dget

Reputation: 114035

XPath: Select Current and Next Node's text by Current Node Attributes

This is a follow-up to my previous question. I am posting it again on the advice of the person whose answer I accepted in the original post, who felt that the question was not properly defined. Here goes attempt 2:

I am trying to get information out of this webpage. For clarity, here is one block of the page source:

<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
                    <span class='distribution'>(SCI)</span></p> 
<span class='normaltext'> 
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is  directed  to answering the question: What makes us human? This course is a survey of  biological  anthropology and  archaeology.  [<span class='Helpcourse'
        onMouseover="showtip(this,event,'24 Lectures')"
        onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
        onMouseover="showtip(this,event,'12 Tutorials')"
        onMouseout="hidetip()">12T</span>]<br> 
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br> 
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br> 


From the sample block above, I would like to extract the following information:

  1. ANT101H5 Introduction to Biological Anthropology and Archaeology
  2. Exclusion: ANT100Y5
  3. Prerequisite: ANT102H5

I would like to get this information for every course on the webpage (keep in mind that some courses may also list a "Corequisite", and some may list no prerequisites, corequisites, or exclusions at all).

I have been trying to write an appropriate XPath expression for this task, but I can't seem to get it quite right.

Thus far, with the help of Dimitre Novatchev, I have been able to use the following expression:

sites = hxs.select("(//p[@class='titlestyle'])[2]/text()[1] | (//span[@class='title2'])[2]/text() | \
                    (//span[@class='title2'])[2]/following-sibling::a[1]/text() | (//span[@class='title2'])[3]/text() | \
                    (//span[@class='title2'])[3]/following-sibling::a[1]/text()")

However, it produces the following output, which seems to get the information for only the first course on the page:

[{"desc": "ANT101H5 Introduction to Biological Anthropology and Archaeology \n                        "},
 {"desc": "Exclusion: "},
 {"desc": "ANT100Y5"},
 {"desc": "Prerequisite: "},
 {"desc": "ANT102H5"}]

Just to be absolutely clear, this output is correct only insofar as it contains the right information for the first course. I need the same information for all courses listed on that webpage.

I'm so close but I don't seem to be able to figure out that last step.

I'd appreciate any help... thanks in advance

Upvotes: 2

Views: 2001

Answers (2)

kevpie

Reputation: 26108

Instead of [<int>], try something like [position() mod <offset> = <base>]

The offset is the distance between consecutive nodes you are interested in. It may differ between @class='titlestyle' and @class='title2'.

ites = hxs.select("(//p[@class='titlestyle'])[position() mod <offset to next to match> = 2]/text()[1] | (//span[@class='title2'])[position() mod <offset to next to match> = 2]/text() | \
                    (//span[@class='title2'])[position() mod <offset to next to match> = 2]/following-sibling::a[1]/text() | (//span[@class='title2'])[position() mod <offset to next to match> = 3]/text() | \
                    (//span[@class='title2'])[position() mod <offset to next to match> = 3]/following-sibling::a[1]/text()")
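
As a rough illustration of what a [position() mod <offset> = <base>] predicate keeps, the following Python sketch applies the same arithmetic to a plain list. The function name select_by_position_mod and the sample letters are made up for illustration; they are not part of Scrapy or of any XPath library. XPath positions start at 1, so the base has to be a value that position() mod <offset> can actually produce (0 up to offset - 1).

```python
def select_by_position_mod(nodes, offset, base):
    """Keep items whose 1-based position satisfies position % offset == base."""
    return [node for pos, node in enumerate(nodes, start=1) if pos % offset == base]


# Hypothetical stand-ins for the nodes matched by an unconstrained query.
matches = ["a", "b", "c", "d", "e", "f", "g"]

every_third_from_second = select_by_position_mod(matches, 3, 2)  # positions 2 and 5
every_third_from_third = select_by_position_mod(matches, 3, 0)   # positions 3 and 6
never_matches = select_by_position_mod(matches, 3, 3)            # pos % 3 is never 3
```

Note that with an offset of 3, a base of 3 selects nothing; the third item of every group corresponds to a base of 0.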

EDIT: As requested.

Run each individual XPath one at a time, without constraining its position. This is a manual fact-finding exercise to determine the final values to use in the XPath.

Return all nodes matching the following xpath (this is the first one).

ites = hxs.select("(//p[@class='titlestyle'])/text()[1]")

ites will contain some nodes that you want and some that you do not.

You have already determined that, for this one, the 2nd match is the first node you want. Now count the distance from it to the next match in ites that this rule should select. This distance is what we can refer to as <offset to next to match>.

Now repeat the above for each of the remaining xpath searches.

Think of hxs.select("") as a filter: as it walks the XML, every single node that matches your XPath is returned.

Here is an example http://zvon.org/xxl/XPathTutorial/Output/example22.html
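
The manual counting step described above could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the helper names wanted_positions and gaps, the COURSE_CODE rule for deciding which matches are wanted, and the sample strings standing in for the text of the matched nodes. If the distances between wanted matches are not all equal, a single fixed offset will not fit that query.

```python
import re

# Hypothetical rule for "a match I want": text starting with a course code such as ANT101H5.
COURSE_CODE = re.compile(r"^[A-Z]{3}\d{3}[HY]\d\b")


def wanted_positions(matches, is_wanted):
    """Return the 1-based positions (as XPath counts them) of the wanted matches."""
    return [pos for pos, text in enumerate(matches, start=1) if is_wanted(text)]


def gaps(positions):
    """Distances between consecutive wanted positions."""
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


# Sample strings standing in for the result of an unconstrained query.
matches = [
    "Anthropology",
    "ANT101H5 Introduction to Biological Anthropology and Archaeology",
    "ANT102H5 Introduction to Sociocultural and Linguistic Anthropology",
    "ANT200Y5 World Archaeology and Prehistory",
]

positions = wanted_positions(
    matches, lambda text: COURSE_CODE.match(text.strip()) is not None
)
distances = gaps(positions)

# One repeated distance is a usable offset; mixed distances mean no fixed offset fits.
has_fixed_offset = len(set(distances)) == 1
```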

Upvotes: 0

Dimitre Novatchev

Reputation: 243599

The required single XPath expression to select the relevant data for all courses is quite messy, so here I am taking another approach, which can be used (if necessary at all) to produce that single XPath expression:

This simple XSLT transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="p[@class='titlestyle']">
  <xsl:text>&#xA;===================&#xA;</xsl:text>
  <xsl:value-of select="text()[1]"/>
 </xsl:template>

 <xsl:template match=
  "span/span[@class='title2'][not(position() >1)]">
   <xsl:text>&#xA;</xsl:text>
   <xsl:value-of select="."/>
   <xsl:value-of select="following-sibling::a[1]"/>

   <xsl:if test="not(following-sibling::a)">
    <xsl:value-of select="following-sibling::text()[1]"/>
   </xsl:if>
   <xsl:text>&#xA;</xsl:text>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

when applied on the page at: http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html (tidied up to become a well-formed XML document), produces the wanted result:

===================
Anthropology
===================
ANT101H5 Introduction to Biological Anthropology and Archaeology

Exclusion: ANT100Y5

===================
ANT102H5 Introduction to Sociocultural and Linguistic Anthropology

Exclusion: ANT100Y5

===================
ANT200Y5 World Archaeology and Prehistory

Prerequisite: 101H5

===================
ANT203Y5 Biological Anthropology

Prerequisite: 101H5

===================
ANT204Y5 Sociocultural Anthropology

Prerequisite: 101H5

===================
ANT205H5 Introduction to Forensic Anthropology

Prerequisite: 101H5

===================
ANT206Y5 Culture and Communication: Introduction to Linguistic Anthropology

Exclusion: ANT206H5

===================
ANT241Y5 Aboriginal Peoples of North America

===================
ANT299Y5 Research Opportunity Program

===================
ANT304H5 Anthropology and Aboriginal Peoples

Exclusion: ANT304Y5

===================
ANT306H5 Forensic Anthropology Field School

Prerequisite: ANT205H5

===================
ANT308H5 Case Studies in Archaeological Botany and Zoology

Prerequisite: ANT200Y5

===================
ANT309H5 Southeast Asian Archaeology

Prerequisite: ANT200Y5

===================
ANT310H5 Complex Societies

Prerequisite: ANT200Y5

===================
ANT312H5 Archaeological Analysis

Prerequisite: ANT200Y5

===================
ANT313H5 China, Korea and Japan in Prehistory

Prerequisite: ANT200Y5

===================
ANT314H5 Archaeological Theory

Exclusion: ANT411H5

===================
ANT316H5 South Asian Archaeology

Prerequisite: ANT200Y5

===================
ANT317H5 Archaeology of Eastern North America

Prerequisite: ANT200Y5

===================
ANT318H5 Archaeological Fieldwork

Prerequisite: ANT200Y5

===================
ANT320H5 Archaeological Approaches to Technology

Prerequisite: ANT200Y5

===================
ANT322H5 Anthropology of Youth Culture

Exclusion: ANT204Y5

===================
ANT327H5 Agricultural Origins:  The Second Revolution

Prerequisite: ANT200Y5

===================
ANT331H5 The Biology of Human Sexuality

Exclusion: ANT330H5

===================
ANT332H5 Human Origins

Exclusion: ANT332Y5

===================
ANT333H5 Human Origins II

Exclusion: ANT332Y5

===================
ANT334H5 Human Osteology

Exclusion: ANT334Y5

===================
ANT335H5 Anthropology of Gender

Exclusion: ANT331Y5

===================
ANT336H5 Molecular Anthropology

Prerequisite: ANT203Y5

===================
ANT338H5 Laboratory Methods in Biological Anthropology

Prerequisite: ANT203Y5

===================
ANT339Y5 Human Adaptation through Biological and Cultural Means

Prerequisite: ANT203Y5

===================
ANT340H5 Osteological Theory

Exclusion: ANT334Y5

===================
ANT350H5 Globalization and the Changing World of Work

Prerequisite: ANT204Y5

===================
ANT351H5 Money, Markets, Gifts: Topics in Economic Anthropology

Prerequisite: ANT204Y5

===================
ANT352H5 Power, Authority, and Legitimacy: Topics in Political Anthropology

Prerequisite: ANT204Y5

===================
ANT358H5 Ethnographic Methods

Prerequisite: ANT204Y5

===================
ANT360H5 Anthropology of Religion

Exclusion: ANT209Y5

===================
ANT361H5 Anthropology of Sub-Saharan Africa

Exclusion: ANT212Y5

===================
ANT362H5 Language in Culture and Society

Prerequisite: ANT204Y5

===================
ANT363H5 Magic, Witchcraft and Science

Prerequisite: ANT360H5

===================
ANT364H5 Lab in Social Interaction

Prerequisite: ANT206H5

===================
ANT365H5 Semiotic Anthropology

Prerequisite: ANT204Y5

===================
ANT368H5 World Religions and Ecology

Exclusion: RLG311H5

===================
ANT369H5 Religious Violence and Nonviolence

Exclusion: RLG317H5

===================
ANT397H5 Independent Study

Prerequisite: Permission of Faculty Advisor


===================
ANT398Y5 Independent Reading

Prerequisite: Permission of Faculty Advisor


===================
ANT399Y5 Research Opportunity Program

Prerequisite: P.I.


===================
ANT401H5 Vocal and Visual Communication

Prerequisite: ANT102H5

===================
ANT414H5 People and Plants in Prehistory

Prerequisite: ANT200Y5

===================
ANT415H5 Faunal Archaeo-Osteology

Exclusion: ANT415Y5

===================
ANT416H5 Advanced Archaeological Analysis

Prerequisite: ANT312H5

===================
ANT418H5 Advanced Archaeological Fieldwork

Prerequisite: ANT318H5

===================
ANT430H5 Special Problems in Biological Anthropology and Archaeology

Prerequisite: P.I


===================
ANT430Y5 Special Problems in Biological Anthropology and Archaeology

Prerequisite: P.I. 


===================
ANT431Y5 Special Problems in Sociocultural or Linguistic Anthropology

Prerequisite: P.I.


===================
ANT431H5 Special Problems in Sociocultural or Linguistic Anthropology

Prerequisite: P.I.


===================
ANT432H5 Special Seminar in Anthropology

Prerequisite: P.I.


===================
ANT433H5 Genes, Language, Artifact and Mind

Prerequisite: ANT200Y5

===================
ANT434H5 Palaeopathology

Prerequisite: ANT334Y5

===================
ANT438H5 The Development of Thought in Biological Anthropology

Prerequisite: ANT203Y5

===================
ANT439Y5 Advanced Forensic Anthropology

Prerequisite: ANT205H5

===================
ANT441H5 Advanced Bioarchaeology

Prerequisite: ANT334H5

===================
ANT457H5 Anthropology and the Environment

Prerequisite: ANT102H5

===================
ANT458H5 Anthropology of Crime, Law and Order

Exclusion: ANT204Y5

===================
ANT459H5 The Ethnography of Speaking

Prerequisite: ANT206Y5

===================
ANT460H5 Theory in Sociocultural Anthropology

Prerequisite: ANT204Y5

===================
ANT461H5 Emergent Topics in Socio-Cultural &amp;  Linguistic Anthropology

Prerequisite: ANT204Y5

===================
ANT498H5 Advanced Independent Study

Prerequisite: P.I.


===================
ANT499Y5 Advanced Independent Research

Prerequisite: P.I.
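
For readers who would rather stay in Python, a similar per-course grouping can be sketched with the standard library's xml.etree.ElementTree. This is only a minimal sketch and not the transformation used above. It assumes the page has already been tidied into well-formed XML and that each p with class titlestyle is followed by a sibling span with class normaltext that holds the title2 labels. The SAMPLE markup and the extract_courses helper are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already tidied (well-formed) fragment modelled on the page source.
SAMPLE = """
<body>
  <p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
    <span class="distribution">(SCI)</span></p>
  <span class="normaltext">Course description.
    <span class="title2">Exclusion: </span><a href="#">ANT100Y5</a><br/>
    <span class="title2">Prerequisite: </span><a href="#">ANT102H5</a><br/>
  </span>
  <p class="titlestyle">ANT241Y5 Aboriginal Peoples of North America</p>
  <span class="normaltext">Course description.</span>
  <p class="titlestyle">ANT397H5 Independent Study</p>
  <span class="normaltext">Course description.
    <span class="title2">Prerequisite: </span>Permission of Faculty Advisor<br/>
  </span>
</body>
"""


def extract_courses(root):
    """Group each title with the labels found in the block that follows it."""
    courses = []
    for node in root:  # direct children, in document order
        css_class = node.get("class")
        if node.tag == "p" and css_class == "titlestyle":
            # node.text is the text before any nested element such as the distribution span
            courses.append({"title": (node.text or "").strip(), "details": []})
        elif node.tag == "span" and css_class == "normaltext" and courses:
            children = list(node)
            for i, child in enumerate(children):
                if child.tag != "span" or child.get("class") != "title2":
                    continue
                label = (child.text or "").strip()
                following = children[i + 1] if i + 1 < len(children) else None
                if following is not None and following.tag == "a":
                    value = (following.text or "").strip()
                else:
                    # No link right after the label: fall back to the plain text after it.
                    value = (child.tail or "").strip()
                courses[-1]["details"].append(f"{label} {value}".strip())
    return courses


courses = extract_courses(ET.fromstring(SAMPLE.strip()))
```

Because the labels are read relative to each title rather than by absolute position, a course with no Exclusion or Prerequisite simply ends up with an empty details list.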

Upvotes: 2
