Reputation: 114035
First of all, this is a follow-up to my previous question. I am posting it again on the advice of the person whose answer I accepted there, who felt that the original question was not properly defined. Here goes attempt 2:
I am trying to get information out of this webpage. For clarity, here is a block of the page source:
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span></p>
<span class='normaltext'>
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [<span class='Helpcourse'
onMouseover="showtip(this,event,'24 Lectures')"
onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
onMouseover="showtip(this,event,'12 Tutorials')"
onMouseout="hidetip()">12T</span>]<br>
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br>
From the sample block above, I would like to extract the following information:
ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5
Prerequisite: ANT102H5
I would like to get all such information from the webpage (keep in mind that some courses may have an additionally listed "Corequisite" as well or may not have any pre/co requisites or exclusions listed at all).
I have been trying to write an appropriate xpath expression for this task, but I seem to not be able to get it just right.
Thus far, with the help of Dimitre Novatchev, I have been able to use the following expression:
sites = hxs.select("(//p[@class='titlestyle'])[2]/text()[1] | (//span[@class='title2'])[2]/text() | \
(//span[@class='title2'])[2]/following-sibling::a[1]/text() | (//span[@class='title2'])[3]/text() | \
(//span[@class='title2'])[3]/following-sibling::a[1]/text()")
However, it produces the following output, which seems to get the information for only the first course on the page:
[{"desc": "ANT101H5 Introduction to Biological Anthropology and Archaeology \n "},
{"desc": "Exclusion: "},
{"desc": "ANT100Y5"},
{"desc": "Prerequisite: "},
{"desc": "ANT102H5"}]
Just to be absolutely clear, this output is correct only insofar as it gets the right information for the first course. I need the same information for all courses listed on that webpage.
I'm so close but I don't seem to be able to figure out that last step.
I'd appreciate any help... thanks in advance
Upvotes: 2
Views: 2001
Reputation: 26108
Instead of [<int>], try something like [position() mod <offset> = <base>].
Here <offset> is the distance between consecutive nodes you are interested in, and <base> picks which node in each group of <offset> to keep. The offset may be different for @class='titlestyle' and @class='title2'.
sites = hxs.select("(//p[@class='titlestyle'])[position() mod <offset to next to match> = 2]/text()[1] | (//span[@class='title2'])[position() mod <offset to next to match> = 2]/text() | \
(//span[@class='title2'])[position() mod <offset to next to match> = 2]/following-sibling::a[1]/text() | (//span[@class='title2'])[position() mod <offset to next to match> = 3]/text() | \
(//span[@class='title2'])[position() mod <offset to next to match> = 3]/following-sibling::a[1]/text()")
EDIT: As requested.
Run each individual XPath on its own, without constraining its position. This is a manual fact-finding exercise to determine the final values to use in the XPath.
Return all nodes matching the following XPath (this is the first one):
sites = hxs.select("(//p[@class='titlestyle'])/text()[1]")
sites will contain some nodes you want and some you do not.
You have already determined that, for this one, the 2nd node is the first one you want. Now count the distance to the next node in sites that you want this rule to match. This is what we can refer to as <offset to next to match>.
Now repeat the above for each of the remaining XPath searches.
Think of hxs.select("") as a filter: as it walks the XML, every node that matches your XPath is returned.
Here is an example http://zvon.org/xxl/XPathTutorial/Output/example22.html
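To see what the predicate does before plugging numbers into the real expression, here is a small Python sketch that mimics [position() mod <offset> = <base>] on a plain list. The helper name select_every and the sample list are made up for illustration; they are not part of Scrapy or of the page.

```python
def select_every(nodes, offset, base):
    """Mimic the XPath predicate [position() mod offset = base] on a list."""
    # XPath positions start at 1, so enumerate from 1 rather than 0.
    return [
        node
        for position, node in enumerate(nodes, start=1)
        if position % offset == base
    ]


# Hypothetical result of an unconstrained query: wanted nodes sit at
# positions 2, 5 and 8, so the offset is 3 and the base is 2.
matches = [
    "skip-a", "course-1", "skip-b",
    "skip-c", "course-2", "skip-d",
    "skip-e", "course-3",
]

wanted = select_every(matches, offset=3, base=2)
print(wanted)
```

Note that the remainder is always smaller than the offset, so if the node you want sits exactly at a multiple of the offset, compare against 0 rather than against the offset itself.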
Upvotes: 0
Reputation: 243599
The required single XPath expression to select the relevant data for all courses is quite messy, so here I am taking another approach, which can be used (if necessary at all) to produce that single XPath expression:
This simple XSLT transformation:
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <!-- Each course title starts a new record: print a separator, then the title text -->
 <xsl:template match="p[@class='titlestyle']">
  <xsl:text>
===================
</xsl:text>
  <xsl:value-of select="text()[1]"/>
 </xsl:template>

 <!-- Only the first title2 label inside each enclosing span is matched -->
 <xsl:template match=
  "span/span[@class='title2'][not(position() >1)]">
  <xsl:text>
</xsl:text>
  <xsl:value-of select="."/>
  <xsl:value-of select="following-sibling::a[1]"/>
  <!-- If no link follows the label, fall back to the next text node -->
  <xsl:if test="not(following-sibling::a)">
   <xsl:value-of select="following-sibling::text()[1]"/>
  </xsl:if>
  <xsl:text>
</xsl:text>
 </xsl:template>

 <!-- Suppress every other text node -->
 <xsl:template match="text()"/>
</xsl:stylesheet>
when applied on the page at: http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html (tidied up to become a well-formed XML document), produces the wanted result:
===================
Anthropology
===================
ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5
===================
ANT102H5 Introduction to Sociocultural and Linguistic Anthropology
Exclusion: ANT100Y5
===================
ANT200Y5 World Archaeology and Prehistory
Prerequisite: 101H5
===================
ANT203Y5 Biological Anthropology
Prerequisite: 101H5
===================
ANT204Y5 Sociocultural Anthropology
Prerequisite: 101H5
===================
ANT205H5 Introduction to Forensic Anthropology
Prerequisite: 101H5
===================
ANT206Y5 Culture and Communication: Introduction to Linguistic Anthropology
Exclusion: ANT206H5
===================
ANT241Y5 Aboriginal Peoples of North America
===================
ANT299Y5 Research Opportunity Program
===================
ANT304H5 Anthropology and Aboriginal Peoples
Exclusion: ANT304Y5
===================
ANT306H5 Forensic Anthropology Field School
Prerequisite: ANT205H5
===================
ANT308H5 Case Studies in Archaeological Botany and Zoology
Prerequisite: ANT200Y5
===================
ANT309H5 Southeast Asian Archaeology
Prerequisite: ANT200Y5
===================
ANT310H5 Complex Societies
Prerequisite: ANT200Y5
===================
ANT312H5 Archaeological Analysis
Prerequisite: ANT200Y5
===================
ANT313H5 China, Korea and Japan in Prehistory
Prerequisite: ANT200Y5
===================
ANT314H5 Archaeological Theory
Exclusion: ANT411H5
===================
ANT316H5 South Asian Archaeology
Prerequisite: ANT200Y5
===================
ANT317H5 Archaeology of Eastern North America
Prerequisite: ANT200Y5
===================
ANT318H5 Archaeological Fieldwork
Prerequisite: ANT200Y5
===================
ANT320H5 Archaeological Approaches to Technology
Prerequisite: ANT200Y5
===================
ANT322H5 Anthropology of Youth Culture
Exclusion: ANT204Y5
===================
ANT327H5 Agricultural Origins: The Second Revolution
Prerequisite: ANT200Y5
===================
ANT331H5 The Biology of Human Sexuality
Exclusion: ANT330H5
===================
ANT332H5 Human Origins
Exclusion: ANT332Y5
===================
ANT333H5 Human Origins II
Exclusion: ANT332Y5
===================
ANT334H5 Human Osteology
Exclusion: ANT334Y5
===================
ANT335H5 Anthropology of Gender
Exclusion: ANT331Y5
===================
ANT336H5 Molecular Anthropology
Prerequisite: ANT203Y5
===================
ANT338H5 Laboratory Methods in Biological Anthropology
Prerequisite: ANT203Y5
===================
ANT339Y5 Human Adaptation through Biological and Cultural Means
Prerequisite: ANT203Y5
===================
ANT340H5 Osteological Theory
Exclusion: ANT334Y5
===================
ANT350H5 Globalization and the Changing World of Work
Prerequisite: ANT204Y5
===================
ANT351H5 Money, Markets, Gifts: Topics in Economic Anthropology
Prerequisite: ANT204Y5
===================
ANT352H5 Power, Authority, and Legitimacy: Topics in Political Anthropology
Prerequisite: ANT204Y5
===================
ANT358H5 Ethnographic Methods
Prerequisite: ANT204Y5
===================
ANT360H5 Anthropology of Religion
Exclusion: ANT209Y5
===================
ANT361H5 Anthropology of Sub-Saharan Africa
Exclusion: ANT212Y5
===================
ANT362H5 Language in Culture and Society
Prerequisite: ANT204Y5
===================
ANT363H5 Magic, Witchcraft and Science
Prerequisite: ANT360H5
===================
ANT364H5 Lab in Social Interaction
Prerequisite: ANT206H5
===================
ANT365H5 Semiotic Anthropology
Prerequisite: ANT204Y5
===================
ANT368H5 World Religions and Ecology
Exclusion: RLG311H5
===================
ANT369H5 Religious Violence and Nonviolence
Exclusion: RLG317H5
===================
ANT397H5 Independent Study
Prerequisite: Permission of Faculty Advisor
===================
ANT398Y5 Independent Reading
Prerequisite: Permission of Faculty Advisor
===================
ANT399Y5 Research Opportunity Program
Prerequisite: P.I.
===================
ANT401H5 Vocal and Visual Communication
Prerequisite: ANT102H5
===================
ANT414H5 People and Plants in Prehistory
Prerequisite: ANT200Y5
===================
ANT415H5 Faunal Archaeo-Osteology
Exclusion: ANT415Y5
===================
ANT416H5 Advanced Archaeological Analysis
Prerequisite: ANT312H5
===================
ANT418H5 Advanced Archaeological Fieldwork
Prerequisite: ANT318H5
===================
ANT430H5 Special Problems in Biological Anthropology and Archaeology
Prerequisite: P.I
===================
ANT430Y5 Special Problems in Biological Anthropology and Archaeology
Prerequisite: P.I.
===================
ANT431Y5 Special Problems in Sociocultural or Linguistic Anthropology
Prerequisite: P.I.
===================
ANT431H5 Special Problems in Sociocultural or Linguistic Anthropology
Prerequisite: P.I.
===================
ANT432H5 Special Seminar in Anthropology
Prerequisite: P.I.
===================
ANT433H5 Genes, Language, Artifact and Mind
Prerequisite: ANT200Y5
===================
ANT434H5 Palaeopathology
Prerequisite: ANT334Y5
===================
ANT438H5 The Development of Thought in Biological Anthropology
Prerequisite: ANT203Y5
===================
ANT439Y5 Advanced Forensic Anthropology
Prerequisite: ANT205H5
===================
ANT441H5 Advanced Bioarchaeology
Prerequisite: ANT334H5
===================
ANT457H5 Anthropology and the Environment
Prerequisite: ANT102H5
===================
ANT458H5 Anthropology of Crime, Law and Order
Exclusion: ANT204Y5
===================
ANT459H5 The Ethnography of Speaking
Prerequisite: ANT206Y5
===================
ANT460H5 Theory in Sociocultural Anthropology
Prerequisite: ANT204Y5
===================
ANT461H5 Emergent Topics in Socio-Cultural & Linguistic Anthropology
Prerequisite: ANT204Y5
===================
ANT498H5 Advanced Independent Study
Prerequisite: P.I.
===================
ANT499Y5 Advanced Independent Research
Prerequisite: P.I.
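If you would rather stay in Python than run an XSLT processor, the same per-course idea can be sketched with the standard library's html.parser instead of XPath. This is only a rough sketch under assumptions: CourseParser and extract_courses are names I made up, the first record in SAMPLE is trimmed from the block in the question, and the other two records are markup reconstructed by analogy, not copied from the page. Unlike the stylesheet above, it keeps every label for a course and collects all text up to the next <br>, not just the first link.

```python
from html.parser import HTMLParser

# First record trimmed from the question; the other two are assumed markup.
SAMPLE = """
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span></p>
<span class='normaltext'>
This course is a survey of biological anthropology and archaeology. [<span class='Helpcourse'>24L</span>]<br>
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br>
</span>
<p class="titlestyle">ANT102H5 Introduction to Sociocultural and Linguistic Anthropology</p>
<span class='normaltext'>
<span class='title2'>Exclusion: </span><a href='#'>ANT100Y5</a><br>
</span>
<p class="titlestyle">ANT241Y5 Aboriginal Peoples of North America</p>
<span class='normaltext'>No requirements listed.<br></span>
"""


class CourseParser(HTMLParser):
    """Start a record at each titlestyle paragraph and attach title2 labels to it."""

    def __init__(self):
        super().__init__()
        self.courses = []    # one dict per course, in document order
        self._state = None   # "title", "label", "value" or None
        self._label = ""

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "p" and css_class == "titlestyle":
            self.courses.append({"title": "", "requirements": {}})
            self._state = "title"
        elif tag == "span" and self._state == "title":
            # A nested span ends the title, like text()[1] in the XPath.
            self._state = None
        elif tag == "span" and css_class == "title2" and self.courses:
            self._label = ""
            self._state = "label"
        elif tag == "br":
            # A line break ends whatever value was being collected.
            self._state = None

    def handle_endtag(self, tag):
        if tag == "p" and self._state == "title":
            self._state = None
        elif tag == "span" and self._state == "label":
            # "Exclusion: " becomes the key "Exclusion"; its value follows.
            self._label = self._label.strip().rstrip(":")
            self.courses[-1]["requirements"][self._label] = ""
            self._state = "value"

    def handle_data(self, data):
        if self._state == "title":
            self.courses[-1]["title"] += data
        elif self._state == "label":
            self._label += data
        elif self._state == "value":
            self.courses[-1]["requirements"][self._label] += data


def extract_courses(html):
    """Return a list of {"title": ..., "requirements": {...}} dicts."""
    parser = CourseParser()
    parser.feed(html)
    parser.close()
    return [
        {
            "title": course["title"].strip(),
            "requirements": {
                label: value.strip()
                for label, value in course["requirements"].items()
            },
        }
        for course in parser.courses
    ]


for course in extract_courses(SAMPLE):
    print(course["title"])
    for label, value in course["requirements"].items():
        print(f"  {label}: {value}")
```

Because each label is attached to the most recent title, courses with a Corequisite, with several labels, or with none at all need no special handling.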
Upvotes: 2