Reputation: 29
I am trying to scrape editor data from this page using the Python Scrapy framework.
The problem I am facing is that every tag is a sibling: the editor roles are inside h3 tags and the names are inside div tags, and all of them sit inside a div tag with id "editors-section". I can loop through each div tag like
response.css("#editors-section>div.row.align-items-center")
and collect the editor name and organization,
but how do I collect their respective roles? How can I loop through all the tags? Thanks.
Upvotes: 1
Views: 71
Reputation: 10666
This gives the same result with a slightly different approach (and a single for loop). I find each h3 element (the name) and get its role (the first h2 element above it) using the preceding XPath axis:
def parse(self, response):
    for h3_node in response.xpath('//div[@class="container"]//h3'):
        role = h3_node.xpath('normalize-space(./preceding::h2[1])').get()
        name = h3_node.xpath('normalize-space(.)').get()
        location = h3_node.xpath("normalize-space(./following-sibling::p[1])").get()
        if name and location:
            yield {
                "role": role,
                "name": name,
                "location": location,
            }
Upvotes: 0
Reputation: 17291
You can use relative XPath expressions with the following-sibling axis and check each sibling's root.tag attribute to detect the next role header. That lets you accurately determine each person's role.
For example:
def parse(self, response):
    for header in response.xpath("//h2"):
        # Guard against an h2 with no direct text node
        role = (header.xpath("./text()").get() or "").strip()
        for sibling in header.xpath("./following-sibling::*"):
            # Stop at the next role header; what follows belongs to that role
            if sibling.root.tag == "h2":
                break
            name = sibling.xpath(".//h3/*/text()").get()
            location = sibling.xpath(".//p[@class='mb-2']/text()").get()
            if name and location:
                yield {
                    "role": role,
                    "name": name.strip(),
                    "location": location.strip(),
                }
OUTPUT
[
{
"role": "Editors-in-Chief",
"name": "Hua Wang",
"location": "University of Electronic Science and Technology of China, China"
},
{
"role": "Editors-in-Chief",
"name": "Gabriele Morra",
"location": "University of Louisiana at Lafayette, USA"
},
{
"role": "Board Members",
"name": "Luca Caricchi",
"location": "University of Geneva, Switzerland"
},
{
"role": "Board Members",
"name": "Michael Fehler",
"location": "Massachusetts Institute of Technology, USA"
},
{
"role": "Board Members",
"name": "Peter Gerstoft",
"location": "Scripps Institution of Oceanography, USA"
},
{
"role": "Board Members",
"name": "Forrest M. Hoffman",
"location": "Oak Ridge National Laboratory, United States of America"
},
{
"role": "Board Members",
"name": "Xiangyun Hu",
"location": "China University of Geosciences, China"
},
{
"role": "Board Members",
"name": "Guangmin Hu",
"location": "University of Electronic Science and Technology of China, China"
},
{
"role": "Board Members",
"name": "Qingkai Kong",
"location": "UC Berkeley, USA"
},
{
"role": "Board Members",
"name": "Yuemin Li",
"location": "University of Electronic Science and Technology of China, China"
},
{
"role": "Board Members",
"name": "Hongjun Lin",
"location": "Zhejiang Normal University, China"
},
{
"role": "Board Members",
"name": "Aldo Lipani",
"location": "University College London, United Kingdom"
},
{
"role": "Board Members",
"name": "Zhigang Peng",
"location": "Georgia Institute of Technology, USA"
},
{
"role": "Board Members",
"name": "Piero Poli",
"location": "Grenoble Alpes University, France"
},
{
"role": "Board Members",
"name": "Kunfeng Qiu",
"location": "China University of Geoscience, China"
},
{
"role": "Board Members",
"name": "Calogero Schillaci",
"location": "JRC European Commission, Italy"
},
{
"role": "Board Members",
"name": "Hosein Shahnas",
"location": "University of Toronto, Canada"
},
{
"role": "Board Members",
"name": "Byung-Dal So",
"location": "Kangwon National University, South Korea"
},
{
"role": "Board Members",
"name": "Rui Wang",
"location": "China University of Geoscience, China"
},
{
"role": "Board Members",
"name": "Yong Wang",
"location": "East Carolina University, USA"
},
{
"role": "Board Members",
"name": "Zhiguo Wang",
"location": "Xi'an Jiaotong University, China"
},
{
"role": "Board Members",
"name": "Jun Xia",
"location": "Wuhan University, China"
},
{
"role": "Board Members",
"name": "Lizhi Xiao",
"location": "China University of Petroleum(Beijing), China"
},
{
"role": "Board Members",
"name": "Chicheng Xu",
"location": "Aramco Services Company, USA"
},
{
"role": "Board Members",
"name": "Zhibing Yang",
"location": "Wuhan University, China"
},
{
"role": "Board Members",
"name": "Nana Yoshimitsu",
"location": "Kyoto University, Japan"
},
{
"role": "Board Members",
"name": "Hongyan Zhang",
"location": "Wuhan University, China"
}
]
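If it helps to see the grouping logic on its own, here is a minimal sketch that runs without Scrapy. It is not the code above: it swaps XPath axes for a plain walk over the container's children using the standard library's xml.etree.ElementTree, remembering the last h2 seen as the current role. The SAMPLE markup and the group_by_role helper are hypothetical; the real page's structure is assumed from the selectors used above, and real-world HTML may not be well-formed enough for an XML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The structure (h2 role headers followed by
# sibling div rows) is an assumption based on the selectors in the answer.
SAMPLE = """
<div id="editors-section">
  <h2>Editors-in-Chief</h2>
  <div class="row"><h3><a>Hua Wang</a></h3><p class="mb-2">University of Electronic Science and Technology of China, China</p></div>
  <div class="row"><h3><a>Gabriele Morra</a></h3><p class="mb-2">University of Louisiana at Lafayette, USA</p></div>
  <h2>Board Members</h2>
  <div class="row"><h3><a>Luca Caricchi</a></h3><p class="mb-2">University of Geneva, Switzerland</p></div>
</div>
"""


def group_by_role(markup):
    """Walk the container's direct children in document order and attach
    each editor row to the most recent h2 header seen before it."""
    root = ET.fromstring(markup.strip())
    current_role = None
    items = []
    for child in root:
        if child.tag == "h2":
            # A new role header: every row until the next h2 belongs to it
            current_role = "".join(child.itertext()).strip()
            continue
        name_node = child.find(".//h3")
        location_node = child.find(".//p")
        if name_node is None or location_node is None:
            continue  # not an editor row
        items.append({
            "role": current_role,
            "name": "".join(name_node.itertext()).strip(),
            "location": "".join(location_node.itertext()).strip(),
        })
    return items


editors = group_by_role(SAMPLE)
for editor in editors:
    print(editor)
```

The point is the same as the break on h2 in the spider: the role is not a parent of the rows, so you carry it along as state while moving through the siblings.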
Upvotes: 1