Reputation: 2616
I am using the scraper library to parse an HTML document and find the node with ID foo.
I would like to use this node for further operations. For this example, I'm trying to reach some nested children with class inner and retrieve the innerText of those children.
use scraper::{Html, Selector};

fn main() {
    let html = String::from(
        r#"
        <html>
            <head>
                <title>Test</title>
            </head>
            <body>
                <div id="foo"><div></div><div><div></div><div class="inner"><span>x<div>yo</div></span></div></div></div>
            </body>
        </html>
        "#,
    );
    let parsed_html = Html::parse_document(&html);
    let fragment = parsed_html
        .select(&Selector::parse("body").unwrap())
        .next()
        .unwrap();
    let foo = fragment
        .select(&Selector::parse("div#foo").unwrap())
        .next()
        .unwrap();
    let text = foo
        .children()
        .nth(1)
        .unwrap()
        .children()
        .nth(1)
        .unwrap()
        .children()
        .map(|child| child.value())
        .collect::<Vec<_>>();
    println!("{:?}", text);
}
My Cargo.toml file:
[package]
name = "scraper"
version = "0.1.0"
authors = ["foo@bar"]
edition = "2018"
[dependencies]
scraper = "0.12.0"
The output of rustup show:
Default host: x86_64-apple-darwin
rustup home: /Users/foobar/.rustup
stable-x86_64-apple-darwin (directory override for '/Users/foobar')
rustc 1.43.1 (8d69840ab 2020-05-04)
The console prints [Element(<span>)], which is the result of the mapping closure where I call the value method on each child.
The outcome I'm expecting is xyo.
Does the scraper crate have a method that can extract the text the way I want, or would I have to write some kind of recursive function?
I know this code is error-prone, and I will use match expressions to handle cases where certain nodes aren't present in a document. For now I'm only focusing on how to get the innerText of child nodes.
Upvotes: 3
Views: 4099
Reputation: 1427
scraper has a method to extract the text: ElementRef::text, which returns an iterator over the element's descendant text nodes.
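On the question of whether a recursive function is needed: the method above spares you from writing one. For intuition only, here is a minimal sketch of the idea on a toy tree, assuming that extracting text amounts to visiting every descendant text node in document order. The Node enum and the helpers el, text, inner_text and sample_inner are hypothetical names made up for this illustration; they are not scraper's types or API.

```rust
// Toy model of an HTML tree. These types are hypothetical and are NOT
// part of the scraper crate; they only illustrate the traversal idea.
enum Node {
    Text(String),
    Element(Vec<Node>),
}

// Depth-first walk in document order, appending each text node to `out`.
fn collect_text(node: &Node, out: &mut String) {
    match node {
        Node::Text(t) => out.push_str(t),
        Node::Element(children) => {
            for child in children {
                collect_text(child, out);
            }
        }
    }
}

// Convenience wrapper returning the concatenated text of a subtree.
fn inner_text(node: &Node) -> String {
    let mut out = String::new();
    collect_text(node, &mut out);
    out
}

// Small constructors to keep the sample tree readable.
fn el(children: Vec<Node>) -> Node {
    Node::Element(children)
}

fn text(s: &str) -> Node {
    Node::Text(s.to_string())
}

// Models <div class="inner"><span>x<div>yo</div></span></div> from the question.
fn sample_inner() -> Node {
    el(vec![el(vec![text("x"), el(vec![text("yo")])])])
}

fn main() {
    println!("{}", inner_text(&sample_inner())); // prints "xyo"
}
```

With scraper itself you would not write this by hand; collecting the iterator returned by the text method into a String gives the same concatenation.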
A way to achieve what you're looking for from the .children() calls would be:
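// Requires `use scraper::ElementRef;` in addition to the existing imports.
...
    .children()
    .filter_map(|child| ElementRef::wrap(child)) // keep only element nodes
    .flat_map(|el| el.text())
    .collect::<Vec<_>>(); // Or `.collect::<String>()` if you want xyo concatenated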
However, given your example, I feel you may want to use a selector to get the ElementRef that corresponds to your target directly, instead of doing the work with lots of .children() calls:
let inner: String = parsed_html
    // `.nth(1)` is zero-based, so it corresponds to the one-based `:nth-child(2)`.
    .select(&Selector::parse("div#foo > div:nth-child(2) > div:nth-child(2)").unwrap()) // or simply "body div#foo div.inner"
    .flat_map(|el| el.text())
    .collect();
This would look closer to what is in the scraper documentation.
Upvotes: 5