user3056783

Reputation: 2616

How do I get the inner text of elements using the scraper crate?

I am using the scraper library to parse an HTML document and find the node with ID foo.

I would like to use this node for further operations. For this example, I'm trying to reach some nested children with class inner and retrieve the innerText of those children.

use scraper::{Html, Selector};

fn main() {
    let html = String::from(
        r#"
      <html>
        <head>
          <title>Test</title>
        </head>
        <body>
          <div id="foo"><div></div><div><div></div><div class="inner"><span>x<div>yo</div></span></div></div></div>
        </body>
      </html>
    "#,
    );

    let parsed_html = Html::parse_document(&html);
    let fragment = parsed_html
        .select(&Selector::parse("body").unwrap())
        .next()
        .unwrap();
    let foo = fragment
        .select(&Selector::parse("div#foo").unwrap())
        .next()
        .unwrap();

    let text = foo
        .children()
        .nth(1)
        .unwrap()
        .children()
        .nth(1)
        .unwrap()
        .children()
        .map(|child| child.value())
        .collect::<Vec<_>>();

    println!("{:?}", text);
}

My Cargo.toml file:

[package]
name = "scraper"
version = "0.1.0"
authors = ["foo@bar"]
edition = "2018"

[dependencies]
scraper = "0.12.0"

The output of rustup show:

Default host: x86_64-apple-darwin
rustup home:  /Users/foobar/.rustup

stable-x86_64-apple-darwin (directory override for '/Users/foobar')
rustc 1.43.1 (8d69840ab 2020-05-04)

The console prints [Element(<span>)], which is the result of the mapping function where I call the value method on each child.

The outcome I'm expecting is xyo.

Does the scraper crate have a method that can extract the text the way I want, or would I have to write some kind of recursive function?

I know this code is error-prone, and I will use match to handle cases where certain nodes aren't present in a document. For now I'm only focusing on how to get the innerText of the child nodes.

Upvotes: 3

Views: 4099

Answers (1)

Ten

Reputation: 1427

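scraper has a method to extract the text: ElementRef::text. It returns an iterator over the text of all descendant nodes, so you don't need to write the recursion yourself.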

A way to achieve what you're looking for from the .children() calls would be:

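// Needs `use scraper::ElementRef;` in addition to Html and Selector.
...
.children()
// children() yields plain nodes; keep only those that are elements.
.filter_map(|child| ElementRef::wrap(child))
.flat_map(|el| el.text())
.collect::<Vec<_>>(); // Or `.collect::<String>()` if you want xyo concatenated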
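On the "recursive function" part of the question: as a rough mental model only, a descendant-text iterator like text() does the same job as a depth-first walk that appends each text node in document order. The sketch below uses a made-up Node type and hypothetical helper names (collect_text, inner_text, example_tree), not scraper's real types, just to show what that walk looks like if written by hand.

```rust
// Hypothetical, simplified stand-in for an HTML tree (NOT scraper's types).
enum Node {
    Text(String),
    Element(Vec<Node>),
}

// Depth-first walk: append every text node to `out` in document order.
fn collect_text(node: &Node, out: &mut String) {
    match node {
        Node::Text(t) => out.push_str(t),
        Node::Element(children) => {
            for child in children {
                collect_text(child, out);
            }
        }
    }
}

// Convenience wrapper returning the concatenated text of a subtree.
fn inner_text(node: &Node) -> String {
    let mut out = String::new();
    collect_text(node, &mut out);
    out
}

// Mirrors <div class="inner"><span>x<div>yo</div></span></div> from the question.
fn example_tree() -> Node {
    Node::Element(vec![Node::Element(vec![
        Node::Text("x".to_string()),
        Node::Element(vec![Node::Text("yo".to_string())]),
    ])])
}

fn main() {
    println!("{}", inner_text(&example_tree())); // prints "xyo"
}
```

With scraper itself none of this is needed; it is only meant to show why collecting the text iterator into a String gives xyo.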

However, given your example, I feel you may want to use a selector to directly get the ElementRef that corresponds to your target instead of doing the work with lots of .children()s:

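let inner: String = parsed_html
    // `>` steps down one level and :nth-child is 1-based, so this mirrors the
    // two `.children().nth(1)` calls. A shorter alternative: "div#foo div.inner"
    .select(&Selector::parse("div#foo > :nth-child(2) > :nth-child(2)").unwrap())
    .flat_map(|el| el.text())
    .collect();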

This would look closer to what is in the scraper documentation.

Upvotes: 5
