zoowalk

Reputation: 2134

How to keep linebreaks when getting text with rvest

I am extracting the text of court judgments from a website and want to keep the linebreaks, which I need later for the text analysis. Unfortunately, rvest's html_text removes the linebreaks, so two words originally separated by a \n are simply concatenated. For example, "GerichtAsylgerichtshof" should actually be "Gericht\nAsylgerichtshof".

library(rvest, quietly = TRUE, warn.conflicts = FALSE)
library(tidyverse, quietly = TRUE, warn.conflicts = FALSE)

test_url <- "https://www.ris.bka.gv.at//Dokumente/AsylGH/ASYLGHT_20131125_E5_408_113_1_2009_00/ASYLGHT_20131125_E5_408_113_1_2009_00.html"

test_url_parsed <- test_url %>% 
  xml2::read_html() %>% 
  rvest::html_nodes(".contentBlock") 
test_url_parsed
#> {xml_nodeset (5)}
#> [1] <div class="contentBlock">\n<h1 class="Titel AlignJustify">Gericht</h1>\n ...
#> [2] <div class="contentBlock">\n<h1 class="Titel AlignJustify">Entscheidungsd ...
#> [3] <div class="contentBlock">\n<h1 class="Titel AlignJustify">Geschäftszahl< ...
#> [4] <div class="contentBlock">\n<h1 class="Titel AlignJustify">Spruch</h1>\n< ...
#> [5] <div class="contentBlock">\n<h1 class="Titel AlignJustify">Text</h1>\n<p  ...

# linebreak gets lost
x <- test_url_parsed %>% 
  html_text()
x[1]
#> [1] "GerichtAsylgerichtshof"

Created on 2020-05-14 by the reprex package (v0.3.0)

I found a few promising leads on how to approach the matter, but unfortunately could not make them work for my specific case. See e.g. here (which replaces HTML <br> with \n) and the discussion here on GitHub.

Note that the linebreaks \n appear not only in the headings (e.g. <h1>), but throughout the text (also in <p>).

Many thanks.

Upvotes: 0

Views: 730

Answers (1)

Grada Gukovic

Reputation: 1253

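The problem is that you call html_text on the whole contentBlock node, which concatenates the text of everything below it without any separator. You need to go one level down the tree before calling html_text.

If you call html_text on the children of the node you are working on, each child (heading or paragraph) becomes one element of a character vector. For example, for node 1: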

html_children(test_url_parsed[[1]]) %>% html_text
[1] "Gericht"         "Asylgerichtshof"

Then you have to paste the parts together:

html_children(test_url_parsed[[1]]) %>% html_text %>% paste0(collapse = "\n")
[1] "Gericht\nAsylgerichtshof"
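To try this without hitting the website, here is a minimal sketch on an inline HTML string. The snippet, the second example document and the helper name collapse_children are made up for illustration; only html_nodes, html_children and html_text come from rvest. It also shows one caveat of this approach: text that sits directly inside the div, outside any child element, is not returned by html_children and is therefore dropped.

```r
library(rvest)

# Hypothetical stand-in for one .contentBlock of the page (no network needed)
doc <- xml2::read_html(
  '<div class="contentBlock"><h1>Gericht</h1><p>Asylgerichtshof</p></div>'
)
block <- html_nodes(doc, ".contentBlock")[[1]]

# Hypothetical helper: text of each child element, joined with a separator
collapse_children <- function(node, sep = "\n") {
  paste0(html_text(html_children(node)), collapse = sep)
}

flat   <- html_text(block)          # "GerichtAsylgerichtshof"
joined <- collapse_children(block)  # "Gericht\nAsylgerichtshof"

# Caveat: "loose text" is a bare text node, not a child element, so it is lost
doc2  <- xml2::read_html(
  '<div class="contentBlock"><h1>A</h1>loose text<p>B</p></div>'
)
loose <- collapse_children(html_nodes(doc2, ".contentBlock")[[1]])  # "A\nB"
```

If your blocks can contain such bare text, check a few results against the page before relying on this approach.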

The following code runs the operation for all contentBlock nodes and their children:

> resPaste <- lapply(test_url_parsed, function(node) paste0(html_text(html_children(node)), collapse = "\n"))

This produces the result you want:

> str(resPaste)
List of 5
 $ : chr "Gericht\nAsylgerichtshof"
 $ : chr "Entscheidungsdatum\n25.11.2013"
 $ : chr "Geschäftszahl\nE5 408113-1/2009"
 $ : chr "Spruch\nZl. E5 408.113-1/2009/13E\n\nIM NAMEN DER REPUBLIK!\n\nDer Asylgerichtshof hat durch die Richterin Dr. "| __truncated__
 $ : chr "Text\nEntscheidungsgründe:\n\nI. Verfahrensgang und Sachverhalt:\n\nI.1.1. Der Beschwerdeführer, ein irakischer"| __truncated__
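As a possible alternative, newer rvest releases (1.0.0 and later, as far as I know) provide html_text2(), which tries to mimic how a browser lays out text and turns <br> and block-level elements into line breaks. The sketch below uses a made-up inline snippet rather than the real page, and I have not checked it against the actual judgments; the exact number of \n between a heading and a paragraph may differ from the paste0 result above, so inspect the output before building on it.

```r
library(rvest)

# Hypothetical stand-in for the page; assumes rvest >= 1.0.0 for html_text2()
doc <- xml2::read_html(
  '<div class="contentBlock"><h1>Gericht</h1><p>Asylgerichtshof</p></div>'
)
blocks <- html_nodes(doc, ".contentBlock")

# One string per contentBlock, with line breaks inserted between block elements
txt <- html_text2(blocks)

# Print the result to see where the line breaks ended up
cat(txt, sep = "\n---\n")
```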

Upvotes: 1
