How to keep linebreaks when getting text with rvest

Question

I am extracting the text of court judgments from a website and want to keep the linebreaks (which I need later for the text analysis). Unfortunately, rvest's html_text removes the linebreaks, so that two words originally separated by a linebreak are simply concatenated. For example, "GerichtAsylgerichtshof" should actually be "Gericht Asylgerichtshof".

library(rvest, quietly = TRUE, warn.conflicts = FALSE)
library(tidyverse, quietly = TRUE, warn.conflicts = FALSE)

test_url <- "https://www.ris.bka.gv.at//Dokumente/AsylGH/ASYLGHT_20131125_E5_408_113_1_2009_00/ASYLGHT_20131125_E5_408_113_1_2009_00.html"

test_url_parsed <- test_url %>% 
  xml2::read_html() %>% 
  rvest::html_nodes(".contentBlock") 
test_url_parsed
#> {xml_nodeset (5)}
#> [1] Gericht ...
#> [2] Entscheidungsd ...
#> [3] Geschäftszahl ...
#> [4] Spruch ...
#> [5] Text ...

x <- test_url_parsed %>% 
  html_text()
x[1]
#> [1] "GerichtAsylgerichtshof"

Created on 2020-05-14 by the reprex package (v0.3.0)

I found a few promising leads on how to approach the matter, but unfortunately didn't succeed with my specific problem. See e.g. here (which replaces HTML <br> with \n) and the discussion here on GitHub.

Note that the linebreaks (<br>) appear not only in the headings (e.g. <h1>), but throughout the text (also in <p>).

Many thanks.

Grada Gukovic · Accepted Answer

The problem is that you don't go to the deepest level of the tree before calling html_text.

If you call html_text on each child of the node you are working on (e.g. via sapply over its children), you get each line as an element of a vector. For example, for node 1:

html_children(test_url_parsed[[1]]) %>% html_text
[1] "Gericht"         "Asylgerichtshof"

Then you have to paste the parts together:

html_children(test_url_parsed[[1]]) %>% html_text %>% paste0(collapse = "\n")
[1] "Gericht\nAsylgerichtshof"
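
The same idea can be tried without network access on a small inline fragment. This is only a minimal sketch: the markup below is a hypothetical stand-in for one contentBlock (the tags on the real page may differ), and the variable names are illustrative.

```r
library(rvest)

# Hypothetical fragment standing in for one ".contentBlock" node;
# the real page's markup is an assumption here.
html <- '<div class="contentBlock"><h1>Gericht</h1><p>Asylgerichtshof</p></div>'
block <- xml2::read_html(html) %>% html_nodes(".contentBlock")

# html_text() on the parent node concatenates the text of all descendants
concatenated <- html_text(block)

# html_text() on each child returns one element per child node
parts <- html_children(block[[1]]) %>% html_text()

# Join the pieces with an explicit newline
joined <- paste0(parts, collapse = "\n")
joined
#> [1] "Gericht\nAsylgerichtshof"
```

Note that this only separates the direct children of the node; a linebreak inside a single child (e.g. a <br> within a <p>) is not recovered this way.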

The following code runs the operation for all contentBlock nodes and their children:

> resPaste <- lapply(sapply(FUN = html_children, X = test_url_parsed), function(node) paste0(html_text(node), collapse = "\n"))

This produces the result you wish:

> str(resPaste)
List of 5
 $ : chr "Gericht\nAsylgerichtshof"
 $ : chr "Entscheidungsdatum\n25.11.2013"
 $ : chr "Geschäftszahl\nE5 408113-1/2009"
 $ : chr "Spruch\nZl. E5 408.113-1/2009/13E\n\nIM NAMEN DER REPUBLIK!\n\nDer Asylgerichtshof hat durch die Richterin Dr. "| __truncated__
 $ : chr "Text\nEntscheidungsgründe:\n\nI. Verfahrensgang und Sachverhalt:\n\nI.1.1. Der Beschwerdeführer, ein irakischer"| __truncated__
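
If some children themselves contain <br> tags, pasting the children together will not recover those breaks. One possible way to handle this, sketched below on a hypothetical fragment (not the actual page markup), is to insert a newline-bearing sibling after each <br> with xml2 before extracting the text, and then join the children as above. This is a different technique from the one in this answer, shown here under those assumptions. As a further alternative, rvest 1.0.0 and later provide html_text2(), which converts <br> into "\n"; it was not available in the rvest version used in the question.

```r
library(rvest)
library(xml2)

# Hypothetical fragment with <br> inside a heading and a paragraph;
# the real page's markup is an assumption here.
html <- '<div class="contentBlock"><h1>Gericht<br>Asylgerichtshof</h1><p>Zeile 1<br>Zeile 2</p></div>'
doc <- read_html(html)

# Add a <span> holding "\n" after every <br>, then drop the <br> nodes.
# This modifies the parsed document in place.
br_nodes <- xml_find_all(doc, ".//br")
xml_add_sibling(br_nodes, "span", "\n")
xml_remove(br_nodes)

# Extract the text child by child and join with newlines, as in the answer
block <- html_nodes(doc, ".contentBlock")
joined <- paste0(html_text(html_children(block[[1]])), collapse = "\n")

# Alternative for newer rvest versions: html_text2() handles <br> itself.
# Parsed afresh because `doc` was modified above.
if (utils::packageVersion("rvest") >= "1.0.0") {
  alt <- read_html(html) %>% html_nodes(".contentBlock") %>% html_text2()
}
```

The two variants need not produce identical whitespace, so it is worth checking which one suits the later text analysis.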
