Reputation: 2134
I am extracting the text of court judgments from a website and want to keep the line breaks (which I need later for the text analysis). Unfortunately, rvest's html_text() removes the line breaks, so two words originally separated by a \n are simply concatenated. For example, "GerichtAsylgerichtshof" should actually be "Gericht\nAsylgerichtshof".
library(rvest, quietly = TRUE, warn.conflicts = FALSE)
library(tidyverse, quietly = TRUE, warn.conflicts = FALSE)
test_url <- "https://www.ris.bka.gv.at//Dokumente/AsylGH/ASYLGHT_20131125_E5_408_113_1_2009_00/ASYLGHT_20131125_E5_408_113_1_2009_00.html"
# parse the page and select every block with class "contentBlock"
test_url_parsed <- test_url %>%
  xml2::read_html() %>%
  rvest::html_nodes(".contentBlock")
test_url_parsed
#> {xml_nodeset (5)}
#> [1] <div class="contentBlock">\n<h1 class="Titel AlignJustify">Gericht</h1>\n ...
#> [2] <div class="contentBlock">\n<h1 class="Titel AlignJustify">Entscheidungsd ...
#> [3] <div class="contentBlock">\n<h1 class="Titel AlignJustify">Geschäftszahl< ...
#> [4] <div class="contentBlock">\n<h1 class="Titel AlignJustify">Spruch</h1>\n< ...
#> [5] <div class="contentBlock">\n<h1 class="Titel AlignJustify">Text</h1>\n<p ...
# line break gets lost
x <- test_url_parsed %>%
  html_text()
x[1]
#> [1] "GerichtAsylgerichtshof"
Created on 2020-05-14 by the reprex package (v0.3.0)
I found a few promising leads on how to approach this, but unfortunately none of them worked for my specific case. See e.g. here (which replaces HTML <br> with \n) and the discussion here on GitHub.
Note that the line breaks \n appear not only in the headings (e.g. <h1>), but throughout the text (also in <p>).
Many thanks.
Upvotes: 0
Views: 730
Reputation: 1253
The problem is that you don't go to the deepest level of the tree before calling html_text.
If you instead apply it (e.g. with sapply) to the children of the node you are working on, you get each line as an element of a vector. For example, for node 1:
html_children(test_url_parsed[[1]]) %>% html_text
[1] "Gericht" "Asylgerichtshof"
Then you have to paste the parts together:
html_children(test_url_parsed[[1]]) %>% html_text %>% paste0(collapse = "\n")
[1] "Gericht\nAsylgerichtshof"
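The two steps above can be tried without network access. The following is only a minimal sketch: the inline HTML string is a made-up stand-in for one contentBlock of the real page, and the variable names (snippet, block, joined, parts, with_breaks) are illustrative assumptions.

```r
library(rvest)

# Hypothetical snippet imitating a single contentBlock (not the real page)
snippet <- '<div class="contentBlock"><h1 class="Titel AlignJustify">Gericht</h1><p>Asylgerichtshof</p></div>'

block <- html_node(read_html(snippet), ".contentBlock")

# Calling html_text() on the parent concatenates the text of all descendants
joined <- html_text(block)
joined
#> [1] "GerichtAsylgerichtshof"

# Extracting each child separately keeps the element boundaries,
# which paste0() then marks with "\n"
parts <- html_text(html_children(block))
with_breaks <- paste0(parts, collapse = "\n")
with_breaks
#> [1] "Gericht\nAsylgerichtshof"
```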
The following code runs the operation for all contentBlock nodes and their children:
> resPaste <- lapply(sapply(FUN = html_children, X = test_url_parsed), function(node) paste0(html_text(node), collapse = "\n"))
This produces the result you are after:
> str(resPaste)
List of 5
$ : chr "Gericht\nAsylgerichtshof"
$ : chr "Entscheidungsdatum\n25.11.2013"
$ : chr "Geschäftszahl\nE5 408113-1/2009"
$ : chr "Spruch\nZl. E5 408.113-1/2009/13E\n\nIM NAMEN DER REPUBLIK!\n\nDer Asylgerichtshof hat durch die Richterin Dr. "| __truncated__
$ : chr "Text\nEntscheidungsgründe:\n\nI. Verfahrensgang und Sachverhalt:\n\nI.1.1. Der Beschwerdeführer, ein irakischer"| __truncated__
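If a plain character vector is more convenient than a list, the same idea can be wrapped in a small helper. This is a sketch under assumptions, not part of rvest: text_with_breaks is a hypothetical name, and the inline HTML again stands in for the real page. Note that html_children() only returns element children, so text sitting directly inside the div (outside any child tag) would be dropped, and a node without children yields an empty string.

```r
library(rvest)

# Hypothetical helper: returns one string per node,
# with the text of its child elements joined by "\n"
text_with_breaks <- function(nodes) {
  vapply(
    nodes,
    function(node) paste0(html_text(html_children(node)), collapse = "\n"),
    character(1)
  )
}

# Made-up two-block document used only for illustration
page <- read_html(paste0(
  '<div class="contentBlock"><h1>Gericht</h1><p>Asylgerichtshof</p></div>',
  '<div class="contentBlock"><h1>Entscheidungsdatum</h1><p>25.11.2013</p></div>'
))

res <- text_with_breaks(html_nodes(page, ".contentBlock"))
res
#> [1] "Gericht\nAsylgerichtshof"       "Entscheidungsdatum\n25.11.2013"
```

With the real page, text_with_breaks(test_url_parsed) should give the same strings as resPaste above, just as a character vector instead of a list.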
Upvotes: 1