Reputation: 23
I'm having trouble tidying a table that I scraped from a website. I want to get the table (with headers V1 to V5) from the link below, but I can't reproduce its layout in RStudio.
This is what I'm doing:
url <- "https://www.r-bloggers.com/2018/08/using-control-charts-in-r/"
library(rvest)
library(tidyverse)
h <- read_html(url)
tab <- h %>% html_nodes("table")
tab <- tab[[2]] %>% html_table()
tab <- separate_rows(tab, 1, sep = " ")
tab <- tab[8:132,]
tab <- as.data.frame(tab)
tab1 <- data.frame(c("V1", "V2", "V3", "V4", "V5"))
tab1 <- tab1 %>% setNames("Cat")
tab2 <- cbind(tab1,tab)
tab3 <- tab2 %>% spread(key = Cat, X1)
Here is the result:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 125 rows:
* 1, 6, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61, 66, 71, 76, 81, 86, 91, 96, 101, 106, 111, 116, 121
* 2, 7, 12, 17, 22, 27, 32, 37, 42, 47, 52, 57, 62, 67, 72, 77, 82, 87, 92, 97, 102, 107, 112, 117, 122
* 3, 8, 13, 18, 23, 28, 33, 38, 43, 48, 53, 58, 63, 68, 73, 78, 83, 88, 93, 98, 103, 108, 113, 118, 123
* 4, 9, 14, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 79, 84, 89, 94, 99, 104, 109, 114, 119, 124
* 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125
So what should I do to get the same table as on the website?
If you can think of a better way to get the table from this website, please tell me.
P.S. I'm learning R programming on my own, so please explain how it works!
Cheers.
Upvotes: 2
Views: 52
Reputation: 389135
Here's one way:
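library(rvest)
url <- "https://www.r-bloggers.com/2018/08/using-control-charts-in-r/"
url %>%
  read_html() %>%
  html_nodes('table') %>%
  .[[2]] %>%                                    # second table on the page
  html_table() %>%
  dplyr::pull(X1) %>%                           # column holding the table text
  stringr::str_extract_all('\\d+\\.\\d+') %>%   # every decimal number, in order
  .[[1]] %>%
  matrix(ncol = 5, byrow = TRUE) %>%            # fill 5 values per row
  as.data.frame() %>%
  type.convert(as.is = TRUE) -> tab             # character columns to numeric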
tab
#    V1   V2   V3   V4   V5
#1 1.45 1.56 1.40 1.45 1.33
#2 1.75 1.53 1.55 1.42 1.42
#3 1.60 1.41 1.35 1.52 1.36
#4 1.53 1.58 1.54 1.71 1.55
#5 1.48 1.34 1.64 1.59 1.46
#6 1.69 1.55 1.49 1.61 1.47
#...
#...
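As for why the original spread() call failed: after the cbind(), the data has only a Cat column and a value column, so nothing says which output row each value belongs to, and all 25 "V1" values end up sharing one key. Below is a minimal sketch of that fix on a small offline example, assuming tidyr >= 1.0.0 (where pivot_wider() supersedes spread()). The data frame long and the helper column row are names made up for illustration; the ten values are the first two rows of the output above.

library(dplyr)
library(tidyr)

# Toy stand-in for the scraped column: 10 values that belong in 2 rows of 5.
long <- data.frame(
  Cat = rep(c("V1", "V2", "V3", "V4", "V5"), times = 2),
  X1  = c(1.45, 1.56, 1.40, 1.45, 1.33,
          1.75, 1.53, 1.55, 1.42, 1.42)
)

# Add an id that increases every 5 values, so each (row, Cat) pair is unique,
# then pivot to wide format and drop the helper column.
wide <- long %>%
  mutate(row = (row_number() - 1) %/% 5 + 1) %>%
  pivot_wider(names_from = Cat, values_from = X1) %>%
  select(-row) %>%
  as.data.frame()

print(wide)

The same idea should carry over to the 125 scraped values, although the matrix(ncol = 5, byrow = TRUE) approach above is shorter because it never needs the Cat column at all.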
Upvotes: 2