dbo
dbo

Reputation: 1234

how to scrape text from a HTML body

I've never scraped. Would it be straightforward to scrape the text in the main, big gray box only from the link below (starting with header SRUS43 KMSR 271039, ending with .END)? My end goal is to basically have three tidy columns of data from all that text: the five digit codes, the values in inches, and the basin elevation descriptions, so any pointers with processing the text format are welcome, too.

https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6

thank you for any help.

Upvotes: 0

Views: 597

Answers (2)

JasonAizkalns
JasonAizkalns

Reputation: 20483

Reading in the text is fairly easy (see @DiceBoyT answer). Cleaning up the format for three columns is a bit more involved. Below could use some clean-up (especially with the regex), but it gets the job done:

library(tidyverse)
library(rvest)

text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>% 
  html_node(".notes") %>% 
  html_text() 

df <- tibble(txt = read_lines(text))

df %>%
  mutate(
    row = row_number(),
    with_code = str_extract(txt, "^[A-z0-9]{5}\\s+\\d+(\\.)?\\d"),
    wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>% str_extract("[:digit:]+\\.?[:digit:]"),
    basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>% str_sub(start = 2)
  ) %>% 
  separate(with_code, c("code", "val"), sep = "\\s+") %>% 
  mutate(
    combined_val = case_when(
      !is.na(val) ~ val,
      !is.na(wo_code) ~ wo_code,
      TRUE ~ NA_character_
    ) %>% as.numeric
  ) %>%
  filter(!is.na(combined_val)) %>%
  mutate(
    code = zoo::na.locf(code),
    basin_desc = zoo::na.locf(basin_desc)
  ) %>%
  select(
    code, combined_val, basin_desc
  )
#> # A tibble: 643 x 3
#>    code  combined_val basin_desc               
#>    <chr>        <dbl> <chr>                    
#>  1 ACSC1          0   San Antonio Ck - Sunol   
#>  2 ADLC1          0   Arroyo De La Laguna      
#>  3 ADOC1          0   Santa Ana R - Prado Dam  
#>  4 AHOC1          0   Arroyo Honda nr San Jose 
#>  5 AKYC1         41   SF American nr Kyburz    
#>  6 AKYC1          3.2 SF American nr Kyburz    
#>  7 AKYC1         42.2 SF American nr Kyburz    
#>  8 ALQC1          0   Alamo Canal nr Pleasanton
#>  9 ALRC1          0   Alamitos Ck - Almaden Res
#> 10 ANDC1          0   Coyote Ck - Anderson Res 
#> # ... with 633 more rows

Created on 2019-03-27 by the reprex package (v0.2.1)

Upvotes: 2

dave-edison
dave-edison

Reputation: 3736

This is pretty straightforward to scrape with rvest:

library(rvest)

text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>% 
  html_node(".notes") %>% 
  html_text() 

Upvotes: 1

Related Questions