Reputation: 654
Working with some Business Labor Statistics data (https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm). I have scraped the table from this url and am trying to get it into a tidy format. Here is a working example:
Commodity jan feb mar
Nonmetallic mineral products
2020 257.2 258.1 258.5
2021 262.6 263.4 264.4
Concrete ingredients
2020 316.0 316.9 317.8
2021 328.4 328.4 328.4
Construction gravel
2020 359.2 360.7 362.1
2021 375.0 374.7 374.1
How can I get the 2020 and 2021 rows into a "Year" column, and get jan, feb, mar, etc. into a "Month" column like below?
Commodity Month Year Value
Nonmetallic mineral products jan 2020 257.2
Nonmetallic mineral products feb 2020 258.1
Concrete ingredients jan 2020 316.0
Concrete ingredients jan 2021 328.4
Upvotes: 0
Views: 52
Reputation: 84465
Again, doing the transformation whilst scraping: You could also use nth-child range
css selectors to isolate nodes and combine. Then repeat retrieved elements for the necessary lengths. CSS selectors are a quick way of filtering data.
Some explanation:
This gathers the commodities list and extends it for the number of years * number of months:
Commodity = page %>% html_nodes("table#ro3ppiconcrete > tbody > tr:nth-child(3n+1)") %>% html_text(trim = T) %>%
rep(each = length(months)) %>% rep(times = 2)
This joins all the Year 2 values (2021) under all the Year 1 (2020) values:
Value = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody > tr:nth-child(3n-1) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T)
This stacks the required number of repeats of Year 2 under those of Year 1
Year = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n+2) th") %>% html_text(trim = T) %>% rep(times = 2),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) th") %>% html_text(trim = T) %>% rep(times = 2)
) %>% as.integer()
R:
library(rvest)
library(magrittr)
page <- read_html("https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm")
months <- page %>%
html_nodes("table#ro3ppiconcrete tr:nth-child(2) > th:nth-child(n+2):nth-child(-n+13)") %>%
html_text()
df <- data.frame(
Commodity = page %>% html_nodes("table#ro3ppiconcrete > tbody > tr:nth-child(3n+1)") %>% html_text(trim = T) %>%
rep(each = length(months)) %>% rep(times = 2),
Year = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n+2) th") %>% html_text(trim = T) %>% rep(times = 2),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) th") %>% html_text(trim = T) %>% rep(times = 2)
) %>% as.integer(),
Month = months,
Value = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody > tr:nth-child(3n-1) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T)
)
)
# if you wish to remove the provisional flag from Value and have as numeric
df$Value <- gsub('(p)', '', df$Value, fixed = T) %>% as.double()
:nth-child()
The :nth-child() CSS pseudo-class matches elements based on their position in a group of siblings.
Functional notation
<An+B> Represents elements in a list whose indices match those found in a custom pattern of numbers, defined by An+B, where:
A is an integer step size,
B is an integer offset,
n is all nonnegative integers, starting from 0.
It can be read as the An+Bth element of a list.
Upvotes: 0
Reputation: 2626
Another strategy would be to transform the data while scraping:
library(rvest)
library(dpyr)
"https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm" %>%
read_html() %>%
html_table() %>%
first() %>%
set_names(.[1, ]) %>%
tail(-1) %>%
split(ifelse(str_detect(.$Commodity, "\\d{4}"),
"data", "commodities")) %>%
with(data %>%
select(-`Historical data`) %>%
mutate(Year = as.integer(Commodity),
Commodity = commodities$Commodity %>%
head(nrow(commodities) - 1) %>%
rep(times = rep(2, length(.))),
across(-c(Commodity, Year), readr::parse_number),
.before = 1)) %>%
pivot_longer(-c(Year, Commodity)) %>%
transmute(Commodity, Year, Month = name, Value = value)
Result:
# A tibble: 754 x 4
Commodity Year Month Value
<chr> <int> <chr> <dbl>
1 Nonmetallic mineral products 2020 Jan 257.
2 Nonmetallic mineral products 2020 Feb 258.
3 Nonmetallic mineral products 2020 Mar 258.
4 Nonmetallic mineral products 2020 Apr 258.
5 Nonmetallic mineral products 2020 May 257.
6 Nonmetallic mineral products 2020 Jun 257.
7 Nonmetallic mineral products 2020 Jul 257.
8 Nonmetallic mineral products 2020 Aug 257
9 Nonmetallic mineral products 2020 Sep 258.
10 Nonmetallic mineral products 2020 Oct 257.
# … with 744 more rows
Upvotes: 2
Reputation: 887221
We could use tidyverse
to transform the data into the required format.
cumsum
) of a logical vector i.e. presence of letters in the 'Commodity' columnmutate
to create the 'Year', by replace
ing the first element as NA
and modify the 'Commodity' by updating it with the first
valueslice
ungroup
and reshape the data into long format with pivot_longer
library(dplyr)
library(stringr)
library(tidyr)
df1 %>%
group_by(grp = cumsum(str_detect(Commodity, "[A-Za-z]"))) %>%
mutate(Year = replace(Commodity, 1, NA),
Commodity = first(Commodity)) %>%
slice(-1) %>%
ungroup %>%
select(-grp) %>%
pivot_longer(cols = jan:mar, names_to = 'Month')
-output
# A tibble: 18 x 4
Commodity Year Month value
<chr> <chr> <chr> <dbl>
1 Nonmetallic mineral products 2020 jan 257.
2 Nonmetallic mineral products 2020 feb 258.
3 Nonmetallic mineral products 2020 mar 258.
4 Nonmetallic mineral products 2021 jan 263.
5 Nonmetallic mineral products 2021 feb 263.
6 Nonmetallic mineral products 2021 mar 264.
7 Concrete ingredients 2020 jan 316
8 Concrete ingredients 2020 feb 317.
9 Concrete ingredients 2020 mar 318.
10 Concrete ingredients 2021 jan 328.
11 Concrete ingredients 2021 feb 328.
12 Concrete ingredients 2021 mar 328.
13 Construction gravel 2020 jan 359.
14 Construction gravel 2020 feb 361.
15 Construction gravel 2020 mar 362.
16 Construction gravel 2021 jan 375
17 Construction gravel 2021 feb 375.
18 Construction gravel 2021 mar 374.
df1 <- structure(list(Commodity = c("Nonmetallic mineral products",
"2020", "2021", "Concrete ingredients", "2020", "2021", "Construction gravel",
"2020", "2021"), jan = c(NA, 257.2, 262.6, NA, 316, 328.4, NA,
359.2, 375), feb = c(NA, 258.1, 263.4, NA, 316.9, 328.4, NA,
360.7, 374.7), mar = c(NA, 258.5, 264.4, NA, 317.8, 328.4, NA,
362.1, 374.1)), class = "data.frame", row.names = c(NA, -9L))
Upvotes: 3