bodega18
bodega18

Reputation: 654

Moving consistently spaced Rows to Columns R

Working with some Business Labor Statistics data (https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm). I have scraped the table from this url and am trying to get it into a tidy format. Here is a working example:

Commodity                       jan   feb   mar
Nonmetallic mineral products         
2020                            257.2 258.1 258.5   
2021                            262.6 263.4 264.4
Concrete ingredients
2020                            316.0 316.9 317.8
2021                            328.4 328.4 328.4
Construction gravel
2020                            359.2 360.7 362.1
2021                            375.0 374.7 374.1

How can I get the 2020 and 2021 rows into a "Year" column, and get jan, feb, mar, etc. into a "Month" column like below?

Commodity                     Month Year Value
Nonmetallic mineral products  jan   2020 257.2
Nonmetallic mineral products  feb   2020 258.1
Concrete ingredients          jan   2020 316.0
Concrete ingredients          jan   2021 328.4

Upvotes: 0

Views: 52

Answers (3)

QHarr
QHarr

Reputation: 84465

Again, doing the transformation whilst scraping: You could also use nth-child range css selectors to isolate nodes and combine. Then repeat retrieved elements for the necessary lengths. CSS selectors are a quick way of filtering data.


Some explanation:

This gathers the commodities list and extends it for the number of years * number of months:

Commodity = page %>% html_nodes("table#ro3ppiconcrete > tbody > tr:nth-child(3n+1)") %>% html_text(trim = T) %>%
    rep(each = length(months)) %>% rep(times = 2)

This joins all the Year 2 values (2021) under all the Year 1 (2020) values:

Value = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody > tr:nth-child(3n-1) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T)

This stacks the required number of repeats of Year 2 under those of Year 1

 Year = vctrs::vec_c(
    page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n+2) th") %>% html_text(trim = T) %>% rep(times = 2),
    page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) th") %>% html_text(trim = T) %>% rep(times = 2)
  ) %>% as.integer()

R:

library(rvest)
library(magrittr)

page <- read_html("https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm")

months <- page %>%
  html_nodes("table#ro3ppiconcrete tr:nth-child(2) > th:nth-child(n+2):nth-child(-n+13)") %>%
  html_text()

df <- data.frame(
  Commodity = page %>% html_nodes("table#ro3ppiconcrete > tbody > tr:nth-child(3n+1)") %>% html_text(trim = T) %>%
    rep(each = length(months)) %>% rep(times = 2),

  Year = vctrs::vec_c(
    page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n+2) th") %>% html_text(trim = T) %>% rep(times = 2),
    page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) th") %>% html_text(trim = T) %>% rep(times = 2)
  ) %>% as.integer(),

  Month = months,

  Value = vctrs::vec_c(
    page %>% html_nodes("table#ro3ppiconcrete tbody > tr:nth-child(3n-1) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T),
    page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T)
  )
)

# if you wish to remove the provisional flag from Value and have as numeric 
df$Value <- gsub('(p)', '',  df$Value, fixed = T) %>% as.double()

:nth-child()

The :nth-child() CSS pseudo-class matches elements based on their position in a group of siblings.

Functional notation

<An+B> Represents elements in a list whose indices match those found in a custom pattern of numbers, defined by An+B, where:

A is an integer step size,

B is an integer offset,

n is all nonnegative integers, starting from 0.

It can be read as the An+Bth element of a list.

Upvotes: 0

ktiu
ktiu

Reputation: 2626

Another strategy would be to transform the data while scraping:

library(rvest)
library(dpyr)

"https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm" %>%
  read_html() %>%
  html_table() %>%
  first() %>%
  set_names(.[1, ]) %>%
  tail(-1) %>%
  split(ifelse(str_detect(.$Commodity, "\\d{4}"),
               "data", "commodities")) %>%
  with(data %>%
         select(-`Historical data`) %>%
         mutate(Year = as.integer(Commodity),
                Commodity = commodities$Commodity %>%
                              head(nrow(commodities) - 1) %>%
                              rep(times = rep(2, length(.))),
                across(-c(Commodity, Year), readr::parse_number),
                .before = 1)) %>%
  pivot_longer(-c(Year, Commodity)) %>%
  transmute(Commodity, Year, Month = name, Value = value)

Result:

# A tibble: 754 x 4
   Commodity                     Year Month Value
   <chr>                        <int> <chr> <dbl>
 1 Nonmetallic mineral products  2020 Jan    257.
 2 Nonmetallic mineral products  2020 Feb    258.
 3 Nonmetallic mineral products  2020 Mar    258.
 4 Nonmetallic mineral products  2020 Apr    258.
 5 Nonmetallic mineral products  2020 May    257.
 6 Nonmetallic mineral products  2020 Jun    257.
 7 Nonmetallic mineral products  2020 Jul    257.
 8 Nonmetallic mineral products  2020 Aug    257
 9 Nonmetallic mineral products  2020 Sep    258.
10 Nonmetallic mineral products  2020 Oct    257.
# … with 744 more rows

Upvotes: 2

akrun
akrun

Reputation: 887221

We could use tidyverse to transform the data into the required format.

  1. Create a grouping column 'grp' by doing the cumulative sum (cumsum) of a logical vector i.e. presence of letters in the 'Commodity' column
  2. Use mutate to create the 'Year', by replaceing the first element as NA and modify the 'Commodity' by updating it with the first value
  3. Remove the first row with slice
  4. ungroup and reshape the data into long format with pivot_longer
library(dplyr)
library(stringr)
library(tidyr)
df1 %>% 
    group_by(grp = cumsum(str_detect(Commodity, "[A-Za-z]"))) %>% 
    mutate(Year = replace(Commodity,  1, NA), 
           Commodity = first(Commodity)) %>%     
    slice(-1) %>%
    ungroup %>% 
    select(-grp) %>%
    pivot_longer(cols = jan:mar, names_to = 'Month')

-output

# A tibble: 18 x 4
   Commodity                    Year  Month value
   <chr>                        <chr> <chr> <dbl>
 1 Nonmetallic mineral products 2020  jan    257.
 2 Nonmetallic mineral products 2020  feb    258.
 3 Nonmetallic mineral products 2020  mar    258.
 4 Nonmetallic mineral products 2021  jan    263.
 5 Nonmetallic mineral products 2021  feb    263.
 6 Nonmetallic mineral products 2021  mar    264.
 7 Concrete ingredients         2020  jan    316 
 8 Concrete ingredients         2020  feb    317.
 9 Concrete ingredients         2020  mar    318.
10 Concrete ingredients         2021  jan    328.
11 Concrete ingredients         2021  feb    328.
12 Concrete ingredients         2021  mar    328.
13 Construction gravel          2020  jan    359.
14 Construction gravel          2020  feb    361.
15 Construction gravel          2020  mar    362.
16 Construction gravel          2021  jan    375 
17 Construction gravel          2021  feb    375.
18 Construction gravel          2021  mar    374.

data

df1 <- structure(list(Commodity = c("Nonmetallic mineral products", 
"2020", "2021", "Concrete ingredients", "2020", "2021", "Construction gravel", 
"2020", "2021"), jan = c(NA, 257.2, 262.6, NA, 316, 328.4, NA, 
359.2, 375), feb = c(NA, 258.1, 263.4, NA, 316.9, 328.4, NA, 
360.7, 374.7), mar = c(NA, 258.5, 264.4, NA, 317.8, 328.4, NA, 
362.1, 374.1)), class = "data.frame", row.names = c(NA, -9L))

Upvotes: 3

Related Questions