Reputation: 111

Transpose multiple sub-headers into factor column in R

structure(list(
fecha = c("Fuente:La Nueva Viga, DF", "20/02/2020", 
           "20/02/2020", "20/02/2020", "20/02/2020", "Fuente:Monterrey, Nuevo León", 
           "20/02/2020", "20/02/2020", "20/02/2020", "20/02/2020", "17/02/2020", 
           "17/02/2020"), 
producto = c("Fuente:La Nueva Viga, DF", "Aleta de raya", 
           "Bandera", "Besugo", "Cazón con cabeza", "Fuente:Monterrey, Nuevo León", 
             "Huachinango Golfo", "Pampano", "Sargo", "Trucha marina", "Huachinango Golfo", 
            "Pampano"), origen = c("Fuente:La Nueva Viga, DF", "Tabasco", 
             "Campeche", "Veracruz", "Veracruz", "Fuente:Monterrey, Nuevo León", 
            "Tamaulipas", "Tamaulipas", "Tamaulipas", "Tamaulipas", "Tamaulipas", 
            "Tamaulipas"), 
pmin = c("Fuente:La Nueva Viga, DF", "23.00", 
             "35.00", "15.00", "60.00", "Fuente:Monterrey, Nuevo León", "165.00", 
             "--", "--", "--", "210.00", "--"), pmax = c("Fuente:La Nueva Viga, DF", 
             "27.00", "39.00", "19.00", "65.00", "Fuente:Monterrey, Nuevo León", 
             "200.00", "--", "--", "--", "220.00", "--"), 
pfrec = c("Fuente:La Nueva Viga, DF", 
             "25.00", "37.00", "17.00", "63.00", "Fuente:Monterrey, Nuevo León", 
             "190.00", "195.00", "84.00", "98.00", "215.00", "195.00"), 
obs = c("Fuente:La Nueva Viga, DF", 
             "", "", "", "", "Fuente:Monterrey, Nuevo León", "OBS", "OBS", 
             "OBS", "OBS", "OBS", "OBS"), 
category = c("pescado", "pescado", 
             "pescado", "pescado", "pescado", "pescado", "pescado", "pescado", 
             "pescado", "pescado", "pescado", "pescado")), 
row.names = c(2L, 3L, 4L, 5L, 6L, 341L, 342L, 343L, 344L, 345L, 346L, 347L), class = "data.frame")

The dataset above has 6 columns, but the table comes with a sub-header (e.g., Fuente: La Nueva Viga, DF). The full dataset is much larger (> 9000 rows); there are a different number of rows under each sub-header.

I would like to transpose the sub-headers to create a new column called "Fuente" that displays the text after the ":".

Due to the number of rows in the data.frame and inconsistency of the number of columns between each sub-header, I can't easily use rep() or something similar (or at least I can't figure out how).

An example of the output that I am looking for would be this:

  fecha          producto     origen   pmin   pmax  pfrec obs category                Fuente
1 20/02/2020     Aleta de raya    Tabasco  23.00  27.00  25.00      pescado     La Nueva Viga, DF
2 20/02/2020           Bandera   Campeche  35.00  39.00  37.00      pescado     La Nueva Viga, DF
3 20/02/2020            Besugo   Veracruz  15.00  19.00  17.00      pescado     La Nueva Viga, DF
4 20/02/2020  Cazón con cabeza   Veracruz  60.00  65.00  63.00      pescado     La Nueva Viga, DF
5 20/02/2020 Huachinango Golfo Tamaulipas 165.00 200.00 190.00 OBS  pescado Monterrey, Nuevo León
6 20/02/2020           Pampano Tamaulipas     --     -- 195.00 OBS  pescado Monterrey, Nuevo León

Upvotes: 2

Answers (3)

AndrewGB

Reputation: 16876

Here is one possibility (assuming that all sub-headers begin with Fuente) using tidyverse. Here, I create a grouping column (idx) by gathering all rows until Fuente appears in a subsequent row. Then, I split those into individual dataframes and put into a list. Then, I use map to apply a function to that list. I extract the text after Fuente, then copy that to all rows in that dataframe. Finally, I bind the list of dataframes back together.

Tidyverse

library(tidyverse)

df %>%
  group_by(idx = cumsum(str_detect(fecha, "Fuente"))) %>%
  group_split(., .keep = FALSE) %>%
  map(., function(x)
    x %>%
      mutate(Fuente = sub('.*:\\s*', "", fecha)[1]) %>%
      slice(-1)) %>%
  bind_rows()

Or if you do have other sub-headers besides Fuente, then you could use "[a-z]" in group_by(idx = cumsum(str_detect(fecha, "[a-z]")), rather than "Fuente".

Data.table

Another option using data.table:

setDT(dt)[, Fuente := ifelse(grepl(':', df$fecha, fixed = TRUE),
                             sub('.*:\\s*', "", df$fecha), NA)]
dt[, Fuente := Fuente[nafill(replace(.I, is.na(Fuente), NA), "locf")]]
dt <- dt[!grepl("Fuente", dt$fecha),]

Output

   fecha      producto          origen     pmin   pmax   pfrec  obs   category Fuente               
   <chr>      <chr>             <chr>      <chr>  <chr>  <chr>  <chr> <chr>    <chr>                
 1 20/02/2020 Aleta de raya     Tabasco    23.00  27.00  25.00  ""    pescado  La Nueva Viga, DF    
 2 20/02/2020 Bandera           Campeche   35.00  39.00  37.00  ""    pescado  La Nueva Viga, DF    
 3 20/02/2020 Besugo            Veracruz   15.00  19.00  17.00  ""    pescado  La Nueva Viga, DF    
 4 20/02/2020 Cazón con cabeza  Veracruz   60.00  65.00  63.00  ""    pescado  La Nueva Viga, DF    
 5 20/02/2020 Huachinango Golfo Tamaulipas 165.00 200.00 190.00 "OBS" pescado  Monterrey, Nuevo León
 6 20/02/2020 Pampano           Tamaulipas --     --     195.00 "OBS" pescado  Monterrey, Nuevo León
 7 20/02/2020 Sargo             Tamaulipas --     --     84.00  "OBS" pescado  Monterrey, Nuevo León
 8 20/02/2020 Trucha marina     Tamaulipas --     --     98.00  "OBS" pescado  Monterrey, Nuevo León
 9 17/02/2020 Huachinango Golfo Tamaulipas 210.00 220.00 215.00 "OBS" pescado  Monterrey, Nuevo León
10 17/02/2020 Pampano           Tamaulipas --     --     195.00 "OBS" pescado  Monterrey, Nuevo León

Benchmark

data.table is faster than any of the tidyverse options

Upvotes: 3

Yuriy Saraykin

Reputation: 8880

library(tidyverse)

df %>%
  mutate(grp = str_detect(string = fecha, pattern = ":"),
         fuente = ifelse(grp, sub('.*:', '', fecha), NA_real_)) %>%
  fill(fuente) %>%
  filter(!grp) %>%
  select(-grp)
#>         fecha          producto     origen   pmin   pmax  pfrec obs category
#> 1  20/02/2020     Aleta de raya    Tabasco  23.00  27.00  25.00      pescado
#> 2  20/02/2020           Bandera   Campeche  35.00  39.00  37.00      pescado
#> 3  20/02/2020            Besugo   Veracruz  15.00  19.00  17.00      pescado
#> 4  20/02/2020  Cazon con cabeza   Veracruz  60.00  65.00  63.00      pescado
#> 5  20/02/2020 Huachinango Golfo Tamaulipas 165.00 200.00 190.00 OBS  pescado
#> 6  20/02/2020           Pampano Tamaulipas     --     -- 195.00 OBS  pescado
#> 7  20/02/2020             Sargo Tamaulipas     --     --  84.00 OBS  pescado
#> 8  20/02/2020     Trucha marina Tamaulipas     --     --  98.00 OBS  pescado
#> 9  17/02/2020 Huachinango Golfo Tamaulipas 210.00 220.00 215.00 OBS  pescado
#> 10 17/02/2020           Pampano Tamaulipas     --     -- 195.00 OBS  pescado
#>                   fuente
#> 1      La Nueva Viga, DF
#> 2      La Nueva Viga, DF
#> 3      La Nueva Viga, DF
#> 4      La Nueva Viga, DF
#> 5  Monterrey, Nuevo Leon
#> 6  Monterrey, Nuevo Leon
#> 7  Monterrey, Nuevo Leon
#> 8  Monterrey, Nuevo Leon
#> 9  Monterrey, Nuevo Leon
#> 10 Monterrey, Nuevo Leon

^{Created on 2022-01-22 by the reprex package (v2.0.1)}

Upvotes: 2

TarJae

Reputation: 79271

Here is an alternative tidyverse approach, main points are adding a new column with add_column from tibble package and data wrangling:

filter all rows that contain Fuente
bind to original df to get equal column lengths!
add a new column by getting long form of df filtering and wrangling
remove with filter ! rows containing Fuente:

library(tidyverse)

df %>% 
  filter(if_any(everything(), ~str_detect(., "Fuente"))) %>% 
  bind_rows(df) %>% 
  add_column(df %>% 
               pivot_longer(everything(), values_to = "Fuente") %>% 
               filter(str_detect(Fuente, "Fuente")) %>% 
               mutate(Fuente = sub('.*:', '', Fuente)) %>% 
               select(-name)
             )%>% 
  filter(!if_any(everything(), ~str_detect(fecha, "Fuente:")))

        fecha          producto     origen   pmin   pmax  pfrec obs category                Fuente
1  20/02/2020     Aleta de raya    Tabasco  23.00  27.00  25.00      pescado     La Nueva Viga, DF
2  20/02/2020           Bandera   Campeche  35.00  39.00  37.00      pescado     La Nueva Viga, DF
3  20/02/2020            Besugo   Veracruz  15.00  19.00  17.00      pescado     La Nueva Viga, DF
4  20/02/2020  Cazón con cabeza   Veracruz  60.00  65.00  63.00      pescado     La Nueva Viga, DF
5  20/02/2020 Huachinango Golfo Tamaulipas 165.00 200.00 190.00 OBS  pescado Monterrey, Nuevo León
6  20/02/2020           Pampano Tamaulipas     --     -- 195.00 OBS  pescado Monterrey, Nuevo León
7  20/02/2020             Sargo Tamaulipas     --     --  84.00 OBS  pescado Monterrey, Nuevo León
8  20/02/2020     Trucha marina Tamaulipas     --     --  98.00 OBS  pescado Monterrey, Nuevo León
9  17/02/2020 Huachinango Golfo Tamaulipas 210.00 220.00 215.00 OBS  pescado Monterrey, Nuevo León
10 17/02/2020           Pampano Tamaulipas     --     -- 195.00 OBS  pescado Monterrey, Nuevo León

Upvotes: 2

Transpose multiple sub-headers into factor column in R

Answers (3)

Related Questions