Reputation: 111
structure(list(
fecha = c("Fuente:La Nueva Viga, DF", "20/02/2020",
"20/02/2020", "20/02/2020", "20/02/2020", "Fuente:Monterrey, Nuevo León",
"20/02/2020", "20/02/2020", "20/02/2020", "20/02/2020", "17/02/2020",
"17/02/2020"),
producto = c("Fuente:La Nueva Viga, DF", "Aleta de raya",
"Bandera", "Besugo", "Cazón con cabeza", "Fuente:Monterrey, Nuevo León",
"Huachinango Golfo", "Pampano", "Sargo", "Trucha marina", "Huachinango Golfo",
"Pampano"), origen = c("Fuente:La Nueva Viga, DF", "Tabasco",
"Campeche", "Veracruz", "Veracruz", "Fuente:Monterrey, Nuevo León",
"Tamaulipas", "Tamaulipas", "Tamaulipas", "Tamaulipas", "Tamaulipas",
"Tamaulipas"),
pmin = c("Fuente:La Nueva Viga, DF", "23.00",
"35.00", "15.00", "60.00", "Fuente:Monterrey, Nuevo León", "165.00",
"--", "--", "--", "210.00", "--"), pmax = c("Fuente:La Nueva Viga, DF",
"27.00", "39.00", "19.00", "65.00", "Fuente:Monterrey, Nuevo León",
"200.00", "--", "--", "--", "220.00", "--"),
pfrec = c("Fuente:La Nueva Viga, DF",
"25.00", "37.00", "17.00", "63.00", "Fuente:Monterrey, Nuevo León",
"190.00", "195.00", "84.00", "98.00", "215.00", "195.00"),
obs = c("Fuente:La Nueva Viga, DF",
"", "", "", "", "Fuente:Monterrey, Nuevo León", "OBS", "OBS",
"OBS", "OBS", "OBS", "OBS"),
category = c("pescado", "pescado",
"pescado", "pescado", "pescado", "pescado", "pescado", "pescado",
"pescado", "pescado", "pescado", "pescado")),
row.names = c(2L, 3L, 4L, 5L, 6L, 341L, 342L, 343L, 344L, 345L, 346L, 347L), class = "data.frame")
The dataset above has 6 columns, but the table comes with a sub-header (e.g., Fuente: La Nueva Viga, DF
).
The full dataset is much larger (> 9000 rows); there are a different number of rows under each sub-header.
I would like to transpose the sub-headers to create a new column called "Fuente" that displays the text after the ":".
Due to the number of rows in the data.frame and inconsistency of the number of columns between each sub-header, I can't easily use rep()
or something similar (or at least I can't figure out how).
An example of the output that I am looking for would be this:
fecha producto origen pmin pmax pfrec obs category Fuente
1 20/02/2020 Aleta de raya Tabasco 23.00 27.00 25.00 pescado La Nueva Viga, DF
2 20/02/2020 Bandera Campeche 35.00 39.00 37.00 pescado La Nueva Viga, DF
3 20/02/2020 Besugo Veracruz 15.00 19.00 17.00 pescado La Nueva Viga, DF
4 20/02/2020 Cazón con cabeza Veracruz 60.00 65.00 63.00 pescado La Nueva Viga, DF
5 20/02/2020 Huachinango Golfo Tamaulipas 165.00 200.00 190.00 OBS pescado Monterrey, Nuevo León
6 20/02/2020 Pampano Tamaulipas -- -- 195.00 OBS pescado Monterrey, Nuevo León
Upvotes: 2
Views: 217
Reputation: 16876
Here is one possibility (assuming that all sub-headers begin with Fuente
) using tidyverse
. Here, I create a grouping column (idx
) by gathering all rows until Fuente
appears in a subsequent row. Then, I split those into individual dataframes and put into a list. Then, I use map
to apply a function to that list. I extract the text after Fuente
, then copy that to all rows in that dataframe. Finally, I bind the list of dataframes back together.
Tidyverse
library(tidyverse)
df %>%
group_by(idx = cumsum(str_detect(fecha, "Fuente"))) %>%
group_split(., .keep = FALSE) %>%
map(., function(x)
x %>%
mutate(Fuente = sub('.*:\\s*', "", fecha)[1]) %>%
slice(-1)) %>%
bind_rows()
Or if you do have other sub-headers besides Fuente
, then you could use "[a-z]" in group_by(idx = cumsum(str_detect(fecha, "[a-z]"))
, rather than "Fuente".
Data.table
Another option using data.table
:
setDT(dt)[, Fuente := ifelse(grepl(':', df$fecha, fixed = TRUE),
sub('.*:\\s*', "", df$fecha), NA)]
dt[, Fuente := Fuente[nafill(replace(.I, is.na(Fuente), NA), "locf")]]
dt <- dt[!grepl("Fuente", dt$fecha),]
Output
fecha producto origen pmin pmax pfrec obs category Fuente
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 20/02/2020 Aleta de raya Tabasco 23.00 27.00 25.00 "" pescado La Nueva Viga, DF
2 20/02/2020 Bandera Campeche 35.00 39.00 37.00 "" pescado La Nueva Viga, DF
3 20/02/2020 Besugo Veracruz 15.00 19.00 17.00 "" pescado La Nueva Viga, DF
4 20/02/2020 Cazón con cabeza Veracruz 60.00 65.00 63.00 "" pescado La Nueva Viga, DF
5 20/02/2020 Huachinango Golfo Tamaulipas 165.00 200.00 190.00 "OBS" pescado Monterrey, Nuevo León
6 20/02/2020 Pampano Tamaulipas -- -- 195.00 "OBS" pescado Monterrey, Nuevo León
7 20/02/2020 Sargo Tamaulipas -- -- 84.00 "OBS" pescado Monterrey, Nuevo León
8 20/02/2020 Trucha marina Tamaulipas -- -- 98.00 "OBS" pescado Monterrey, Nuevo León
9 17/02/2020 Huachinango Golfo Tamaulipas 210.00 220.00 215.00 "OBS" pescado Monterrey, Nuevo León
10 17/02/2020 Pampano Tamaulipas -- -- 195.00 "OBS" pescado Monterrey, Nuevo León
Benchmark
data.table
is faster than any of the tidyverse
options
Upvotes: 3
Reputation: 8880
library(tidyverse)
df %>%
mutate(grp = str_detect(string = fecha, pattern = ":"),
fuente = ifelse(grp, sub('.*:', '', fecha), NA_real_)) %>%
fill(fuente) %>%
filter(!grp) %>%
select(-grp)
#> fecha producto origen pmin pmax pfrec obs category
#> 1 20/02/2020 Aleta de raya Tabasco 23.00 27.00 25.00 pescado
#> 2 20/02/2020 Bandera Campeche 35.00 39.00 37.00 pescado
#> 3 20/02/2020 Besugo Veracruz 15.00 19.00 17.00 pescado
#> 4 20/02/2020 Cazon con cabeza Veracruz 60.00 65.00 63.00 pescado
#> 5 20/02/2020 Huachinango Golfo Tamaulipas 165.00 200.00 190.00 OBS pescado
#> 6 20/02/2020 Pampano Tamaulipas -- -- 195.00 OBS pescado
#> 7 20/02/2020 Sargo Tamaulipas -- -- 84.00 OBS pescado
#> 8 20/02/2020 Trucha marina Tamaulipas -- -- 98.00 OBS pescado
#> 9 17/02/2020 Huachinango Golfo Tamaulipas 210.00 220.00 215.00 OBS pescado
#> 10 17/02/2020 Pampano Tamaulipas -- -- 195.00 OBS pescado
#> fuente
#> 1 La Nueva Viga, DF
#> 2 La Nueva Viga, DF
#> 3 La Nueva Viga, DF
#> 4 La Nueva Viga, DF
#> 5 Monterrey, Nuevo Leon
#> 6 Monterrey, Nuevo Leon
#> 7 Monterrey, Nuevo Leon
#> 8 Monterrey, Nuevo Leon
#> 9 Monterrey, Nuevo Leon
#> 10 Monterrey, Nuevo Leon
Created on 2022-01-22 by the reprex package (v2.0.1)
Upvotes: 2
Reputation: 79271
Here is an alternative tidyverse
approach, main points are adding a new column with add_column
from tibble
package and data wrangling:
Fuente
!
rows containing Fuente:
library(tidyverse)
df %>%
filter(if_any(everything(), ~str_detect(., "Fuente"))) %>%
bind_rows(df) %>%
add_column(df %>%
pivot_longer(everything(), values_to = "Fuente") %>%
filter(str_detect(Fuente, "Fuente")) %>%
mutate(Fuente = sub('.*:', '', Fuente)) %>%
select(-name)
)%>%
filter(!if_any(everything(), ~str_detect(fecha, "Fuente:")))
fecha producto origen pmin pmax pfrec obs category Fuente
1 20/02/2020 Aleta de raya Tabasco 23.00 27.00 25.00 pescado La Nueva Viga, DF
2 20/02/2020 Bandera Campeche 35.00 39.00 37.00 pescado La Nueva Viga, DF
3 20/02/2020 Besugo Veracruz 15.00 19.00 17.00 pescado La Nueva Viga, DF
4 20/02/2020 Cazón con cabeza Veracruz 60.00 65.00 63.00 pescado La Nueva Viga, DF
5 20/02/2020 Huachinango Golfo Tamaulipas 165.00 200.00 190.00 OBS pescado Monterrey, Nuevo León
6 20/02/2020 Pampano Tamaulipas -- -- 195.00 OBS pescado Monterrey, Nuevo León
7 20/02/2020 Sargo Tamaulipas -- -- 84.00 OBS pescado Monterrey, Nuevo León
8 20/02/2020 Trucha marina Tamaulipas -- -- 98.00 OBS pescado Monterrey, Nuevo León
9 17/02/2020 Huachinango Golfo Tamaulipas 210.00 220.00 215.00 OBS pescado Monterrey, Nuevo León
10 17/02/2020 Pampano Tamaulipas -- -- 195.00 OBS pescado Monterrey, Nuevo León
Upvotes: 2