Reputation: 797
Similar to this previous question I'm trying to transform a vector into a dataframe in R. I use this trick where I turn it into a matrix and then data frame, but the issue is that some rows potentially have a different number of columns, which throws out my data frame. There can be an arbitrary number of values per row (i.e. not necessarily 3 columns as in the examples), so I check first to work out how many columns I need.
For example, given the example data below, I get a neat data frame.
example <- c(
"col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c")
# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))
data.frame(matrix(example, ncol = ncols[1], byrow = T))
# X1 X2 X3
# 1 col-a col-b col-c
# 2 col-a col-b col-c
# 3 col-a col-b col-c
That's all well and good until I get a vector that has an extra value in one row (i.e. requires and extra column). For example:
example <- c("col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"WATCH OUT!",
"col-a",
"col-b",
"col-c")
# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))
data.frame(matrix(example, ncol = ncols[1], byrow = T))
# X1 X2 X3
# 1 col-a col-b col-c
# 2 col-a col-b col-c
# 3 WATCH OUT! col-a col-b
# 4 col-c col-a col-b
Whereas, what I really want is:
# X1 X2 X3 X4
# 1 col-a col-b col-c NA
# 2 col-a col-b col-c WATCH OUT!
# 3 col-a col-b col-c NA
I could deal with this with a double for loop after checking that there is an uneven number of elements between first column elements, but that's not even going to be close to the best option surely.
The additional complication is that the "extra" column could potentially be anywhere, not necessarily the last column.
Edit: The column ordering is actually arbitrary, so there's no reason why the extra column has to be in the middle, it could be appended at the end. The is one option I considered, to pull it out and just append it after padding with NA
afterwards. The text that should be in the same column are also delimited so it's clear where they belong. Have updated the example below.
Here is some more realistic example data and desired output:
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
# Desired output
X1 X2 X3 X4
1 name:start date:a NA value:b
2 name:start date:c desc:WATCH OUT! value:d
3 name:start date:e NA value:f
What would be the fastest way to process this?
Thanks in advance!
EDIT: the "blocks" that turn into rows are well defined, so the start and end of a block are clear and finding the size of a block isn't hard, hence my diff(grep(...))
command earlier (could also use dist()
for similar result). The WATCH OUT! Text can be arbitrary though, so it's not as simple as searching for WATCH OUT!.
Upvotes: 1
Views: 612
Reputation: 26238
I am not sure, if the output in this format, is useful
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
separate(dummy, into=c("name", 'value'), sep = '\\:') %>%
mutate(rowid = cumsum(name == first(name))) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = value)
#> # A tibble: 3 x 5
#> rowid name date value desc
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 start a b <NA>
#> 2 2 start c d WATCH OUT!
#> 3 3 start e f <NA>
OR perhaps this?
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
separate(dummy, into=c("name", 'value'), sep = '\\:', remove = F) %>%
mutate(rowid = cumsum(name == first(name))) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 3 x 5
#> rowid name date value desc
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 name:start date:a value:b <NA>
#> 2 2 name:start date:c value:d desc:WATCH OUT!
#> 3 3 name:start date:e value:f <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)
For your first example, you could do
``` r
example <- c("col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"WATCH OUT!",
"col-a",
"col-b",
"col-c")
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
group_by(rowid = cumsum(dummy == first(dummy))) %>%
mutate(name = paste0('X', row_number())) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 3 x 5
#> # Groups: rowid [3]
#> rowid X1 X2 X3 X4
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 col-a col-b col-c <NA>
#> 2 2 col-a col-b col-c WATCH OUT!
#> 3 3 col-a col-b col-c <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)
Upvotes: 1
Reputation: 7116
Is this useful?
library(tidyverse)
library(rebus)
#>
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#>
#> regex
#> The following object is masked from 'package:ggplot2':
#>
#> alpha
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
example_dirty <- example #i will use it at the end of the script for replacing
custom_pattern <- rebus::or('name:.*', 'date:.', 'value:.')
alien_text_index <- str_detect(example, pattern = custom_pattern) %>%
as.character()
replacement <- which(alien_text_index == 'FALSE') %>%
`/`(., 3) %>% #in this case every three rows the repetition should start over.
round() #round for getting an index to modify
example <- str_match(example , pattern = custom_pattern) %>% keep(~!is.na(.))
df <- c('name:.*', 'date:.', 'value:.') %>%
map(~example[str_detect(example, .x)]) %>% reduce(bind_cols) %>%
mutate(..4 = '')
#> New names:
#> * NA -> ...1
#> * NA -> ...2
#> New names:
#> * NA -> ...3
for (i in length(replacement)) {
df[replacement[i], 4] <- example_dirty[!as.logical(alien_text_index)][i]
}
df
#> # A tibble: 3 x 4
#> ...1 ...2 ...3 ..4
#> <chr> <chr> <chr> <chr>
#> 1 name:start date:a value:b ""
#> 2 name:start date:c value:d "desc:WATCH OUT!"
#> 3 name:start date:e value:f ""
Created on 2021-05-29 by the reprex package (v2.0.0)
Upvotes: 1