Sam Rogers
Sam Rogers

Reputation: 797

Fastest way to pad a dataframe with uneven columns

Similar to this previous question I'm trying to transform a vector into a dataframe in R. I use this trick where I turn it into a matrix and then data frame, but the issue is that some rows potentially have a different number of columns, which throws out my data frame. There can be an arbitrary number of values per row (i.e. not necessarily 3 columns as in the examples), so I check first to work out how many columns I need.

For example, given the example data below, I get a neat data frame.

example <- c(
"col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c")

# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))

data.frame(matrix(example, ncol = ncols[1], byrow = T))

#      X1    X2    X3
# 1 col-a col-b col-c
# 2 col-a col-b col-c
# 3 col-a col-b col-c

That's all well and good until I get a vector that has an extra value in one row (i.e. requires and extra column). For example:

example <- c("col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"WATCH OUT!",
"col-a",
"col-b",
"col-c")

# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))

data.frame(matrix(example, ncol = ncols[1], byrow = T))

#           X1    X2    X3
# 1      col-a col-b col-c
# 2      col-a col-b col-c
# 3 WATCH OUT! col-a col-b
# 4      col-c col-a col-b

Whereas, what I really want is:

#           X1    X2    X3         X4
# 1      col-a col-b col-c         NA
# 2      col-a col-b col-c WATCH OUT!
# 3      col-a col-b col-c         NA

I could deal with this with a double for loop after checking that there is an uneven number of elements between first column elements, but that's not even going to be close to the best option surely.

The additional complication is that the "extra" column could potentially be anywhere, not necessarily the last column. Edit: The column ordering is actually arbitrary, so there's no reason why the extra column has to be in the middle, it could be appended at the end. The is one option I considered, to pull it out and just append it after padding with NA afterwards. The text that should be in the same column are also delimited so it's clear where they belong. Have updated the example below.

Here is some more realistic example data and desired output:

example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")

# Desired output
     X1                         X2                               X3           X4
1 name:start     date:a                              NA   value:b  
2 name:start     date:c  desc:WATCH OUT!   value:d 
3 name:start     date:e                              NA   value:f 

What would be the fastest way to process this?

Thanks in advance!

EDIT: the "blocks" that turn into rows are well defined, so the start and end of a block are clear and finding the size of a block isn't hard, hence my diff(grep(...)) command earlier (could also use dist() for similar result). The WATCH OUT! Text can be arbitrary though, so it's not as simple as searching for WATCH OUT!.

Upvotes: 1

Views: 612

Answers (2)

AnilGoyal
AnilGoyal

Reputation: 26238

I am not sure, if the output in this format, is useful

example <- c("name:start",
             "date:a",
             "value:b",
             "name:start",
             "date:c",
             "desc:WATCH OUT!",
             "value:d",
             "name:start",
             "date:e",
             "value:f")
library(tidyverse)

example %>% as.data.frame() %>% setNames('dummy') %>%
  separate(dummy, into=c("name", 'value'), sep = '\\:') %>%
  mutate(rowid = cumsum(name == first(name))) %>%
  pivot_wider(id_cols = rowid, names_from = name, values_from = value)

#> # A tibble: 3 x 5
#>   rowid name  date  value desc      
#>   <int> <chr> <chr> <chr> <chr>     
#> 1     1 start a     b     <NA>      
#> 2     2 start c     d     WATCH OUT!
#> 3     3 start e     f     <NA>

OR perhaps this?


library(tidyverse)

example %>% as.data.frame() %>% setNames('dummy') %>%
  separate(dummy, into=c("name", 'value'), sep = '\\:', remove = F) %>%
  mutate(rowid = cumsum(name == first(name))) %>%
  pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 3 x 5
#>   rowid name       date   value   desc           
#>   <int> <chr>      <chr>  <chr>   <chr>          
#> 1     1 name:start date:a value:b <NA>           
#> 2     2 name:start date:c value:d desc:WATCH OUT!
#> 3     3 name:start date:e value:f <NA>

Created on 2021-05-30 by the reprex package (v2.0.0)


For your first example, you could do

``` r
example <- c("col-a",
             "col-b",
             "col-c",
             "col-a",
             "col-b",
             "col-c",
             "WATCH OUT!",
             "col-a",
             "col-b",
             "col-c")
library(tidyverse)

example %>% as.data.frame() %>% setNames('dummy') %>%
  group_by(rowid = cumsum(dummy == first(dummy))) %>%
  mutate(name = paste0('X', row_number())) %>%
  pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)

#> # A tibble: 3 x 5
#> # Groups:   rowid [3]
#>   rowid X1    X2    X3    X4        
#>   <int> <chr> <chr> <chr> <chr>     
#> 1     1 col-a col-b col-c <NA>      
#> 2     2 col-a col-b col-c WATCH OUT!
#> 3     3 col-a col-b col-c <NA>

Created on 2021-05-30 by the reprex package (v2.0.0)

Upvotes: 1

jpdugo17
jpdugo17

Reputation: 7116

Is this useful?

library(tidyverse)
library(rebus)
#> 
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#> 
#>     regex
#> The following object is masked from 'package:ggplot2':
#> 
#>     alpha

example <- c("name:start",
             "date:a",
             "value:b",
             "name:start",
             "date:c",
             "desc:WATCH OUT!",
             "value:d",
             "name:start",
             "date:e",
             "value:f")

example_dirty <- example #i will use it at the end of the script for replacing

custom_pattern <- rebus::or('name:.*', 'date:.', 'value:.') 


alien_text_index <- str_detect(example, pattern = custom_pattern) %>%
    as.character() 
replacement <- which(alien_text_index == 'FALSE') %>%
    `/`(., 3) %>% #in this case every three rows the repetition should start over.
    round() #round for getting an index to modify



example <- str_match(example , pattern = custom_pattern) %>% keep(~!is.na(.))

df <- c('name:.*', 'date:.', 'value:.') %>% 
    map(~example[str_detect(example, .x)])  %>% reduce(bind_cols) %>%
    mutate(..4 = '')
#> New names:
#> * NA -> ...1
#> * NA -> ...2
#> New names:
#> * NA -> ...3





for (i in length(replacement)) {
    df[replacement[i], 4] <- example_dirty[!as.logical(alien_text_index)][i]
}

df
#> # A tibble: 3 x 4
#>   ...1       ...2   ...3    ..4              
#>   <chr>      <chr>  <chr>   <chr>            
#> 1 name:start date:a value:b ""               
#> 2 name:start date:c value:d "desc:WATCH OUT!"
#> 3 name:start date:e value:f ""

Created on 2021-05-29 by the reprex package (v2.0.0)

Upvotes: 1

Related Questions