Reputation: 101

Create table from wrapped text in R

Edited:

From text based on variable named a I would like to obtain a table in which description cell will be unwrapped.

a <- 
  "
   category     variable    description                    value
   A            A           This is variable named as A    123
                            which is responsible for sth
                B           This is variable named as B    222.1 
                            which is responsible for sth
                            else
   B            A           This is sth                    2
                                                           out of 4  
   Other c      Other va    This is variable named as      222  
   ategory      riable      other variable which can be   
                            nullable
                Other va    This is variable named as      0  
                riable A    other variable A which can   
                            be nullable
  "

Result which I want to have:

Upvotes: 2

Answers (4)

mario19088

Reputation: 101

I found a way using the idea of @wibeasley in which I generate record_id and then group by this value.

df <- readr::read_fwf(file = a, col_positions = column_widths, skip = 2)

df <- data.table(df)
df[, notnullvar := is.na(variable)]
df[, notnullval := is.na(value)]
df$record_id = 1
for(i in 2:nrow(df)){ #not empty variable cell and not empty value then add +1 to record_id   
  if(df[i,]$notnullvar == FALSE & df[i,]$notnullval == FALSE){
    df[i,]$record_id <- df[i-1,]$record_id + 1
  }else{
    df[i,]$record_id <- df[i-1,]$record_id
  }
}
df <- df[, .(category = paste0(category, collapse = ""), 
             variable = paste0(variable, collapse = ""),
             description = paste0(description, collapse = " "),
             value = paste0(value, collapse = " ")
), by = record_id]

df[,2:5] <- lapply(df[,2:5], function(record_id) trimws(gsub("NA","",record_id)))

The problem was that I would like to scrape table from pdf files in which tables could differ (column widths or wrapped text in various format). Therefore, it would be great to prepare dynamic column width which always contains 4 columns. I think I can find the length of string to the name of next column name for example category variable string contains two columns category and variable and category width is 14 and variable 10.

Upvotes: 0

wibeasley

Reputation: 5287

I'm stumped, now that the example has changed so all cells can wrap to new lines. Hopefully @stefan has an idea.

A few rough suggestions.

Can you control how the text is created? Your OP was so thorough that I assume you explored & eliminated that possibility. But if you have control (such as widening the output parameters), that's the easiest and most direct.
I think you need at least one column to indicate when a new line is a new record, and when a new line is a continuation of the same record. In the first iteration of your OP, the variable column signaled this.
This new indicator variable (called record_id) would have the values {1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5}. See how lines 1 & 2 belong to record 1, and lines 3-5 belong to record 2.
You might programmatically modify the a string somehow. Maybe start with readr::read_lines() and use a regex to indicate what lines represent new records. Then use a typical readr function to convert array of strings to a data.frame.. We'd need to know more about how regular/repeatable the values are for the first two columns. I see you were considering this initially, given your [gsub] tag.
Or read everything with as a fwf first, and use rematch2 to create the variable to indicate whether the record is new or a continuation. It all depends if there's a good pattern to the wrapping cells of variable.
Does a represent a file on disk? How consistent are the line wraps? If record_id is the same for all files, then you could manually determine record_id once, and add it to each dataset. Then proceed similarly to the answers by @stefan and me.

Upvotes: 1

wibeasley

Reputation: 5287

This is similar to @stefans. The main difference is this way requires you to specify column_widths with readr::fwf_cols(). (Which may be an advantage or disadvantage, depending on the consistency/stability of your incoming data files.)

a <- 
"category     variable    description                    value
A            A           This is variable named as A    123
                         which is responsible for sth
             B           This is variable named as B    222.1 
                         which is responsible for sth
                         else
"
column_widths <-
  readr::fwf_cols(
    category        = 13,
    variable        = 8,
    description     = 32,
    value           = 10
  )

I(a) |> 
  readr::read_fwf(
    col_positions = column_widths,
    skip          = 1         # Because the headers are defined in `column_widths`
  ) |> 
  tidyr::fill(category, variable) |> 
  dplyr::mutate(
    value   = as.character(value),
    value   = dplyr::coalesce(value, "")
  ) |> 
  dplyr::group_by(category, variable) |> 
  dplyr::summarize(
    description = paste0(description, collapse = " "), 
    value       = as.numeric(paste0(value, collapse = " ")), 
  ) |> 
  dplyr::ungroup()

Output:

# A tibble: 2 x 4
  category variable description                 value
  <chr>    <chr>    <chr>                       <dbl>
1 A        A        This is variable named as ~  123 
2 A        B        This is variable named as ~  222.

Upvotes: 1

stefan

Reputation: 124038

One option to achieve your desired result would be read your variable as a fixed width file using e.g. readr::read_fwf and some additional data wrangling steps where I make use of tidyr and dplyr:

library(dplyr)
library(tidyr)
library(readr)

df <- readr::read_fwf(file = a, skip = 1)
names(df) <- unlist(df[1, ])
df <- df[-1,]
df %>% 
  filter(!is.na(description)) %>% 
  tidyr::fill(category, variable) %>% 
  group_by(category, variable) %>% 
  summarise(description = paste(description, collapse = " "), value = value[!is.na(value)])
#> `summarise()` has grouped output by 'category'. You can override using the `.groups` argument.
#> # A tibble: 2 × 4
#> # Groups:   category [1]
#>   category variable description                                            value
#>   <chr>    <chr>    <chr>                                                  <chr>
#> 1 A        A        This is variable named as A which is responsible for … 123  
#> 2 A        B        This is variable named as B which is responsible for … 222.1

Upvotes: 4

Create table from wrapped text in R

Answers (4)

Related Questions