Reputation: 101
Edited:
From text based on variable named a I would like to obtain a table in which description cell will be unwrapped.
a <-
"
category variable description value
A A This is variable named as A 123
which is responsible for sth
B This is variable named as B 222.1
which is responsible for sth
else
B A This is sth 2
out of 4
Other c Other va This is variable named as 222
ategory riable other variable which can be
nullable
Other va This is variable named as 0
riable A other variable A which can
be nullable
"
Result which I want to have:
Upvotes: 2
Views: 336
Reputation: 101
I found a way using the idea of @wibeasley in which I generate record_id and then group by this value.
df <- readr::read_fwf(file = a, col_positions = column_widths, skip = 2)
df <- data.table(df)
df[, notnullvar := is.na(variable)]
df[, notnullval := is.na(value)]
df$record_id = 1
for(i in 2:nrow(df)){ #not empty variable cell and not empty value then add +1 to record_id
if(df[i,]$notnullvar == FALSE & df[i,]$notnullval == FALSE){
df[i,]$record_id <- df[i-1,]$record_id + 1
}else{
df[i,]$record_id <- df[i-1,]$record_id
}
}
df <- df[, .(category = paste0(category, collapse = ""),
variable = paste0(variable, collapse = ""),
description = paste0(description, collapse = " "),
value = paste0(value, collapse = " ")
), by = record_id]
df[,2:5] <- lapply(df[,2:5], function(record_id) trimws(gsub("NA","",record_id)))
The problem was that I would like to scrape table from pdf files in which tables could differ (column widths or wrapped text in various format).
Therefore, it would be great to prepare dynamic column width which always contains 4 columns. I think I can find the length of string to the name of next column name for example category variable
string contains two columns category and variable and category width is 14 and variable 10.
Upvotes: 0
Reputation: 5287
I'm stumped, now that the example has changed so all cells can wrap to new lines. Hopefully @stefan has an idea.
A few rough suggestions.
Can you control how the text is created? Your OP was so thorough that I assume you explored & eliminated that possibility. But if you have control (such as widening the output parameters), that's the easiest and most direct.
I think you need at least one column to indicate when a new line is a new record, and when a new line is a continuation of the same record. In the first iteration of your OP, the variable
column signaled this.
This new indicator variable (called record_id
) would have the values {1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5}. See how lines 1 & 2 belong to record 1, and lines 3-5 belong to record 2.
You might programmatically modify the a
string somehow. Maybe start with readr::read_lines()
and use a regex to indicate what lines represent new records. Then use a typical readr function to convert array of strings to a data.frame.. We'd need to know more about how regular/repeatable the values are for the first two columns. I see you were considering this initially, given your [gsub] tag.
Or read everything with as a fwf first, and use rematch2 to create the variable to indicate whether the record is new or a continuation. It all depends if there's a good pattern to the wrapping cells of variable
.
Does a
represent a file on disk? How consistent are the line wraps? If record_id
is the same for all files, then you could manually determine record_id
once, and add it to each dataset. Then proceed similarly to the answers by @stefan and me.
Upvotes: 1
Reputation: 5287
This is similar to @stefans. The main difference is this way requires you to specify column_widths
with readr::fwf_cols()
. (Which may be an advantage or disadvantage, depending on the consistency/stability of your incoming data files.)
a <-
"category variable description value
A A This is variable named as A 123
which is responsible for sth
B This is variable named as B 222.1
which is responsible for sth
else
"
column_widths <-
readr::fwf_cols(
category = 13,
variable = 8,
description = 32,
value = 10
)
I(a) |>
readr::read_fwf(
col_positions = column_widths,
skip = 1 # Because the headers are defined in `column_widths`
) |>
tidyr::fill(category, variable) |>
dplyr::mutate(
value = as.character(value),
value = dplyr::coalesce(value, "")
) |>
dplyr::group_by(category, variable) |>
dplyr::summarize(
description = paste0(description, collapse = " "),
value = as.numeric(paste0(value, collapse = " ")),
) |>
dplyr::ungroup()
Output:
# A tibble: 2 x 4
category variable description value
<chr> <chr> <chr> <dbl>
1 A A This is variable named as ~ 123
2 A B This is variable named as ~ 222.
Upvotes: 1
Reputation: 124038
One option to achieve your desired result would be read your variable as a fixed width file using e.g. readr::read_fwf
and some additional data wrangling steps where I make use of tidyr
and dplyr
:
library(dplyr)
library(tidyr)
library(readr)
df <- readr::read_fwf(file = a, skip = 1)
names(df) <- unlist(df[1, ])
df <- df[-1,]
df %>%
filter(!is.na(description)) %>%
tidyr::fill(category, variable) %>%
group_by(category, variable) %>%
summarise(description = paste(description, collapse = " "), value = value[!is.na(value)])
#> `summarise()` has grouped output by 'category'. You can override using the `.groups` argument.
#> # A tibble: 2 × 4
#> # Groups: category [1]
#> category variable description value
#> <chr> <chr> <chr> <chr>
#> 1 A A This is variable named as A which is responsible for … 123
#> 2 A B This is variable named as B which is responsible for … 222.1
Upvotes: 4