astaines
astaines

Reputation: 922

Use a text dataframe as input to read_delim

I am reading data from many messy pdf's. The end result, for each pdf, is an R dataframe, with one variable, which I call text. Each row of the dataframe contains a textual representation of one row of the original pdf table. The elements within each row are separated with the '|' character. One of these looks like this :-

Total number of confirmed cases|965
nd
Number of new cases 22 March|129
Total number hospitalised|277|28.7
Total number admitted to ICU|36|3.7
Total number of deaths|6
Case Fatality Ratio (CFR )|0.6
Total number of outbreaks or clusters|44
Total number cases associated with clusters|243|25.2 Total number of imported cases|201|20.8
Number of cases in HCW|247|25.6
Median age (years)|45
Age range (years)|0-95

and it can be recreated from this :-

structure(list(Text = c("Total number of confirmed cases|965", 
"nd", "Number of new cases 22 March|129", "Total number hospitalised|277|28.7", 
"Total number admitted to ICU|36|3.7", "Total number of deaths|6", 
"Case Fatality Ratio (CFR )|0.6", "Total number of outbreaks or clusters|44", 
"Total number cases associated with clusters|243|25.2", "Total number of imported cases|201|20.8", 
"Number of cases in HCW|247|25.6", "Median age (years)|45", "Age range (years)|0-95"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-13L))

At the moment, I write this out using write_csv(), and read it back in using read_delim() this :-

write_csv(HPSC,'data/temp.csv')

HPSC <- read_delim('data/temp.csv', delim='|',
                 skip=1, na = c('-'),
                 col_names = c('Characteristic','Number','Pct')) %>%
mutate(Date = dmy('08.04.2020')) # In the real version, the date comes from elsewhere.

It's fiddly, but not very hard to stick a lot of these together, and get my final result - essentially a dataframe with one row per pdf.

So, this all works fine. I need the full functionality of read_delim because otherwise I have to write my own code to identify separators, skip lines, and to handle the fact that there are varying numbers of items on each line. I'm quite happy with what I'm doing, but I was vaguely surprised that I couldn't just feed a one column dataframe to read_delim, and have it do its stuff.

Is there a way to do this?

Upvotes: 0

Views: 160

Answers (1)

Jake
Jake

Reputation: 96

Your intuition is correct--you can pass your column to read_delim without saving it as a CSV first, but you need to pass it as a vector, not a table. Using your data structure I believe I've produced your requested functionality here:

x = structure(list(Text = c("Total number of confirmed cases|965", 
                            "nd", "Number of new cases 22 March|129", "Total number hospitalised|277|28.7", 
                            "Total number admitted to ICU|36|3.7", "Total number of deaths|6", 
                            "Case Fatality Ratio (CFR )|0.6", "Total number of outbreaks or clusters|44", 
                            "Total number cases associated with clusters|243|25.2", "Total number of imported cases|201|20.8", 
                            "Number of cases in HCW|247|25.6", "Median age (years)|45", "Age range (years)|0-95"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
                                                            -13L))

read_delim(x$Text, delim='|',
           skip=1, na = c('-'),
           col_names = c('Characteristic','Number','Pct')) %>%
  mutate(Date = dmy('08.04.2020'))

See how the 'Text' column is selected from the object.

Upvotes: 1

Related Questions