R: Read text files with blanks and unequal number of columns

Question

I am trying to read many text files into R using read.table. Most of the time we have clean text files which have defined columns.

The data that I am trying to read comes from ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt

You can see that the blanks and length of text files varies by report. ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/100917_livecattle.txt

My objective is to read many of these text files and combine them into a dataset.

If I can read one of the them then compiling should not be an issue. However, I am running into several issues because of the format of the text file:

1) the number of FIRMS vary from report to report. For example, sometimes there will be 3 rows (i.e. 3 firms that did business on that data) of data to import and sometimes there may be 10.

2) Blanks are being recognized. For example, under the FIRM section there should be a column for Deliveries (DEL) and Receipts (REC). The data when it is read in THIS section should look like:

df <- data.frame("FIRM_#" = c(407, 685, 800, 905), 
  "FIRM_NAME" = c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC", "ROSENTHAL COLLINS LL", "ADM INVESTOR SERVICE"),
  "DEL" = c(1,1,15,1), "REC"= c(NA,18,NA,NA))

however when I read this in the fomatting is all messed up and does not put NA for the blank values

3) The above issues apply for "YARDS" and "FUTURE DELIVERIES SCHEDULED" section of the text file.

I have tried to read in sections of the text file and then format it accordingly but since the the number of firms change day to day the code does not generalize.

Any help would greatly be appreciated.

Manuel Bickel · Accepted Answer

Here an answer which starts from the scratch via rvest for downloading data and includes lots of formatting. The general idea is to identify fixed widths that may be used to separate columns - I used a little help from SO for this purpose link.

You could then use read.fwf() in combination with cat()and tempfile(). In my first attempt this did not work, due to some formatting issues, so I added some additional lines to get the final table format.

Maybe there are some more elegant options and shortcuts I have overseen, but at least, my answer should get you started. Of course, you will have to adapt the selection of lines, identification of widths for spliting tables depending on what parts of the data you need. Once this is settled, you may loop through all the websites to gather data. I hope this helps...

library(rvest)
library(dplyr)

page <- read_html("ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt")

table <- page %>%
  html_text("pre") %>%
  #reformat by splitting on line breakes
  { unlist(strsplit(., "
")) } %>%
  #select range based on strings in specific lines
  "["(.,(grep("FIRM #", .):(grep("        DELIVERIES SCHEDULED", .)-1))) %>%
  #exclude empty rows
  "["(., !grepl("^\s+$", .)) %>%
  #fix width of table to the right
  { substring(., 1, nchar(gsub("\s+$", "" , .[1]))) } %>%
  #strip white space on the left
  { gsub("^\s+", "", .) }


headline <- unlist(strsplit(table[1], "\s{2,}"))

get_split_position <- function(substring, string) {

   nchar(string)-nchar(gsub(paste0("(^.*)(?=", substring, ")"), "", string , perl=T))

}

#exclude first element, no split before this element
split_positions <- sapply(headline[-1], function(x) {

   get_split_position(x, table[1])

})


#exclude headline from split
table <- lapply(table[-1], function(x) {

  substring(x,  c(1, split_positions + 1),  c(split_positions, nchar(x)))

})

table <- do.call(rbind, table)
colnames(table) <- headline

#strip whitespace
table <- gsub("\s+", "", table)

table <- as.data.frame(table, stringsAsFactors = FALSE)
#assign NA values
table[ table == "" ] <- NA
#change column type
table[ , c("FIRM #", "DEL", "REC")] <- apply(table[ , c("FIRM #", "DEL", "REC")], 2,  as.numeric)

table
# FIRM #          FIRM NAME DEL REC
# 1    407      STRAITSFINLLC   1  NA
# 2    685   R.J.O'BRIENASSOC   1  18
# 3    800 ROSENTHALCOLLINSLL  15  NA
# 4    905 ADMINVESTORSERVICE   1  NA

R: Read text files with blanks and unequal number of columns

Answers (1)

Related Questions