How to extract tables from partly unstructured txt files in R?

Question

I have a list of URLs for txt files. The txt files are structured such that some parts are plain text, and some parts are tables. I want to extract the tables and export them to a data frame. Below is an example of a URL:

https://www.sec.gov/Archives/edgar/data/1000275/0001140361-13-007449.txt

The txt files are structured such that the tables begin with

and end with

. I would like to combine all the tables. I have tried using read.delim but I don't know how to use it for just the tables. Below is an example of the expected output. I would appreciate any guidance on how to proceed with my project.

Example of current df:
+----+--------------------------------------------------------------------------+
| ID |                                   URL                                    |
+----+--------------------------------------------------------------------------+
|  1 | https://www.sec.gov/Archives/edgar/data/1000097/0000919574-13-001835.txt |
|  2 | https://www.sec.gov/Archives/edgar/data/1000275/0001140361-13-007449.txt |
|  3 | https://www.sec.gov/Archives/edgar/data/1000742/0000898432-13-000218.txt |
+----+--------------------------------------------------------------------------+

Example of txt file from url:

text text text
text text text
text text text


+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
|   NAME OF ISSUER    | TITLE OF CLASS |   CUSIP   | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ABBVIE INC          | COM            | 00287Y109 |        1,547 |          45,300 | SHS    |          | Shared-Defined        | 1/2/3       |           45,300 |
| ABERCROMBIE & FITCH | CL A           | 002896207 |        4,797 |         100,000 | SHS    |          | Shared-Defined        | 1/2/3       |          100,000 |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+



+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
|   NAME OF ISSUER    | TITLE OF CLASS |   CUSIP   | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ABBVIE INC          | COM            | 00287Y109 |        1,547 |          45,300 | SHS    |          | Shared-Defined        | 1/2/3       |           45,300 |
| ABERCROMBIE & FITCH | CL A           | 002896207 |        4,797 |         100,000 | SHS    |          | Shared-Defined        | 1/2/3       |          100,000 |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+




Expected output:
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ID | NAME OF ISSUER | TITLE OF CLASS | CUSIP | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
|  1 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
|  1 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
|  1 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
|  2 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
|  2 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
|  2 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+

Limey · Accepted Answer

Here's a rough solution.

# Read the text files from the web
fileContents <- readr::read_file("https://www.sec.gov/Archives/edgar/data/1000275/0001140361-13-007449.txt")
# Extract the tables.  The regex isn't quite right, as it includes the starting 
# and ending  tags, but more complicated regexes failed.  Regex isn't my
# strong point, and I can handle the extra work
tables <- stringr::str_extract_all(
            fileContents, 
            stringr::regex("(?s)(.*?)", 
              multiline=TRUE, 
              dotall=TRUE
            )
          )

# Function to process a single tibble
toTibble <- function(y) {
  lines <- str_split_fixed(y, "
", n=Inf)
  colStarts <- c()
  colEnds <- c()
  # Scroll through to final table header
  for (i in 1:(length(lines)-1)) { # Final line is '' because of initial regex
    # Could probably to this with regexes, but my head is hurting
    if (any(!is.na(stringr::str_locate(lines[i], "<\w>")))) {
      # Define column widths based on locations of type markers.  THIS IS AN ASSUMPTION
      colStarts <- stringr::str_locate_all(lines[i], "<\w>")[[1]][,"start"]
      for (i in 1:(length(colStarts)-1)) colEnds[i] <- colStarts[i+1] -1
      colEnds[length(colStarts)] <- stringr::str_length(lines[i])
      lines <- lines[(i+1):(length(lines)-1)]
      data <- dplyr::bind_rows(
                lapply(
                  lines,                   # For each data line  
                  function(line) 
                    tibble::enframe(       # Split in to columns and convert to a tibble of name/value pairs
                      stringr::str_trim(
                        stringr::str_sub(
                          line, 
                          colStarts, 
                          colEnds
                        )
                      )
                    ) %>%                  # Convert from name/value pairs to columns
                    tidyr::pivot_wider(
                      values_from="value", 
                      names_from="name", 
                      names_prefix="Column"
                    )
                  )
                )
      # Finished
      return(data)
    }
  }
}

To process a single file:

firstTable <- toTibble(tables[[1]][[1]])
firstTable

Giving

# A tibble: 59 x 12
   Column1  Column2       Column3   Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12
                                                  
 1 AAR CORP COM           000361105 190     10158   SH      ""      DEFINED "2"     10158    0        0       
 2 AAR CORP COM           000361105 15      803     SH      ""      DEFINED "3"     0        0        803     
 3 AAR CORP COM           000361105 37      2000    SH      ""      DEFINED "5"     2000     0        0       
 4 AAR CORP COM           000361105 78      4200    SH      ""      DEFINED ""      4200     0        0       
 5 ABB LTD  SPONSORED ADR 000375204 2164    104112  SH      ""      DEFINED "3"     257      0        103855  
 6 ABB LTD  SPONSORED ADR 000375204 10774   518215  SH      ""      DEFINED "5"     518215   0        0       
 7 ABB LTD  SPONSORED ADR 000375204 64      3100    SH      ""      DEFINED "7"     0        3100     0       
 8 ABB LTD  SPONSORED ADR 000375204 1044    50200   SH      ""      DEFINED "8"     50200    0        0       
 9 ABB LTD  SPONSORED ADR 000375204 9       410     SH      ""      DEFINED "9"     410      0        0       
10 ABB LTD  SPONSORED ADR 000375204 103     4958    SH      ""      DEFINED "15"    4958     0        0       
# … with 49 more rows

On my system

system.time({firstTable <- toTibble(tables[[1]][[1]])})
   user  system elapsed 
  0.843   0.004   0.849

So under a second to process a table.

length(tables[[1]])
[1] 299

Just under 300 tables in the file, so binding all the tables into a single tibble

alldata <- bind_rows(lapply(tables[[1]], function(t) toTibble(t)))

Should take about five minutes to extract the tables, plus a little more to bind them all together. [This code not tested or timed.]

Once you're at this stage, you can play around with column types and names to get exactly what you want. That should be straightforward.

How to extract tables from partly unstructured txt files in R?

Answers (1)

Related Questions