Reputation: 93
I have a list of URLs for txt files. The txt files are structured such that some parts are plain text, and some parts are tables. I want to extract the tables and export them to a data frame. Below is an example of a URL:
https://www.sec.gov/Archives/edgar/data/1000275/0001140361-13-007449.txt
The txt files are structured such that the tables begin with <TABLE>
and end with </TABLE>
. I would like to combine all the tables. I have tried using read.delim but I don't know how to use it for just the tables. Below is an example of the expected output. I would appreciate any guidance on how to proceed with my project.
Example of current df:
+----+--------------------------------------------------------------------------+
| ID | URL |
+----+--------------------------------------------------------------------------+
| 1 | https://www.sec.gov/Archives/edgar/data/1000097/0000919574-13-001835.txt |
| 2 | https://www.sec.gov/Archives/edgar/data/1000275/0001140361-13-007449.txt |
| 3 | https://www.sec.gov/Archives/edgar/data/1000742/0000898432-13-000218.txt |
+----+--------------------------------------------------------------------------+
Example of txt file from url:
text text text
text text text
text text text
<TABLE>
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| NAME OF ISSUER | TITLE OF CLASS | CUSIP | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ABBVIE INC | COM | 00287Y109 | 1,547 | 45,300 | SHS | | Shared-Defined | 1/2/3 | 45,300 |
| ABERCROMBIE & FITCH | CL A | 002896207 | 4,797 | 100,000 | SHS | | Shared-Defined | 1/2/3 | 100,000 |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
</TABLE>
<TABLE>
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| NAME OF ISSUER | TITLE OF CLASS | CUSIP | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ABBVIE INC | COM | 00287Y109 | 1,547 | 45,300 | SHS | | Shared-Defined | 1/2/3 | 45,300 |
| ABERCROMBIE & FITCH | CL A | 002896207 | 4,797 | 100,000 | SHS | | Shared-Defined | 1/2/3 | 100,000 |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
</TABLE>
Expected output:
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ID | NAME OF ISSUER | TITLE OF CLASS | CUSIP | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| 1 | x | x | x | x | x | x | x | x | x | x |
| 1 | x | x | x | x | x | x | x | x | x | x |
| 1 | x | x | x | x | x | x | x | x | x | x |
| 2 | x | x | x | x | x | x | x | x | x | x |
| 2 | x | x | x | x | x | x | x | x | x | x |
| 2 | x | x | x | x | x | x | x | x | x | x |
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
Upvotes: 0
Views: 216
Reputation: 12461
Here's a rough solution.
# Read the text files from the web
fileContents <- readr::read_file("https://www.sec.gov/Archives/edgar/data/1000275/0001140361-13-007449.txt")
# Extract the tables. The regex isn't quite right, as it includes the starting <TABLE>
# and ending </TABLE> tags, but more complicated regexes failed. Regex isn't my
# strong point, and I can handle the extra work
tables <- stringr::str_extract_all(
fileContents,
stringr::regex("(?s)<TABLE>(.*?)</TABLE>",
multiline=TRUE,
dotall=TRUE
)
)
# Function to process a single tibble
toTibble <- function(y) {
lines <- str_split_fixed(y, "\n", n=Inf)
colStarts <- c()
colEnds <- c()
# Scroll through to final table header
for (i in 1:(length(lines)-1)) { # Final line is '</TABLE>' because of initial regex
# Could probably to this with regexes, but my head is hurting
if (any(!is.na(stringr::str_locate(lines[i], "<\\w>")))) {
# Define column widths based on locations of type markers. THIS IS AN ASSUMPTION
colStarts <- stringr::str_locate_all(lines[i], "<\\w>")[[1]][,"start"]
for (i in 1:(length(colStarts)-1)) colEnds[i] <- colStarts[i+1] -1
colEnds[length(colStarts)] <- stringr::str_length(lines[i])
lines <- lines[(i+1):(length(lines)-1)]
data <- dplyr::bind_rows(
lapply(
lines, # For each data line
function(line)
tibble::enframe( # Split in to columns and convert to a tibble of name/value pairs
stringr::str_trim(
stringr::str_sub(
line,
colStarts,
colEnds
)
)
) %>% # Convert from name/value pairs to columns
tidyr::pivot_wider(
values_from="value",
names_from="name",
names_prefix="Column"
)
)
)
# Finished
return(data)
}
}
}
To process a single file:
firstTable <- toTibble(tables[[1]][[1]])
firstTable
Giving
# A tibble: 59 x 12
Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9 Column10 Column11 Column12
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 AAR CORP COM 000361105 190 10158 SH "" DEFINED "2" 10158 0 0
2 AAR CORP COM 000361105 15 803 SH "" DEFINED "3" 0 0 803
3 AAR CORP COM 000361105 37 2000 SH "" DEFINED "5" 2000 0 0
4 AAR CORP COM 000361105 78 4200 SH "" DEFINED "" 4200 0 0
5 ABB LTD SPONSORED ADR 000375204 2164 104112 SH "" DEFINED "3" 257 0 103855
6 ABB LTD SPONSORED ADR 000375204 10774 518215 SH "" DEFINED "5" 518215 0 0
7 ABB LTD SPONSORED ADR 000375204 64 3100 SH "" DEFINED "7" 0 3100 0
8 ABB LTD SPONSORED ADR 000375204 1044 50200 SH "" DEFINED "8" 50200 0 0
9 ABB LTD SPONSORED ADR 000375204 9 410 SH "" DEFINED "9" 410 0 0
10 ABB LTD SPONSORED ADR 000375204 103 4958 SH "" DEFINED "15" 4958 0 0
# … with 49 more rows
On my system
system.time({firstTable <- toTibble(tables[[1]][[1]])})
user system elapsed
0.843 0.004 0.849
So under a second to process a table.
length(tables[[1]])
[1] 299
Just under 300 tables in the file, so binding all the tables into a single tibble
alldata <- bind_rows(lapply(tables[[1]], function(t) toTibble(t)))
Should take about five minutes to extract the tables, plus a little more to bind them all together. [This code not tested or timed.]
Once you're at this stage, you can play around with column types and names to get exactly what you want. That should be straightforward.
Upvotes: 2