Reputation: 325
I have a data frame containing unstructured text. In this reproducible example, I'm downloading a 10K company filing directly from the SEC website and loading it with read.table.
dir = getwd(); setwd(dir)
download.file("https://www.sec.gov/Archives/edgar/data/2648/0000002648-96-000013.txt", file.path(dir,"filing.txt"))
filing <- read.table(file=file.path(dir, "filing.txt"), sep="\t", quote="", comment.char="")
droplevels.data.frame(filing)
I want to remove the SEC header in order to focus on the main body of the document (starting in row 216) and divide my text into sections/items.
> filing$V1[216:218]
[1] PART I
[2] Item 1. Business.
[3] A. Organization of Business
Therefore, I'm trying to match strings starting with the word Item (or ITEM) followed by one or more spaces, one or two digits, a dot, one or more spaces and one or more words. For example:
Item 1. Business.
ITEM 1. BUSINESS
Item 1. Business
Item 10. Directors and Executive Officers of
ITEM 10. DIRECTORS AND EXECUTIVE OFFICERS OF THE REGISTRANT
My attempt involves str_detect and regex in order to create a variable count that jumps each time there is a string match.
library(dplyr)
library(stringr)
tidy_filing <- filing %>% mutate(count = cumsum(str_detect(V1, regex("^Item [\\d]{1,2}\\.",ignore_case = TRUE)))) %>% ungroup()
However, I'm missing the first 9 Items and my count starts only with Item 10.
tidy_filing[c(217, 218,251:254),]
V1 count
217 Item 1. Business. 0
218 A. Organization of Business 3 0
251 PART III 0
252 Item 10. Directors etc. 38 1
253 Item 11. Executive Compens. 38 2
254 Item 12. Security Ownership. 38 3
Any help would be highly appreciated.
Upvotes: 0
Views: 879
Reputation: 18425
The problem is that the single digit items have double spaces in order to align with the two digit ones. You can get round this by changing your regex string to
"^Item\\s+\\d{1,2}\\."
Upvotes: 1