user3618451
user3618451

Reputation: 325

Matching strings with str_detect and regex

I have a data frame containing unstructured text. In this reproducible example, I'm downloading a 10K company filing directly from the SEC website and loading it with read.table.

dir = getwd(); setwd(dir)

download.file("https://www.sec.gov/Archives/edgar/data/2648/0000002648-96-000013.txt", file.path(dir,"filing.txt"))

filing <- read.table(file=file.path(dir, "filing.txt"), sep="\t", quote="", comment.char="")
droplevels.data.frame(filing)

I want to remove the SEC header in order to focus on the main body of the document (starting in row 216) and divide my text into sections/items.

> filing$V1[216:218]
[1] PART I 
[2] Item  1.    Business.                                                                                                                                                      
[3]            A.  Organization of Business                                      

Therefore, I'm trying to match strings starting with the word Item (or ITEM) followed by one or more spaces, one or two digits, a dot, one or more spaces and one or more words. For example:

Item 1. Business.
ITEM 1.  BUSINESS
Item  1.  Business
Item 10. Directors and Executive Officers of 
ITEM  10. DIRECTORS AND EXECUTIVE OFFICERS OF THE REGISTRANT

My attempt involves str_detect and regex in order to create a variable count that jumps each time there is a string match.

library(dplyr)
library(stringr)
tidy_filing <- filing %>% mutate(count = cumsum(str_detect(V1, regex("^Item [\\d]{1,2}\\.",ignore_case = TRUE)))) %>% ungroup()

However, I'm missing the first 9 Items and my count starts only with Item 10.

tidy_filing[c(217, 218,251:254),]
       V1                               count
217    Item  1.   Business.               0
218    A.  Organization of Business 3     0
251                          PART III     0
252    Item 10.   Directors etc.   38     1
253    Item 11.   Executive Compens. 38   2
254    Item 12.   Security Ownership. 38  3           

Any help would be highly appreciated.

Upvotes: 0

Views: 879

Answers (1)

Andrew Gustar
Andrew Gustar

Reputation: 18425

The problem is that the single digit items have double spaces in order to align with the two digit ones. You can get round this by changing your regex string to

"^Item\\s+\\d{1,2}\\."

Upvotes: 1

Related Questions