Extracting Elements from text files in R

Question

I am trying to get into text analysis in R. I have a text file with the following structure.

HD  A YEAR Oxxxx
 WC 244 words
 PD 28 February 2018
 SN XYZ
 SC hydt
 LA English
 CY Copyright 2018 

 LP Rio de Janeiro, Feb 28



TD
   With recreational cannabis only months away from legalization in Canada, companies are racing to
   prepare for the new market. For many, this means partnerships, supply agreements,

I want to extract the PD and TD elements in R and save them into a table.

I have tried the following, but I can't get it to work.

Extract PD

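library(stringr)
library(tidyverse)

# Backslashes must be doubled inside R string literals ("\\b", "\\s")
pd <- unlist(str_extract_all(txt, "\\bPD\\b\t[0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s"))
pd <- str_replace_all(pd, "\\bPD\\b\t", "")
if (length(pd) == 0) {
  pd <- as.character(NA)
}
pd <- str_trim(pd)
# Note: "%B" parses month names according to the current locale
pd <- as.Date(strptime(pd, format = "%d %B %Y"))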

Extract TD

td <- unlist(str_extract_all(txt, "\\bTD\\b[\\t\\s]*?.+?\\bCO\\b"))
td <- str_replace_all(td, "\\bTD\\b[\\t\\s]+?", "")
td <- str_replace_all(td, "\\bCO\\b", "")
td <- str_replace_all(td, "\\s+", " ")
if (length(td) == 0) {
  td <- as.character(NA)
}

I would like a table like the following:

PD                        TD
28 February 2018          With recreational cannabis only months away from 
                          legalization in Canada, companies are racing to
                          prepare for the new market. For many, this means 
                          partnerships, supply agreements, Production hit a 
                          record 366.5Mt

Any help would be appreciated. Thank you

Answers (1)

Dirty

Maybe shorter regexes are better?
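
As a rough sketch of what shorter regexes could look like (not necessarily what the answerer had in mind): capture everything after the PD tag up to the end of its line, and everything after the TD tag up to the end of the text. The sample txt below is a hypothetical stand-in built from the question's excerpt, and it assumes each file holds a single record in which TD is the last field.

```r
library(stringr)

# Hypothetical sample mirroring the structure shown in the question.
# In practice txt might come from paste(readLines(path), collapse = "\n").
txt <- paste(
  "HD  A YEAR Oxxxx",
  " WC 244 words",
  " PD 28 February 2018",
  " SN XYZ",
  "",
  "TD",
  "   With recreational cannabis only months away from legalization in Canada, companies are racing to",
  "   prepare for the new market. For many, this means partnerships, supply agreements,",
  sep = "\n"
)

# PD: (?m) makes ^ and $ match at line boundaries, so the capture group
# holds the rest of the line that follows the PD tag.
pd <- str_match(txt, "(?m)^\\s*PD\\s+(.+)$")[, 2]

# TD: (?s) lets "." match newlines, so the capture group runs from the
# TD tag to the end of the text (assumption: TD is the last field).
td <- str_match(txt, "(?s)\\bTD\\b\\s*(.*)")[, 2]

# str_squish() trims the ends and collapses internal runs of whitespace.
result <- data.frame(PD = str_trim(pd), TD = str_squish(td),
                     stringsAsFactors = FALSE)
print(result)
```

PD is left as a character string here, as in the desired table. Converting it with as.Date(pd, format = "%d %B %Y") should work too, but "%B" depends on the locale used for month names.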

Be careful if fields are missing.
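
One way to guard against missing fields, sketched under the assumption that each element of a character vector holds one record: str_match() returns NA for a string with no match, so the PD and TD columns stay aligned row by row. By contrast, unlist(str_extract_all(...)) silently drops records that have no match. The helper name extract_record and the two sample records are hypothetical.

```r
library(stringr)

# Hypothetical helper: one row per input record, NA where a tag is absent.
extract_record <- function(txt) {
  # Rest of the line after the PD tag ((?m): ^ and $ match per line).
  pd <- str_match(txt, "(?m)^\\s*PD\\s+(.+)$")[, 2]
  # Everything after the TD tag ((?s): "." also matches newlines).
  td <- str_match(txt, "(?s)\\bTD\\b\\s*(.*)")[, 2]
  data.frame(PD = str_trim(pd), TD = str_squish(td),
             stringsAsFactors = FALSE)
}

# Two made-up records: the second one has no PD line.
records <- c(
  " PD 28 February 2018\nTD\n   Some body text\n   on two lines",
  " SN XYZ\nTD\n   A record without a PD line"
)

result <- extract_record(records)
print(result)
```

The second row comes back with NA in PD rather than shifting the remaining values up, which is usually what you want when the result feeds into a table.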