Beginner

Reputation: 282

Extracting Elements from text files in R

I am trying to get into text analysis in R. I have a text file with the following structure.

HD  A YEAR Oxxxx
 WC 244 words
 PD 28 February 2018
 SN XYZ
 SC hydt
 LA English
 CY Copyright 2018 

 LP Rio de Janeiro, Feb 28



TD
   With recreational cannabis only months away from legalization in Canada, companies are racing to
   prepare for the new market. For many, this means partnerships, supply agreements,

I want to extract the PD and TD elements in R and save them into a table.

I have tried the following, but I cannot get it to work.

Extract PD

library(stringr)
library(tidyverse)

pd <- unlist(str_extract_all(txt, "\\bPD\\b\t[0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s"))
pd <- str_replace_all(pd, "\\bPD\\b\t", "")
if (length(pd) == 0) {
  pd <- as.character(NA)
}
pd <- str_trim(pd)
pd <- as.Date(strptime(pd, format = "%d %B %Y"))

Extract TD

td <- unlist(str_extract_all(txt, "\\bTD\\b[\\t\\s]*?.+?\\bCO\\b"))
td <- str_replace_all(td, "\\bTD\\b[\\t\\s]+?", "")
td <- str_replace_all(td, "\\bCO\\b", "")
td <- str_replace_all(td, "\\s+", " ")
if (length(td) == 0) {
  td <- as.character(NA)
}

I would like a table like the following:

PD                        TD
28 February 2018          With recreational cannabis only months away from 
                          legalization in Canada, companies are racing to
                          prepare for the new market. For many, this means 
                          partnerships, supply agreements, Production hit a 
                          record 366.5Mt

Any help would be appreciated. Thank you

Upvotes: 0

Views: 74

Answers (1)

akraf

Reputation: 3235

I had to add a few characters to the end of your data set, which I inferred from your regexes:

txt <- "HD  A YEAR Oxxxx
 WC 244 words
 PD 28 February 2018
 SN XYZ
 SC hydt
 LA English
 CY Copyright 2018 

 LP Rio de Janeiro, Feb 28



TD
   With recreational cannabis only months away from legalization in Canada, companies are racing to
   prepare for the new market. For many, this means partnerships, supply agreements,
CO ...further stuff"

Dirty

The dirty solution to your problems is probably:

  • For the date field, fix the regex so that it expects a space rather than a tab after the PD text. E.g. "\\bPD\\b [0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s" works for me.
  • For the TD field, let . match line breaks as well, so that the match can span several lines, by using the dotall= option (see ?stringr::regex):

    td <- unlist(str_extract_all(txt, regex("\\bTD\\b[\\t\\s]*?.+?\\bCO\\b", dotall=TRUE)))
    
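Putting both fixes together, a minimal sketch of the whole extraction could look like this. The shortened sample text, the trailing CO line (inferred from the regexes in the question) and the name result are assumptions for illustration; str_remove and str_squish are stringr helpers used here in place of the str_replace_all calls from the question.

```r
library(stringr)

# Shortened sample input; the trailing "CO" line is an assumption about
# what follows the TD block in the real file.
txt <- "HD  A YEAR Oxxxx
 WC 244 words
 PD 28 February 2018
 SN XYZ

TD
   With recreational cannabis only months away from legalization in Canada, companies are racing to
   prepare for the new market. For many, this means partnerships, supply agreements,
CO ...further stuff"

# PD: a literal space instead of \t after the marker
pd <- str_extract(txt, "\\bPD\\b [0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s")
pd <- str_trim(str_remove(pd, "\\bPD\\b"))

# TD: dotall = TRUE lets "." cross line breaks up to the CO marker
td <- str_extract(txt, regex("\\bTD\\b.+?\\bCO\\b", dotall = TRUE))
td <- str_remove(td, "^TD")   # drop the leading marker
td <- str_remove(td, "CO$")   # drop the trailing marker
td <- str_squish(td)          # collapse line breaks and runs of spaces

# str_extract() returns NA when nothing matches, so no length check is needed
result <- data.frame(PD = pd, TD = td, stringsAsFactors = FALSE)
```

This handles a single record per string; for a file with many records, the same calls would have to be applied per record.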

Maybe shorter regexes are better?

However, I would recommend capturing the characteristics of your input format only in as much detail as needed. For example, I would not check the date format via a regex. Just search for "^ PD.*" (with multiline = TRUE, so that ^ matches at the start of each line) and let R try to parse the result. If the text does not fit the date format, as.Date() with an explicit format returns NA, so you will notice anyway.
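As a rough sketch of this idea (the shortened sample string and the variable names are only for illustration):

```r
library(stringr)

txt <- "HD  A YEAR Oxxxx\n WC 244 words\n PD 28 February 2018\n SN XYZ\n"

# multiline = TRUE lets ^ match at the start of every line;
# "." does not match line breaks here, so .* stops at the end of the PD line
pd_line <- str_extract(txt, regex("^ PD.*", multiline = TRUE))
pd_text <- str_trim(str_remove(pd_line, "^ PD"))

# Let R do the validation: as.Date() with an explicit format gives NA
# when the text does not fit. Note that %B depends on the locale; this
# assumes English month names.
pd <- as.Date(pd_text, format = "%d %B %Y")
```

The regex no longer encodes what a date looks like, so a change in the date layout only requires touching the format string.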

To extract a text block whose lines start with several spaces, like the one after the TD marker, use the multiline= option so that ^ matches at the beginning of every line, not only the first. E.g.

str_extract_all(txt, regex("^TD\\s+(^\\s{3}.*\\n)+", multiline = TRUE))

(Note that the regex class \s includes \n, so I do not need to match the line break after the TD marker explicitly.)
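To turn the matched block into a single clean string, something along these lines should work. This is a sketch on a shortened, made-up sample; it relies on the last indented line ending in a line break, which holds when another field such as CO follows.

```r
library(stringr)

# Shortened sample; the CO line after the block is an assumption
txt <- "LP Rio\n\nTD\n   With recreational cannabis only months away,\n   companies are racing.\nCO ...further stuff"

# Match the TD marker and every following line indented by three spaces
block <- str_extract(txt, regex("^TD\\s+(^\\s{3}.*\\n)+", multiline = TRUE))

# Drop the marker, then collapse the indentation and line breaks
td <- str_squish(str_remove(block, "^TD"))
```

Unlike the dotall variant, this does not depend on which field comes after TD, only on the indentation of the text lines.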

Careful if fields are missing

Finally, your current approach might assign the wrong dates to the texts if one of the TD or PD fields is ever missing in the input! A for loop in combination with readLines, instead of regex matching, might help with this.
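One possible shape for such a loop is sketched below. It is not a tested solution for the real file: it assumes that every record starts with an HD line and that TD text lines are indented by three spaces, and the two sample records are made up (the second one deliberately has no PD line).

```r
# Two hypothetical records; the second one has no PD line
raw <- c(
  "HD  First article",
  " PD 28 February 2018",
  "TD",
  "   With recreational cannabis only months away,",
  "   companies are racing to prepare.",
  "CO ...further stuff",
  "HD  Second article",
  "TD",
  "   Text without a date.",
  "CO ...further stuff"
)
con <- textConnection(raw)   # for a real file: lines <- readLines("your_file.txt")
lines <- readLines(con)
close(con)

pd <- character(0)
td <- character(0)
n <- 0          # index of the current record
in_td <- FALSE  # TRUE while we are inside a TD block

for (line in lines) {
  if (startsWith(line, "HD")) {
    # Assumption: HD opens a new record; reserve NA for both fields
    n <- n + 1
    pd[n] <- NA_character_
    td[n] <- NA_character_
    in_td <- FALSE
  } else if (n > 0 && startsWith(line, " PD")) {
    pd[n] <- trimws(sub("^ PD", "", line))
    in_td <- FALSE
  } else if (n > 0 && startsWith(line, "TD")) {
    in_td <- TRUE
  } else if (in_td && startsWith(line, "   ")) {
    # Indented lines belong to the current TD block
    piece <- trimws(line)
    td[n] <- if (is.na(td[n])) piece else paste(td[n], piece)
  } else {
    in_td <- FALSE
  }
}

result <- data.frame(PD = pd, TD = td, stringsAsFactors = FALSE)
```

Because both fields are filled per record, a missing PD stays NA in its own row instead of shifting the dates of later records.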

Upvotes: 2
