Reputation: 282
I am trying to get into text analysis in R. I have a text file with the following structure.
HD A YEAR Oxxxx
WC 244 words
PD 28 February 2018
SN XYZ
SC hydt
LA English
CY Copyright 2018
LP Rio de Janeiro, Feb 28
TD
With recreational cannabis only months away from legalization in Canada, companies are racing to
prepare for the new market. For many, this means partnerships, supply agreements,
I want to extract the PD and TD fields in R and save them into a table.
I have tried the following, but I cannot get it to work.
Extract PD
library(stringr)
library(tidyverse)
pd <- unlist(str_extract_all(txt, "\\bPD\\b\t[0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s"))
pd <- str_replace_all(pd, "\\bPD\\b\t", "")
if (length(pd) == 0) {
pd <- as.character(NA)
}
pd <- str_trim(pd)
pd <- as.Date(strptime(pd, format = "%d %B %Y"))
Extract TD
td <- unlist(str_extract_all(txt, "\\bTD\\b[\\t\\s]*?.+?\\bCO\\b"))
td <- str_replace_all(td, "\\bTD\\b[\\t\\s]+?", "")
td <- str_replace_all(td, "\\bCO\\b", "")
td <- str_replace_all(td, "\\s+", " ")
if (length(td) == 0) {
td <- as.character(NA)
}
I would like a table like the following:
PD                 TD
28 February 2018   With recreational cannabis only months away from
                   legalization in Canada, companies are racing to
                   prepare for the new market. For many, this means
                   partnerships, supply agreements, Production hit a
                   record 366.5Mt
Any help would be appreciated. Thank you
Upvotes: 0
Views: 74
Reputation: 3235
I had to add a few characters to the end of your data set, which I inferred from your regexes:
txt <- "HD A YEAR Oxxxx
WC 244 words
PD 28 February 2018
SN XYZ
SC hydt
LA English
CY Copyright 2018
LP Rio de Janeiro, Feb 28
TD
With recreational cannabis only months away from legalization in Canada, companies are racing to
prepare for the new market. For many, this means partnerships, supply agreements,
CO ...further stuff"
The quick and dirty solution to your problems is probably as follows.
For the PD field, the text as posted has a space rather than a tab after PD. E.g. "\\bPD\\b [0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s" works for me.
For the TD field, let . match across line breaks by using the dotall= option (see ?stringr::regex):
td <- unlist(str_extract_all(txt, regex("\\bTD\\b[\\t\\s]*?.+?\\bCO\\b", dotall=TRUE)))
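For comparison, here is a minimal sketch of the same idea in base R rather than stringr, combined with the clean-up steps from the question. It assumes the inline flag (?s) with perl = TRUE as the dotall equivalent, and a shortened copy of the sample text:

```r
# Shortened copy of the sample; the trailing CO line is the inferred end marker.
txt <- "PD 28 February 2018
TD
With recreational cannabis only months away from legalization in Canada, companies are racing to
prepare for the new market. For many, this means partnerships, supply agreements,
CO ...further stuff"

# (?s) is the inline dotall flag: "." now also matches line breaks.
m <- regmatches(txt, regexpr("(?s)\\bTD\\b.+?\\bCO\\b", txt, perl = TRUE))

# Strip the two markers, then collapse whitespace runs to single spaces.
td <- sub("^TD\\s*", "", m)
td <- sub("\\s*CO$", "", td)
td <- gsub("\\s+", " ", td)

# Keep the NA fallback from the question for inputs without a TD block.
if (length(td) == 0) td <- NA_character_
```

Note that regexpr() only returns the first match; with several records in one string, gregexpr() would be the base R counterpart of str_extract_all().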
However, I would recommend you capture the characteristics of your input format only as precisely as needed. For example, I would not check the date format via a regex. Just search for "^ PD.*"
and let R try to parse the result. Parsing yields NA if the text does not match the expected format, so a malformed date will not go unnoticed.
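A minimal sketch of that looser approach, written in base R rather than stringr. It assumes the PD tag starts its line without leading whitespace, as in the sample above, and that month names are in English (%B is locale-dependent):

```r
# Shortened copy of the sample header fields.
txt <- "HD A YEAR Oxxxx
WC 244 words
PD 28 February 2018
SN XYZ"

# (?m) lets ^ and $ match at every line, so this grabs the whole PD line.
m <- regmatches(txt, regexpr("(?m)^PD[ \\t]+.*$", txt, perl = TRUE))
pd_raw <- sub("^PD[ \\t]+", "", m)

# No date regex: as.Date() does the validation and returns NA on a mismatch.
pd <- as.Date(pd_raw, format = "%d %B %Y")
```

Checking is.na(pd) afterwards is then enough to spot records whose date could not be parsed.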
To filter for a text block which starts with multiple spaces, like the one after the TD marker, you can use the multiline= option so that ^ matches every line beginning (not only the first). E.g.
str_extract_all(txt, regex("^TD\\s+(^\\s{3}.*\\n)+", multiline = TRUE))
(note that the regex class \s comprises \n, so I do not need to specify that explicitly after matching the TD line)
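To illustrate, here is a small base R sketch of the same multiline idea, using the inline flag (?m) with perl = TRUE instead of stringr's multiline= option. The sample text is hypothetical: it assumes the body lines are indented by exactly three spaces, which the pattern above relies on.

```r
# Hypothetical sample with a three-space indent on the TD body lines.
txt <- paste(
  "PD 28 February 2018",
  "TD",
  "   With recreational cannabis only months away,",
  "   companies are racing to prepare.",
  "CO ...further stuff",
  sep = "\n"
)

# (?m) makes ^ match at every line start; the group repeats over indented lines.
pattern <- "(?m)^TD[ \\t]*\\n(?:^ {3}.*\\n?)+"
block <- regmatches(txt, regexpr(pattern, txt, perl = TRUE))

# Drop the tag line, then collapse the indentation and line breaks.
td <- sub("^TD[ \\t]*\\n", "", block)
td <- trimws(gsub("\\s+", " ", td))
```

The match stops at the first line that is not indented, so no end marker such as CO is needed here.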
Finally, your current approach might assign the wrong dates to the text if one of the TD or PD fields is ever missing in the input! A for loop in combination with readLines instead of regex matching might help with this.
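A rough sketch of such a loop, under a few assumptions that are not confirmed by the question: every record starts with an HD line, tags are two capital letters at the start of a line, and the TD body runs until the next tag. The function name parse_records and the second sample record are made up for illustration.

```r
# Walk through the lines once and keep PD and TD aligned per record.
parse_records <- function(lines) {
  pd <- character(0)
  td <- character(0)
  i <- 0      # index of the current record
  tag <- ""   # most recently seen field tag

  for (line in lines) {
    if (grepl("^[A-Z]{2}(\\s|$)", line, perl = TRUE)) {
      tag <- substr(line, 1, 2)
      value <- trimws(substring(line, 3))
      if (tag == "HD") {
        # New record: both fields start as NA, so a missing field stays NA.
        i <- i + 1
        pd[i] <- NA_character_
        td[i] <- NA_character_
      }
      if (i == 0) next  # ignore anything before the first HD line
      if (tag == "PD") pd[i] <- value
      if (tag == "TD" && nzchar(value)) td[i] <- value
    } else if (tag == "TD" && i > 0 && nzchar(trimws(line))) {
      # Continuation line of the TD body.
      piece <- trimws(line)
      td[i] <- if (is.na(td[i])) piece else paste(td[i], piece)
    }
  }

  data.frame(PD = as.Date(pd, format = "%d %B %Y"), TD = td,
             stringsAsFactors = FALSE)
}

# In practice: lines <- readLines("articles.txt")  # hypothetical file name
lines <- c(
  "HD A YEAR Oxxxx",
  "PD 28 February 2018",
  "TD",
  "With recreational cannabis only months away from legalization in Canada,",
  "companies are racing to prepare for the new market.",
  "CO ...further stuff",
  "HD Second record without a PD line",  # made-up record to show the NA case
  "TD",
  "Some other text.",
  "CO ..."
)
result <- parse_records(lines)
```

One known limitation of this sketch: a body line that happens to begin with two capital letters followed by a space would be mistaken for a tag, so a list of the valid tags would be a safer test if you know them.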
Upvotes: 2