Sebastian Zeki

Reputation: 6874

Multiline text extraction in R with stringr

I have a column in my data frame that contains free text.

I would like to extract the text after INDICATIONS FOR EXAMINATION and before the next capitalized line. In the example below the result would be 'Anaemia'

INDICATIONS FOR EXAMINATION
Anaemia

PROCEDURE PERFORMED
Gastroscopy (OGD)

I am having some trouble as I'm using stringr and I can't seem to get multiline matches. I have been using:

EoE$IndicationsFroExamination<-str_extract(EoE$Endo_ResultText, '(?<=INDICATIONS FOR EXAMINATION).*?[A-Z]+')

Upvotes: 1

Views: 1060

Answers (2)

krzyklo

Reputation: 46

It requires a little digging. You can use the regex() modifier function.

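  1. Use the multiline argument to switch on multiline matching, so that ^ and $ match at the start and end of every line rather than only at the start and end of the whole string: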
str_extract_all("a\nb\nc", "^.")
# [[1]]
# [1] "a"

str_extract_all("a\nb\nc", regex("^.", multiline = TRUE))
# [[1]]
# [1] "a" "b" "c"
  2. Be aware of the dotall argument, which lets "." (and therefore ".*") match newline characters as well:
str_extract_all("a\nb\nc", "a.")
# [[1]]
# character(0)

str_extract_all("a\nb\nc", regex("a.", dotall = TRUE))
# [[1]]
# [1] "a\n"

These are documented in stringi::stri_opts_regex(), which stringr::regex() passes arguments to.
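
Applied to the text in the question, the dotall option can be combined with a lookahead so that the match stops before the next heading. This is only a sketch: the sample string `txt` is rebuilt from the question, and it assumes that a heading is a line made up solely of capital letters and spaces, which may not hold for every report.

```r
library(stringr)

# Sample text rebuilt from the question (an assumption about the real column).
txt <- "INDICATIONS FOR EXAMINATION\nAnaemia\n\nPROCEDURE PERFORMED\nGastroscopy (OGD)"

# dotall = TRUE lets "." cross newlines. The lazy ".*?" then stops at the
# first point where the lookahead sees one or more newlines followed by a
# line of capitals and spaces (the assumed shape of a heading).
pattern <- regex(
  "(?<=INDICATIONS FOR EXAMINATION\\n).*?(?=\\n+[A-Z][A-Z ]+(\\n|$))",
  dotall = TRUE
)

indication <- str_trim(str_extract(txt, pattern))
indication
# [1] "Anaemia"
```

Because str_extract() is vectorised, the same call should work on a whole column such as EoE$Endo_ResultText. If no heading follows the section, this pattern finds no match and returns NA.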

Upvotes: 3

clmarquart

Reputation: 4721

I made the regular expression a bit more generic so it will match all occurrences, and used the str_extract_all function from stringr:

matches <- str_extract_all(str, "(?<=[A-Z]\n)([^\n]*)")

Which, given the string you provided, should return:

[[1]]
[1] "Anaemia"           "Gastroscopy (OGD)"
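
If you need to know which heading each value belongs to, one possible variation is to capture the heading as well and use it as the name. This is a sketch rather than part of the original answer: `txt`, `m` and `sections` are names made up for the example, and it makes the same assumption that a heading is a line of capital letters and spaces followed by a single line of content.

```r
library(stringr)

# Sample text rebuilt from the question.
txt <- "INDICATIONS FOR EXAMINATION\nAnaemia\n\nPROCEDURE PERFORMED\nGastroscopy (OGD)"

# (?m) makes ^ match at the start of every line.
# Group 1: a line of capitals and spaces (the heading).
# Group 2: the single line that follows it.
m <- str_match_all(txt, "(?m)^([A-Z][A-Z ]+)\\n([^\\n]*)")[[1]]

# Column 1 of the matrix is the full match; columns 2 and 3 are the groups.
sections <- setNames(m[, 3], m[, 2])
sections[["INDICATIONS FOR EXAMINATION"]]
```

Like the pattern above, this only picks up the first line after each heading, so multi-line sections would need a different pattern.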

Upvotes: 2
