Reputation: 6874
I have a column in a dataframe that is full of text (of varying lengths) such as
'Nature of specimen= D2x4, stomach biopsies\nbalblablablabl\nabla\nSomeRandomText\nNature of specimen= Colonx2, polypx1\nMore Random Text\nNature of specimen= TIx2, polypx1\n'
I want to extract only the parts matching `Nature of specimen.*?\n`
so that I am left with :
Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1\n
I think I need to `gsub` everything that does not match `Nature of specimen.*?\n`,
but I don't know how to negate a whole regex. So far I have tried
`df$Text<-gsub("[^(Nature of specimen.*?\n)]","",df$Text)`
but that just removes every individual character listed inside the brackets from the text, rather than giving the intended output.
Upvotes: 0
Views: 61
Reputation: 887501
We can also use the more efficient `stri_extract_all_regex` from `stringi`:
library(stringi)
paste(stri_extract_all_regex(str1, "Nature of specimen=.*\n")[[1]], collapse="")
#[1] "Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1\n"
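The question works on a data frame column rather than a single string, so the extraction has to run once per row. Below is a minimal base R sketch of that, not a definitive implementation. The names `df` and `Text` come from the question; the sample rows and the `Specimen` column are made up for illustration. It uses `[^\n]*` instead of `.*` so that each match stays on one line regardless of how the regex engine treats `.` and newlines.

```r
# Hypothetical sample data; `df` and `Text` follow the names used in the question.
df <- data.frame(
  Text = c(
    "Nature of specimen= D2x4, stomach biopsies\nSomeRandomText\nNature of specimen= Colonx2, polypx1\n",
    "No specimen line here\n"
  ),
  stringsAsFactors = FALSE
)

# Label, then anything up to the end of the line, then the newline itself.
pattern <- "Nature of specimen=[^\n]*\n"

# gregexpr() locates every match per element; regmatches() returns them as a list.
matches <- regmatches(df$Text, gregexpr(pattern, df$Text))

# Collapse each row's matches into one string (empty string if nothing matched).
df$Specimen <- vapply(matches, paste, character(1), collapse = "")
```

Rows without any matching line end up with an empty string rather than `NA`, which may or may not be what you want downstream.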
Upvotes: 1
Reputation: 23109
This should also work:
library(stringr)
str_match_all(text, ".*(Nature\\s+of\\s+specimen[^\\n]+)\\n")[[1]][,2]
# [1] "Nature of specimen= D2x4, stomach biopsies" "Nature of specimen= Colonx2, polypx1" "Nature of specimen= TIx2, polypx1"
Upvotes: 0
Reputation: 8413
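Not a regex solution (I'm terrible at those), but here is one using `strsplit`:
Basically, I split the text on "\n", select every other element, and paste the result back together. Note that this only works when the wanted lines sit at alternating (odd) positions.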
paste0(unlist(strsplit(x, "\n"))[c(TRUE,FALSE)], collapse = "\n")
[1] "Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1"
Or with a regex, using `str_extract_all` from stringr:
library(stringr)
paste0(unlist(str_extract_all(x, pattern = "Nature of specimen=.*\n")), collapse = "")
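The positional `strsplit` approach depends on where the wanted lines fall. A sketch that splits the same way but filters on content instead, in base R only, might look like this. It assumes the sample string from the question is stored in `x`; `lines`, `keep` and `result` are illustrative names.

```r
x <- "Nature of specimen= D2x4, stomach biopsies\nbalblablablabl\nabla\nSomeRandomText\nNature of specimen= Colonx2, polypx1\nMore Random Text\nNature of specimen= TIx2, polypx1\n"

# Split into lines; strsplit() drops the empty piece after the final "\n".
lines <- strsplit(x, "\n", fixed = TRUE)[[1]]

# Keep only lines that start with the label, wherever they occur.
keep <- lines[startsWith(lines, "Nature of specimen=")]

# Re-attach the newline to each kept line and join them into one string.
result <- paste0(keep, "\n", collapse = "")
```

Because the filter looks at the content of each line, it does not matter how many unrelated lines sit between two specimen lines.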
Upvotes: 2