Reputation: 6874

How to delete phrase not specified in a gsub

I have a column in a dataframe that is full of text (of varying lengths) such as

'Nature of specimen= D2x4, stomach biopsies\nbalblablablabl\nabla\nSomeRandomText\nNature of specimen= Colonx2, polypx1\nMore Random Text\nNature of specimen= TIx2, polypx1\n'

I want to extract only the lines matching `Nature of specimen.*?\n`, so that I am left with:

Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1\n

I think I need to gsub everything that is not `Nature of specimen.*?\n`, but I don't know how to negate a whole regex. So far I have tried

`df$Text <- gsub("[^(Nature of specimen.*?\n)]", "", df$Text)`

but that just removes individual characters from the text rather than producing the intended output.
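
A small sketch of why the attempt behaves this way: `[^...]` is a bracket expression, so it negates a set of single characters rather than a whole pattern, and gsub deletes every character that is not listed inside the brackets. The shortened sample string and the names `x` and `attempt` are made up for illustration.

```r
# Shortened, made-up sample in the same shape as the column text.
x <- "Nature of specimen= D2x4\nabla\n"

# Everything between [^ and ] is read as a set of individual characters,
# so the quantifier and the parentheses are just literal members of the set.
attempt <- gsub("[^(Nature of specimen.*?\n)]", "", x)

# Characters that are not in the set (such as "=", "D", "2", "x", "4",
# "b" and "l") are dropped, whichever line they sit on.
print(attempt)
```

The unwanted line is not removed as a unit; it is only thinned down to the characters it shares with the phrase.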

Upvotes: 0

Answers (3)

akrun

Reputation: 887501

We can also use the more efficient `stri_extract_all_regex` from stringi (assuming `str1` holds the sample string):

library(stringi)
paste(stri_extract_all_regex(str1, "Nature of specimen=.*\n")[[1]], collapse="")
#[1] "Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1\n"
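
The `[[1]]` above handles a single string. As a hedged sketch of the same extract-then-paste idea applied to every row of a column, here is a base R version using regmatches and gregexpr. The data frame `df`, its `Text` column (named as in the question) and the new `Specimen` column are assumptions for illustration.

```r
# Hypothetical data frame with a Text column, as described in the question.
df <- data.frame(
  Text = c(
    "Nature of specimen= D2x4, stomach biopsies\nSomeRandomText\nNature of specimen= Colonx2, polypx1\n",
    "No matching line here\n"
  ),
  stringsAsFactors = FALSE
)

# Match the phrase, then everything up to and including the next newline.
pattern <- "Nature of specimen=[^\n]*\n"

# gregexpr() locates every match in each element;
# regmatches() returns them as a list with one character vector per row.
hits <- regmatches(df$Text, gregexpr(pattern, df$Text))

# Paste the matches of each row back together; rows with no match become "".
df$Specimen <- vapply(hits, paste, character(1), collapse = "")
```

Writing to a new column keeps the original text available for checking; assigning to `df$Text` instead would overwrite it in place.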

Upvotes: 1

Sandipan Dey

Reputation: 23109

This should also work:

library(stringr)
str_match_all(text, ".*(Nature\\s+of\\s+specimen[^\\n]+)\\n")[[1]][,2]
# [1] "Nature of specimen= D2x4, stomach biopsies" "Nature of specimen= Colonx2, polypx1"       "Nature of specimen= TIx2, polypx1"

Upvotes: 0

joel.wilson

Reputation: 8413

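Not a regex solution (I'm terrible at those), but here is one using strsplit.

Basically I'm splitting the text on "\n", keeping only the pieces that start with "Nature of specimen", and pasting them back together. Selecting every alternate value with c(TRUE,FALSE) would only work if the wanted and unwanted lines strictly alternated, which they do not in the sample.

lines <- unlist(strsplit(x, "\n"))
paste0(lines[startsWith(lines, "Nature of specimen")], collapse = "\n")
# [1] "Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1"

If a regex is acceptable, str_extract_all from stringr gives the same lines, each keeping its trailing "\n":

library(stringr)
paste0(unlist(str_extract_all(x, pattern = "Nature of specimen=.*\n")), collapse = "")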
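
For the deletion approach the question title asks about, one possible sketch is a gsub with a negative lookahead, which removes every line that does not start with the phrase. This is not taken from any of the answers here; it assumes `perl = TRUE` (PCRE syntax) and uses `x` for the sample string from the question and `cleaned` as a made-up result name.

```r
x <- "Nature of specimen= D2x4, stomach biopsies\nbalblablablabl\nabla\nSomeRandomText\nNature of specimen= Colonx2, polypx1\nMore Random Text\nNature of specimen= TIx2, polypx1\n"

# (?m)                     lets ^ match at the start of every line
# (?!Nature of specimen)   negative lookahead: the line must NOT start with the phrase
# .*\n?                    consumes the rest of that line plus its newline, if any
cleaned <- gsub("(?m)^(?!Nature of specimen).*\n?", "", x, perl = TRUE)

print(cleaned)
```

Because gsub is vectorised, the same call should work directly on a column, for example `df$Text <- gsub("(?m)^(?!Nature of specimen).*\n?", "", df$Text, perl = TRUE)`.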

Upvotes: 2
