Sebastian Zeki

Reputation: 6874

Create column with sample of phrase list, if the phrase is absent from column text

Aim

I have a list of phrases and a dataframe with one column containing text. I want to add a new column to the dataframe that holds, for each row, a random sample (of random size) drawn from the phrase list, restricted to phrases that are not already present in that row's text.

The input dataframe:

structure(list(report = c("Biopsies of small bowel mucosa including Brunner's glands",
                            "These are fragments of small bowel mucosa which include Brunner's glands ",
                            "These are fragments of small bowel mucosa which include Brunner's glands There is no evidence of coeliac disease in these biopsies",
                            "There is coeliac disease here. ",
                            "Biopsies of specialisd gastric mucosa with moderate acute and active inflammation.",
                            "These are fragments of small bowel mucosa. The small bowel fragments are within normal limits"
  )), .Names = "report", row.names = c(NA, 6L), class = "data.frame")

The input list:

c("active inflammation", "coeliac disease","Brunner's glands")

My intended output:

 Phrase                                                                                                                                  List sample
Biopsies of small bowel mucosa including Brunner's glands                                                                             active inflammation
These are fragments of small bowel mucosa which include Brunner's glands                                                              active inflammation,coeliac disease
These are fragments of small bowel mucosa which include Brunner's glands There is no evidence of coeliac disease in these biopsies    active inflammation
There is coeliac disease here.                                                                                                        Brunner's glands
Biopsies of specialisd gastric mucosa with moderate acute and active inflammation                                                     coeliac disease,Brunner's glands
These are fragments of small bowel mucosa. The small bowel fragments are within normal limits                                         active inflammation

I have tried

  Final$mine<-ifelse(grepl(paste(ListCheck, collapse='|'), Final[,1], ignore.case=TRUE),print("Check here"),sample(ListCheck,replace=T))

However, this only checks whether any phrase from the list is present in a row; if none is, it picks a random phrase from the whole list rather than sampling only from the phrases that are absent from that row.
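To make the difference concrete, here is a minimal sketch contrasting the single "any phrase matches" test with a per-phrase test. It reuses the names Final and ListCheck from the attempt and only three of the reports; the names any_present and present are illustrative assumptions.

```r
# Assumed names: Final and ListCheck, as in the attempt above
Final <- data.frame(
  report = c(
    "Biopsies of small bowel mucosa including Brunner's glands",
    "There is coeliac disease here. ",
    "These are fragments of small bowel mucosa. The small bowel fragments are within normal limits"
  ),
  stringsAsFactors = FALSE
)
ListCheck <- c("active inflammation", "coeliac disease", "Brunner's glands")

# One logical per row: TRUE if ANY phrase occurs (this is what the attempt tests)
any_present <- grepl(paste(ListCheck, collapse = "|"), Final$report, ignore.case = TRUE)

# One logical per row and per phrase: TRUE if that particular phrase occurs
present <- sapply(ListCheck, function(p) grepl(p, Final$report, fixed = TRUE))

print(any_present)
print(present)
```

The second result is a rows-by-phrases logical matrix, which is the information needed to decide, row by row, which phrases are still eligible for sampling.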

Upvotes: 2

Views: 47

Answers (1)

erocoar

Reputation: 5893

You could first check, for each row, which phrases are not present (calling your data df):

input_list <- c("active inflammation", "coeliac disease", "Brunner's glands")
# For each report, keep only the phrases that do not occur in it
absent <- lapply(df$report, function(txt) {
  input_list[!sapply(input_list, function(p) grepl(p, txt, fixed = TRUE))]
})

Then, to get a random number of phrases, use a second sample to pick the count per row:

df$new <- sapply(absent, function(lst) {
  if (length(lst) == 0) return(NA_character_)  # every phrase already occurs
  paste(sample(lst, sample.int(length(lst), 1)), collapse = ", ")
})
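As a self-contained sketch of per-row sampling, the idea can be wrapped in a small helper and run on the question's data. The helper name sample_absent and its arguments are hypothetical, and matching is assumed to be literal and case-sensitive (fixed = TRUE); switch to a regex with ignore.case = TRUE if that fits the data better.

```r
# Hypothetical helper; the name and arguments are illustrative only
sample_absent <- function(reports, phrases) {
  vapply(reports, function(txt) {
    # Phrases that do not occur in this report (literal, case-sensitive match)
    found <- vapply(phrases, function(p) grepl(p, txt, fixed = TRUE), logical(1))
    absent <- phrases[!found]
    # Nothing left to sample if every phrase already occurs
    if (length(absent) == 0) return(NA_character_)
    # Random count between 1 and the number of absent phrases, without repeats
    n <- sample.int(length(absent), 1)
    paste(sample(absent, n), collapse = ", ")
  }, character(1), USE.NAMES = FALSE)
}

df <- data.frame(
  report = c(
    "Biopsies of small bowel mucosa including Brunner's glands",
    "These are fragments of small bowel mucosa which include Brunner's glands ",
    "These are fragments of small bowel mucosa which include Brunner's glands There is no evidence of coeliac disease in these biopsies",
    "There is coeliac disease here. ",
    "Biopsies of specialisd gastric mucosa with moderate acute and active inflammation.",
    "These are fragments of small bowel mucosa. The small bowel fragments are within normal limits"
  ),
  stringsAsFactors = FALSE
)
input_list <- c("active inflammation", "coeliac disease", "Brunner's glands")

set.seed(1)  # only to make the random draw reproducible
df$new <- sample_absent(df$report, input_list)
print(df$new)
```

Sampling without replacement keeps a phrase from appearing twice in the same cell; rows that already contain every phrase get NA instead of raising an error.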

Upvotes: 1
