stefan485

Reputation: 39

How to extract the cell in R that is directly below the one matched by a regex

My question is about coding in R. I have the following toy data frame:

df <- data.frame(a=c(1:6), b=c("apple", "orange 1", "xxx", "lemon", "orange 2", "yyy"))

Goal: I would like to create a new variable "c" which has the values "xxx" and "yyy" in the 3rd and 6th row, respectively.

Caveat: I cannot match on "xxx" and "yyy" directly, because that is impossible in my real data. My idea is to use a regex to match "orange" and then extract the value from the following row.

I have tried:

library(dplyr)
library(tidyr)
regx <- "^orange\\s\\d+[\r\n]+(.*)"
df <- df %>%
  extract(b, "c", regx, remove=FALSE)

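But it does not work. I guess this is because rows of a data frame are not separated by newline or carriage-return characters: each cell is a separate string, so the regex cannot reach into the next row.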

The idea would be the following: I would like to detect the rows which include "orange", i.e.:

df[grepl("^orange\\s\\d+", df$b), ]

Then I would take those row numbers and tell R to extract the rows that follow them to create the new variable "c".

My actual task is more difficult still:

In the next step I have to extract all rows between "orange 1", "orange 2", ..., "orange 10" and create a new variable in the same way as before.

Upvotes: 1

Views: 153

Answers (1)

akrun

Reputation: 887241

We can use str_detect to get a logical vector marking the 'orange' elements of column 'b', shift that vector down by one row with lag, and then use it in case_when to return the value of 'b' for those rows and NA everywhere else:

library(dplyr)
library(stringr)
df %>%
    mutate(c = case_when(lag(str_detect(b,  "^orange\\s\\d+$"),
         default = FALSE) ~ as.character(b), TRUE ~ NA_character_))
# a        b    c
#1 1    apple <NA>
#2 2 orange 1 <NA>
#3 3      xxx  xxx
#4 4    lemon <NA>
#5 5 orange 2 <NA>
#6 6      yyy  yyy
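
To make the role of lag easier to see, here is a small sketch that builds the intermediate vectors one at a time. The names is_orange, after_orange and c_new are only illustrative, and ifelse stands in for case_when to keep the example short.

```r
library(dplyr)
library(stringr)

# Same values as column 'b' in the question
b <- c("apple", "orange 1", "xxx", "lemon", "orange 2", "yyy")

# TRUE where the element itself is an "orange <number>" marker
is_orange <- str_detect(b, "^orange\\s\\d+$")
# FALSE  TRUE FALSE FALSE  TRUE FALSE

# Shift down by one: TRUE where the *previous* element was a marker
after_orange <- lag(is_orange, default = FALSE)
# FALSE FALSE  TRUE FALSE FALSE  TRUE

# Keep the value only in rows that directly follow a marker
c_new <- ifelse(after_orange, b, NA_character_)
# NA NA "xxx" NA NA "yyy"
```

The default = FALSE argument matters for the first row, which has no predecessor and would otherwise get NA from lag.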

Or in base R

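i1 <- grep("^orange\\s*\\d+$", df$b) + 1
i1 <- i1[i1 <= nrow(df)]   # drop the index if an 'orange' sits in the last row
df$c <- NA_character_      # create the full-length column first
df$c[i1] <- as.character(df$b[i1])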
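
For the follow-up task in the question (all rows between consecutive "orange" markers), one possible base R sketch uses cumsum over the marker vector to number the blocks. The data frame df2 and the column name "after" are hypothetical, made up only to have more than one row between markers. The sketch also assumes that rows after the last marker do not count as "between" and should stay NA.

```r
# Hypothetical data with several rows between the markers
df2 <- data.frame(
  a = 1:8,
  b = c("apple", "orange 1", "xxx", "lemon",
        "orange 2", "yyy", "zzz", "orange 3"),
  stringsAsFactors = FALSE
)

is_marker <- grepl("^orange\\s\\d+$", df2$b)
block <- cumsum(is_marker)  # 0 before the first marker, then 1, 2, ...

# Rows strictly between two consecutive markers
# (assumption: rows after the last marker are excluded)
between <- !is_marker & block >= 1 & block < max(block)

df2$c <- NA_character_
df2$c[between] <- df2$b[between]

# Optional: record which marker each extracted row follows
marker_labels <- df2$b[is_marker]
df2$after <- NA_character_
df2$after[between] <- marker_labels[block[between]]
```

Here rows 3 and 4 are assigned to "orange 1" and rows 6 and 7 to "orange 2". If the rows after the last marker should be kept as well, dropping the block < max(block) condition would be one way to do it.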

Upvotes: 1
