Reputation: 39
My question is about coding in R. I have the following toy data frame:
df <- data.frame(a=c(1:6), b=c("apple", "orange 1", "xxx", "lemon", "orange 2", "yyy"))
Goal: I would like to create a new variable "c" which has the values "xxx" and "yyy" in the 3rd and 6th row, respectively.
Caveat: I cannot match on "xxx" and "yyy" directly, because that is impossible in my real data. My idea is to use a regex to match on "orange" and then extract the value from the subsequent row.
I have tried:
library(dplyr)
library(tidyr)
regx <- "^orange\\s\\d+[\r\n]+(.*)"
df <- df %>%
  extract(b, "c", regx, remove = FALSE)
But it does not work. I guess this is because the rows of a data frame are not separated by newline or carriage return characters, so the regex only ever sees one element of b at a time.
The idea would be the following: I would like to detect the rows which include "orange", i.e.:
df[grepl("^orange\\s\\d+", df$b), ]
Then I would take the row numbers and tell R to extract the subsequent rows to create the new variable "c".
To make it more complicated, my real task is actually harder:
In the next step I have to extract all rows between "orange 1", "orange 2", ..., "orange 10" and create a new variable in the same way as before.
Upvotes: 1
Views: 153
Reputation: 887241
We can use str_detect to find the 'orange' elements in column 'b' as a logical vector, take the lag of that vector so that it flags the row after each match, and use it in case_when to return the value of 'b' for flagged rows and NA otherwise.
library(dplyr)
library(stringr)
df %>%
  mutate(c = case_when(
    lag(str_detect(b, "^orange\\s\\d+$"), default = FALSE) ~ as.character(b),
    TRUE ~ NA_character_
  ))
#   a        b    c
# 1 1    apple <NA>
# 2 2 orange 1 <NA>
# 3 3      xxx  xxx
# 4 4    lemon <NA>
# 5 5 orange 2 <NA>
# 6 6      yyy  yyy
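To see why shifting the logical vector works, it may help to look at the intermediate steps. The sketch below uses base R stand-ins (grepl in place of str_detect, and a manual shift in place of lag), so it runs without any packages; the names is_orange and prev_is_orange are only illustrative.

```r
# Same values as column 'b' in the question
b <- c("apple", "orange 1", "xxx", "lemon", "orange 2", "yyy")

# Step 1: which elements are an "orange <number>" marker?
is_orange <- grepl("^orange\\s\\d+$", b)
# FALSE  TRUE FALSE FALSE  TRUE FALSE

# Step 2: shift by one position, so TRUE now marks the row AFTER a marker.
# This mimics dplyr::lag(is_orange, default = FALSE).
prev_is_orange <- c(FALSE, head(is_orange, -1))
# FALSE FALSE  TRUE FALSE FALSE  TRUE

# Step 3: keep 'b' where the previous row was a marker, NA elsewhere
c_new <- ifelse(prev_is_orange, b, NA_character_)
# NA NA "xxx" NA NA "yyy"
```

Because the shift pads with FALSE at the front and drops the last element, a marker in the very last row is simply ignored rather than causing an error.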
Or in base R
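df$c <- NA_character_                    # start with a full-length column of NA
i1 <- grep("^orange\\s*\\d+$", df$b) + 1  # rows directly after each match
i1 <- i1[i1 <= nrow(df)]                 # ignore a match in the last row
df$c[i1] <- as.character(df$b[i1])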
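For the follow-up task in the question (all rows between consecutive "orange N" markers), one possible approach is to number the blocks with cumsum. This is only a sketch under assumptions: "between" is taken to mean rows after one marker and before the next one, rows before the first and after the last marker are left as NA, and the data frame fruit and the columns block and c are made up for illustration.

```r
# Hypothetical example data with several rows between markers
fruit <- data.frame(
  a = 1:8,
  b = c("apple", "orange 1", "xxx", "lemon", "orange 2", "yyy", "zzz", "orange 3"),
  stringsAsFactors = FALSE
)

# TRUE for the marker rows
is_marker <- grepl("^orange\\s*\\d+$", fruit$b)

# Running count of markers seen so far: 0 before the first marker, then 1, 2, ...
block <- cumsum(is_marker)

# A row lies between two markers if it is not a marker itself,
# comes after the first marker and before the last one
between <- !is_marker & block >= 1 & block < sum(is_marker)

fruit$block <- ifelse(between, block, NA_integer_)
fruit$c     <- ifelse(between, fruit$b, NA_character_)

# Optionally collapse each block into a single string
collapsed <- tapply(fruit$b[between], block[between], paste, collapse = "; ")
```

Here fruit$c holds "xxx" and "lemon" for the first block and "yyy" and "zzz" for the second, while collapsed has one element per block. If the rows after the last marker should also be kept, dropping the `block < sum(is_marker)` condition would do that.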
Upvotes: 1