Michael
Michael

Reputation: 345

Matching numbers to characters separated by a space in R

I have a data frame like so:

| X | Y 
----------
| 8  | 8 13
| 1  | 8 13  
| 5  | 8 13 
| 13 | 8 13
| 3  | 8 13
| 14 | 7
| 14 | 

Col X is in integer format. Col Y is in character format, the numbers are separated by a space. This column can have one number (single or double digit), up to 20 numbers, all separated by a space, or it can be blank.

I wish to match the number in Col X with any of the number in Col Y, like so:

| X  | Y    | FLAG
---------------
| 8  | 8 13 | Yes <-- 8 is a match
| 1  | 8 13 | No <-- Even though 1 appears in Col Y, it is technically number 13 
| 5  | 8 13 | No
| 13 | 8 13 | Yes
| 3  | 8 13 | No <-- Even though 3 appears in Col Y, it is technically number 13 
| 14 | 7    | No
| 14 |      | No

I tried to unlist the numbers like so:

as.numeric(unlist(str_extract_all(data$colY, "[\\.0-9e-]+")

But this only separates the numbers. How can this be achieved?

Upvotes: 2

Views: 410

Answers (2)

akrun
akrun

Reputation: 887291

We could use str_detect which is Vectorized over the 'string' and 'pattern'. Inorder to make sure no partial matches, paste word boundary (\\b) on the 'X' column

library(stringr)
library(dplyr)
df1 %>%
    mutate(FLAG = c("No", "Yes")[1 + str_detect(Y, str_c("\\b", X, "\\b"))])

-output

#   X    Y FLAG
#1  8 8 13  Yes
#2  1 8 13   No
#3  5 8 13   No
#4 13 8 13  Yes
#5  3 8 13   No
#6 14    7   No
#7 14        No

In base R, we could use read.table to read as numeric columns and then with rowSums create a logical vector

c("No", "Yes")[1 + (rowSums(read.table(text = df1$Y, header = FALSE,
     fill = TRUE) == df1$X, na.rm = TRUE) > 0)]

data

df1 <- structure(list(X = c(8, 1, 5, 13, 3, 14, 14), Y = c("8 13", "8 13", 
"8 13", "8 13", "8 13", "7", "")), class = "data.frame", row.names = c(NA, 
-7L))

Upvotes: 2

thelatemail
thelatemail

Reputation: 93908

strsplit then check if the value is in the corresponding list. Let's remake the data first:

dat <- data.frame(X=c(8L, 1L, 5L, 13L, 3L, 14L),
                  Y=c("8 13", "8 13", "8 13", "8 13", "8 13", "7"), stringsAsFactors=FALSE)

Then do the comparison:

mapply(`%in%`, dat$X, strsplit(dat$Y, "\\s+"))
#[1]  TRUE FALSE FALSE  TRUE FALSE FALSE

Upvotes: 3

Related Questions