Reputation: 53

Identify partial string matches

I have a column in the df which has values like this:

and so on..

I would like to identify all values in bold, i.e. all values in the column that have any character after the underscore and at the same time appear without the underscore. I have tried using gsub to get a list of values in a separate dataframe but that still does not solve the problem. Any help will be appreciated!

Upvotes: 1

Answers (3)

Ronak Shah

Reputation: 389265

Base R solution :

First remove everything after underscore so that we can compare similar strings.

x <- c('92030534-12835', '92030534-12835_2', '92030534-12835_3', '13212854-14382', '13668582-14232', '93265773-15302', '93265773-15302_2')
df <- data.frame(x)
df$y <- sub('_.*', '', df$x)
df
#                 x              y
#1   92030534-12835 92030534-12835
#2 92030534-12835_2 92030534-12835
#3 92030534-12835_3 92030534-12835
#4   13212854-14382 13212854-14382
#5   13668582-14232 13668582-14232
#6   93265773-15302 93265773-15302
#7 93265773-15302_2 93265773-15302

Using duplicated would identify all the strings that repeat.

duplicated(df$y) | duplicated(df$y, fromLast = TRUE)
[1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE

You can then subset those rows.

df[duplicated(df$y) | duplicated(df$y, fromLast = TRUE), ]

#                 x              y
#1   92030534-12835 92030534-12835
#2 92030534-12835_2 92030534-12835
#3 92030534-12835_3 92030534-12835
#6   93265773-15302 93265773-15302
#7 93265773-15302_2 93265773-15302

Or add them as a new column

df$z <- duplicated(df$y) | duplicated(df$y, fromLast = TRUE)

Upvotes: 1

jav

Reputation: 1495

Here's a data.table approach

library(data.table)

# Mimic your dataset
dat = data.frame(`Claim Number` = c("1-12835", "1-12835_2", 
"1-12835_3", "1-12835_4", "2", "3", "4", "5", "6-15302", 
"6-15302_2", "7", "8", "9-16186", "9-16186_2"))

# Set the data.frame to data.table
setDT(dat)

# Get the "parent" claim number by removing any characters after the underscore
dat[, parent_claim_number := gsub("_.*", "", Claim.Number)]

# Add an indicator for any parent claim numbers with "sub" claims
dat[, has_sub_claim := any(grepl("_", Claim.Number)), by = .(parent_claim_number)]

Result is:

   Claim.Number parent_claim_number has_sub_claim
 1:      1-12835             1-12835          TRUE
 2:    1-12835_2             1-12835          TRUE
 3:    1-12835_3             1-12835          TRUE
 4:    1-12835_4             1-12835          TRUE
 5:            2                   2         FALSE
 6:            3                   3         FALSE
 7:            4                   4         FALSE
 8:            5                   5         FALSE
 9:      6-15302             6-15302          TRUE
10:    6-15302_2             6-15302          TRUE
11:            7                   7         FALSE
12:            8                   8         FALSE
13:      9-16186             9-16186          TRUE
14:    9-16186_2             9-16186          TRUE

If you want claims with a sub claim, you can do:

dat[has_sub_claim == TRUE]

If you want only the sub-claims without the parent claim, you can do:

dat[has_sub_claim == TRUE & grepl("_", Claim.Number)]

Upvotes: 0

Istrel

Reputation: 2588

Here is a dirty solution.

# Prepare some data
dat <- c("1-1", "1-1_2", "1-2", "1-1_3", "1-3", "1-4_1")

# Get splited strings
dat_mask <- sapply(dat, function(x) {
    string_l <- strsplit(x, "_")[[1]]
    # Here the string_l has length 2 if the initial string contains "_"
    return(c(string_l[1], length(string_l) > 1))
})

# Get parts which came from a string with underscore
str_parts <- unique(dat_mask[1, ][as.logical(dat_mask[2, ])])

# Get indexes of selected strings
str_ind <- dat_mask[1, ] %in% str_parts

# Get values form initial data
dat[str_ind]

# [1] "1-1"   "1-1_2" "1-1_3" "1-4_1"

Upvotes: 0

Identify partial string matches

Answers (3)

Related Questions