Reputation: 53
I have a column in the df which has values like this:
and so on..
I would like to identify all values in bold, i.e. all values in the column that have any character after the underscore and at the same time appear without the underscore. I have tried using gsub to get a list of values in a separate dataframe but that still does not solve the problem. Any help will be appreciated!
Upvotes: 1
Views: 70
Reputation: 389265
Base R solution :
First remove everything after underscore so that we can compare similar strings.
x <- c('92030534-12835', '92030534-12835_2', '92030534-12835_3', '13212854-14382', '13668582-14232', '93265773-15302', '93265773-15302_2')
df <- data.frame(x)
df$y <- sub('_.*', '', df$x)
df
# x y
#1 92030534-12835 92030534-12835
#2 92030534-12835_2 92030534-12835
#3 92030534-12835_3 92030534-12835
#4 13212854-14382 13212854-14382
#5 13668582-14232 13668582-14232
#6 93265773-15302 93265773-15302
#7 93265773-15302_2 93265773-15302
Using duplicated
would identify all the strings that repeat.
duplicated(df$y) | duplicated(df$y, fromLast = TRUE)
[1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE
You can then subset those rows.
df[duplicated(df$y) | duplicated(df$y, fromLast = TRUE), ]
# x y
#1 92030534-12835 92030534-12835
#2 92030534-12835_2 92030534-12835
#3 92030534-12835_3 92030534-12835
#6 93265773-15302 93265773-15302
#7 93265773-15302_2 93265773-15302
Or add them as a new column
df$z <- duplicated(df$y) | duplicated(df$y, fromLast = TRUE)
Upvotes: 1
Reputation: 1495
Here's a data.table
approach
library(data.table)
# Mimic your dataset
dat = data.frame(`Claim Number` = c("1-12835", "1-12835_2",
"1-12835_3", "1-12835_4", "2", "3", "4", "5", "6-15302",
"6-15302_2", "7", "8", "9-16186", "9-16186_2"))
# Set the data.frame to data.table
setDT(dat)
# Get the "parent" claim number by removing any characters after the underscore
dat[, parent_claim_number := gsub("_.*", "", Claim.Number)]
# Add an indicator for any parent claim numbers with "sub" claims
dat[, has_sub_claim := any(grepl("_", Claim.Number)), by = .(parent_claim_number)]
Result is:
Claim.Number parent_claim_number has_sub_claim
1: 1-12835 1-12835 TRUE
2: 1-12835_2 1-12835 TRUE
3: 1-12835_3 1-12835 TRUE
4: 1-12835_4 1-12835 TRUE
5: 2 2 FALSE
6: 3 3 FALSE
7: 4 4 FALSE
8: 5 5 FALSE
9: 6-15302 6-15302 TRUE
10: 6-15302_2 6-15302 TRUE
11: 7 7 FALSE
12: 8 8 FALSE
13: 9-16186 9-16186 TRUE
14: 9-16186_2 9-16186 TRUE
If you want claims with a sub claim, you can do:
dat[has_sub_claim == TRUE]
If you want only the sub-claims without the parent claim, you can do:
dat[has_sub_claim == TRUE & grepl("_", Claim.Number)]
Upvotes: 0
Reputation: 2588
Here is a dirty solution.
# Prepare some data
dat <- c("1-1", "1-1_2", "1-2", "1-1_3", "1-3", "1-4_1")
# Get splited strings
dat_mask <- sapply(dat, function(x) {
string_l <- strsplit(x, "_")[[1]]
# Here the string_l has length 2 if the initial string contains "_"
return(c(string_l[1], length(string_l) > 1))
})
# Get parts which came from a string with underscore
str_parts <- unique(dat_mask[1, ][as.logical(dat_mask[2, ])])
# Get indexes of selected strings
str_ind <- dat_mask[1, ] %in% str_parts
# Get values form initial data
dat[str_ind]
# [1] "1-1" "1-1_2" "1-1_3" "1-4_1"
Upvotes: 0