user21390049

Reputation: 129

How to find unique characters in both forward and backward order in R

I have a list of characters like this:

list <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c','c_b')

I want the unique strings, treating a pair and its reverse as the same, so that 'b_a' and 'c_b' are dropped. I have tried unique(), but it does not remove 'b_a' and 'c_b'. Any help would be appreciated. Many thanks!

Upvotes: 5

Views: 138

Answers (4)

Maël

Reputation: 52319

Another option would be to sort characters in each string of your list first, and remove duplicated entries:

l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c','c_b')

l[!duplicated(Tmisc::strSort(l))]
#[1] "a_b" "a_c" "a_d" "a_e" "b_c"

Yet another way to do it, using the base R utf8ToInt to sort strings:

l[!duplicated(lapply(l, \(x) sort(utf8ToInt(x))))]
#[1] "a_b" "a_c" "a_d" "a_e" "b_c"
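To see why sorting the characters works, it can help to look at the intermediate keys. The following is a minimal base R sketch; the helper name `char_key` is hypothetical and not part of the answer above. It assumes single-character tokens on each side of the underscore: with longer tokens, sorting individual characters could conflate distinct pairs such as 'ab_c' and 'a_bc'.

```r
l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c', 'c_b')

# Hypothetical helper: sort the code points of one string and rebuild it.
char_key <- function(x) intToUtf8(sort(utf8ToInt(x)))

keys <- vapply(l, char_key, character(1), USE.NAMES = FALSE)

# 'a_b' and 'b_a' collapse to the same key, so duplicated() flags the later one.
keys[1] == keys[6]

# Keep the first string seen for each key.
l[!duplicated(keys)]
```

Because the key is only used for detecting duplicates, the returned values are the original strings in their first-seen form.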

Upvotes: 6

ThomasIsCoding

Reputation: 102529

Borrowing data from @DaveArmstrong's solution, you can try

  • Option 1
with(
    read.table(text = l, sep = "_"),
    unique(paste(pmin(V1, V2), pmax(V1, V2), sep = "_"))
)
  • Option 2
idx <- seq_along(l) < match(l, sub("(\\w+)_(\\w+)", "\\2_\\1", l))
unique(l[replace(idx, is.na(idx), TRUE)])

which gives

[1] "a_b" "a_c" "a_d" "a_e" "b_c"
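Option 2 is fairly dense, so here is a sketch that spells out its intermediate steps. The variable names (`reversed`, `first_reverse`, `keep`) are introduced here for illustration only; the logic is intended to match the two lines above rather than replace them.

```r
l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c', 'c_b')

# Swap the two sides of every string: 'a_b' becomes 'b_a'.
reversed <- sub("(\\w+)_(\\w+)", "\\2_\\1", l)

# For each element, the position of the first element that is its reverse
# (NA when the vector contains no reversed counterpart).
first_reverse <- match(l, reversed)

# Keep an element if it has no reverse, or if it appears before its reverse.
keep <- is.na(first_reverse) | seq_along(l) < first_reverse

unique(l[keep])
```

One consequence of this design, as far as I can tell: a string that equals its own reverse, such as 'a_a', matches itself at its own position, so the strict `<` comparison drops it. That does not matter for the data in the question, but it may be worth checking if such values can occur.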

Upvotes: 4

SamR

Reputation: 20494

This is overkill for this simple example, but conceptually I would think about this as an undirected graph. We can use strcapture() to create a data frame from your vector l, and use igraph::graph_from_data_frame() to construct the graph:

library(igraph)
g <- strcapture("(.+)_(.+)", l, data.frame(x = character(), y = character())) |>
    graph_from_data_frame(directed = FALSE) |>
    simplify() # remove duplicate edges

If we plot(g) we'll see something like:

[Image: undirected graph]

We can then extract the edges and paste() them together:

d <- as_data_frame(g, what="edges")
paste0(d$from, "_", d$to)
# [1] "a_b" "a_c" "a_d" "a_e" "b_c"

Upvotes: 6

DaveArmstrong

Reputation: 22034

You could use strsplit() to split the two characters apart, then sort them in alphabetical order and paste them back together. That will turn "b_a" into "a_b". Then you could get the unique values of the sorted strings.

l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c','c_b')

ll <- strsplit(l, "_")
ll <- sapply(ll, \(x) paste(sort(x), collapse = "_"))
unique(ll)
#> [1] "a_b" "a_c" "a_d" "a_e" "b_c"

Created on 2025-02-05 with reprex v2.1.1
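If you would rather keep each pair in the form in which it first appears (instead of the sorted form), the same idea can be wrapped in a small function. This is only a sketch: `unique_unordered` and its `sep` argument are hypothetical names, and it assumes the separator is a literal string rather than a regular expression.

```r
# Hypothetical helper: drop strings whose parts, ignoring order,
# have already been seen earlier in the vector.
unique_unordered <- function(x, sep = "_") {
  parts <- strsplit(x, sep, fixed = TRUE)
  # Order-independent key: sorted parts pasted back together.
  key <- vapply(parts, function(p) paste(sort(p), collapse = sep), character(1))
  x[!duplicated(key)]
}

l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c', 'c_b')
unique_unordered(l)
# [1] "a_b" "a_c" "a_d" "a_e" "b_c"

# When the reversed form comes first, it is the one that is kept.
unique_unordered(c('b_a', 'a_b', 'a_c', 'c_a', 'a_c'))
# [1] "b_a" "a_c"
```

Since the parts are split on the separator rather than character by character, this should also work for tokens longer than one character.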

Upvotes: 9
