Hendrik

Reputation: 1168

Extracting unique value sequences from DF column with R

I have the following data frame:

Col1 Col2
1    A
1    B
1    C
2    A
2    B
2    C
3    D
3    B
3    C
3    F
4    A
4    B
4    C

I'd like to extract unique sequence vectors (bus line stop sequences) from Col2 (the actual stops of a particular bus route) in R, where each sequence is defined by Col1 (the respective bus route ID). Multiple occurrences of identical sequences are unimportant. So the desired outputs are:

A, B, C (in cases of Col1=1, 2 and 4) and D, B, C, F (in case of Col1=3)

Upvotes: 0

Views: 291

Answers (2)

User2321

Reputation: 3062

As I understand your question, you want the unique sequences for each Col1 ID. To test this, I changed your data slightly so that route 1 appears in several separate runs, one of them with a different sequence (A, F, C), and I used the data.table package. You could try the following:

library(data.table)
df <- fread('Col1 Col2
              1    A
              1    B
              1    C
              2    A
              2    B
              2    C
              1    A
              1    B
              1    C
              3    D
              3    B
              3    C
              3    F
              1    A
              1    F
              1    C
              4    A
              4    B
              4    C')

In your case, if your data frame is called df, just call setDT(df) to convert it to a data.table. Then select the unique sequences in Col2 with:

df[, .(list(Col2), Col1), by = rleid(Col1)][,.(Sequence = unique(V1)), by = Col1]

Which gives:

    Col1 Sequence
1:    1    A,B,C
2:    1    A,F,C
3:    2    A,B,C
4:    3  D,B,C,F
5:    4    A,B,C

The command works as follows: first, for every run of consecutive identical IDs in Col1 (identified with the rleid function), I collect the sequence in Col2. Then I select the unique sequences for each Col1 value.
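
To make the run-identification step more concrete, here is a rough base-R sketch of the same idea. It is not the data.table code above: it assumes that rle() can stand in for rleid(), and the names col1, col2, run_id, seqs, route and keys are made up for illustration.

```r
# Test data from this answer: route 1 appears in three separate runs.
col1 <- c(1, 1, 1, 2, 2, 2, 1, 1, 1, 3, 3, 3, 3, 1, 1, 1, 4, 4, 4)
col2 <- c("A", "B", "C", "A", "B", "C", "A", "B", "C",
          "D", "B", "C", "F", "A", "F", "C", "A", "B", "C")

# Base-R stand-in for rleid(): number each run of consecutive identical IDs.
runs   <- rle(col1)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# One stop sequence per run, plus the route ID that each run belongs to.
seqs  <- unname(split(col2, run_id))
route <- runs$values

# Keep each distinct (route, sequence) pair once, in order of first appearance.
keys <- paste(route, sapply(seqs, toString), sep = ": ")
res  <- unique(keys)
print(res)
```

Because the runs are numbered separately, the second and third appearances of route 1 are treated as sequences of their own rather than being merged into one long vector, which is why A, F, C shows up as a distinct sequence for route 1.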

Upvotes: 0

mpjdem

Reputation: 1544

You could split up the vector of bus stops according to the vector of route IDs. This will return a list of character vectors, on which you can call unique to remove the duplicated vectors (keeping the first occurrence).

Calling toString on each of these vectors through sapply will then convert the list of vectors to a vector of comma-separated strings.

res <- sapply(unique(split(df$Col2, df$Col1)), toString)
print(res)
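
For anyone who wants to see the intermediate results, the one-liner can be unpacked as in the sketch below. The data frame is rebuilt by hand from the question so the snippet runs on its own, and by_route, distinct and first_seen are illustrative names only.

```r
# The question's data, with Col2 kept as character rather than factor.
df <- data.frame(
  Col1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  Col2 = c("A", "B", "C", "A", "B", "C", "D", "B", "C", "F", "A", "B", "C"),
  stringsAsFactors = FALSE
)

by_route <- split(df$Col2, df$Col1)  # named list: one character vector per route ID
distinct <- unique(by_route)         # repeated vectors removed, first occurrence kept
res <- sapply(distinct, toString)    # one comma-separated string per sequence
print(res)
# [1] "A, B, C"    "D, B, C, F"

# Variant that keeps the ID of the first route showing each sequence.
first_seen <- sapply(by_route[!duplicated(by_route)], toString)
```

Note that split groups rows by the value of Col1 wherever they occur, so if a route ID appeared in several separate blocks, all of its stops would end up in a single vector.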

Upvotes: 2
