ZhouW

Reputation: 1207

Count number of unique values before a certain pattern?

I have a column in a data frame df$moves which looks like this:

W1.e4 B1.d5 W2.c4 B2.e6 W3.Nc3 B3.Nf6 W4.cxd5 B4.exd5 W5.Bg5 
W1.e4 B1.d5 W2.exd5 B2.Qxd5 W3.Nc3 B3.Qa5 W4.d4 B4.Nf6 W5.Nf3 B5.c6 W6.Ne5 B6.Bf5 
W1.e4 B1.e5 W2.Nf3 B2.Nc6 W3.Bc4
W1.e4 B1.e5 W2.Nf3 B2.Nf6
W1.e4 B1.c5 W2.Nf3

I want to count the rows whose text before the string "W2." is unique, i.e. occurs in no other row. For the data above I'd expect the count to be 1 (the last row only), because up to "W2." row 1 is the same as row 2 and row 3 is the same as row 4.

How should this be done?

Upvotes: 0

Views: 65

Answers (2)

Jaap

Reputation: 83235

A possible approach is to extract the parts before W2:

# option 1:
vec <- substr(df$moves, 1, regexpr('W2\\.', df$moves) - 1)

# option 2:
vec <- sub('W2.*', '', df$moves)

and then see whether they are unique:

sum(!duplicated(vec) & !duplicated(vec, fromLast = TRUE))

which gives:

> sum(!duplicated(vec) & !duplicated(vec, fromLast = TRUE))
[1] 1

What this does:

  • regexpr('W2\\.', df$moves) returns, for each string, the position where "W2." first appears.
  • Subtract 1 from those positions and feed the result to substr: substr(df$moves, 1, regexpr('W2\\.', df$moves) - 1) then gets the part before "W2.".
  • An easier way to extract that part is to use sub instead of the substr/regexpr combination: sub('W2.*', '', df$moves) deletes everything from "W2" onwards.
  • !duplicated(vec) & !duplicated(vec, fromLast = TRUE) indicates which parts of vec are unique.
  • By wrapping that in sum you get the number of unique values before W2.
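The steps above can be traced on the sample data. This is only a sketch: the variable names (moves, pos, first_seen, last_seen, once, no_w2) are illustrative and not part of the original answer. The last lines show one difference between the two extraction options that may matter if some rows never reach "W2.".

```r
moves <- c("W1.e4 B1.d5 W2.c4 B2.e6 W3.Nc3 B3.Nf6 W4.cxd5 B4.exd5 W5.Bg5",
           "W1.e4 B1.d5 W2.exd5 B2.Qxd5 W3.Nc3 B3.Qa5 W4.d4 B4.Nf6 W5.Nf3 B5.c6 W6.Ne5 B6.Bf5",
           "W1.e4 B1.e5 W2.Nf3 B2.Nc6 W3.Bc4",
           "W1.e4 B1.e5 W2.Nf3 B2.Nf6",
           "W1.e4 B1.c5 W2.Nf3")

pos <- regexpr("W2\\.", moves)     # start of the first "W2." in each row
vec <- substr(moves, 1, pos - 1)   # text before it (the trailing space is kept)

first_seen <- !duplicated(vec)                   # FALSE for every later repeat
last_seen  <- !duplicated(vec, fromLast = TRUE)  # FALSE for every earlier repeat
once <- first_seen & last_seen                   # TRUE only if the value occurs exactly once

sum(once)    # 1
which(once)  # 5, i.e. the last row

# Hypothetical row that never reaches "W2.": regexpr() returns -1 here,
# so substr() yields "", whereas sub() leaves the string unchanged.
no_w2 <- "W1.e4 B1.c5"
by_substr <- substr(no_w2, 1, regexpr("W2\\.", no_w2) - 1)  # ""
by_sub    <- sub("W2.*", "", no_w2)                         # "W1.e4 B1.c5"
```

With the substr/regexpr option, all rows without "W2." would collapse into the same empty string and count as duplicates of each other, so the sub option is probably the safer choice for such data.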

If you want to count the number of distinct values instead of the values that appear only once, you can use either sum(!duplicated(vec)) or length(unique(vec)).
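To make the difference between the two counts concrete, here is a small sketch on the five prefixes from the sample data. The names distinct_count and singleton_count are illustrative, and table() is used here as an alternative to the duplicated() approach, not as the answer's own method.

```r
# Prefixes before "W2." for the five sample rows (trailing space kept, as sub() leaves it)
vec <- c("W1.e4 B1.d5 ", "W1.e4 B1.d5 ", "W1.e4 B1.e5 ", "W1.e4 B1.e5 ", "W1.e4 B1.c5 ")

distinct_count  <- length(unique(vec))   # different openings present: 3
singleton_count <- sum(table(vec) == 1)  # openings found in exactly one row: 1
```

The question asks for the second number (1); the first number (3) is what length(unique(vec)) reports.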


Used data:

df <- structure(list(moves = c("W1.e4 B1.d5 W2.c4 B2.e6 W3.Nc3 B3.Nf6 W4.cxd5 B4.exd5 W5.Bg5", 
                               "W1.e4 B1.d5 W2.exd5 B2.Qxd5 W3.Nc3 B3.Qa5 W4.d4 B4.Nf6 W5.Nf3 B5.c6 W6.Ne5 B6.Bf5", 
                               "W1.e4 B1.e5 W2.Nf3 B2.Nc6 W3.Bc4", "W1.e4 B1.e5 W2.Nf3 B2.Nf6", "W1.e4 B1.c5 W2.Nf3")), 
                .Names = "moves", class = "data.frame", row.names = c(NA, -5L))

Upvotes: 3

MKR

Reputation: 20095

Another option is strsplit with a look-ahead pattern, split = " (?=W2\\.)", which counts the distinct values before W2.:

length(unique(sapply(strsplit(df$Moves, split = " (?=W2\\.)", perl = TRUE),
                     function(x) x[1])))

#[1] 3

# where the unique values are:
unique(sapply(strsplit(df$Moves, split = " (?=W2\\.)", perl = TRUE),
              function(x) x[1]))
#[1] "W1.e4 B1.d5" "W1.e4 B1.e5" "W1.e4 B1.c5"

Regex:

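" (?=W2\\.)"  -- a space that is followed by W2. (the look-ahead checks for W2. without consuming it, so only the space is removed by the split)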
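A short sketch of what the look-ahead split returns may help. The two-element moves vector is made up for illustration (the second string deliberately has no "W2."), and vapply is used here in place of sapply only to fix the return type.

```r
moves <- c("W1.e4 B1.d5 W2.c4 B2.e6",  # contains "W2."
           "W1.e4 B1.c5")              # hypothetical row without "W2."

# Split at the space that precedes "W2."; the look-ahead leaves "W2." in the second piece
parts <- strsplit(moves, split = " (?=W2\\.)", perl = TRUE)

parts[[1]]  # "W1.e4 B1.d5" "W2.c4 B2.e6"
parts[[2]]  # "W1.e4 B1.c5" (no match, so the string is returned whole)

# First piece of every row = the part before "W2."
prefix <- vapply(parts, function(x) x[1], character(1))
```

Unlike the sub/substr variants, the prefix obtained this way has no trailing space, because the space itself is the split point.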

Data:

df <- read.table(text = 
"Moves
'W1.e4 B1.d5 W2.c4 B2.e6 W3.Nc3 B3.Nf6 W4.cxd5 B4.exd5 W5.Bg5'
'W1.e4 B1.d5 W2.exd5 B2.Qxd5 W3.Nc3 B3.Qa5 W4.d4 B4.Nf6 W5.Nf3 B5.c6 W6.Ne5 B6.Bf5' 
'W1.e4 B1.e5 W2.Nf3 B2.Nc6 W3.Bc4'
'W1.e4 B1.e5 W2.Nf3 B2.Nf6'
'W1.e4 B1.c5 W2.Nf3'",
header = TRUE, stringsAsFactors = FALSE)

Upvotes: 0
