ShanZhengYang
ShanZhengYang

Reputation: 17651

Counting unique values based on two columns with repeated rows, R data frame

I have an R data frame with the following format:

column1    column2
NA         NA
1          A
1          A
1          A
NA         NA
NA         NA
2          B
2          B
NA         NA
NA         NA
3          A
3          A
3          A

df = structure(list(column1 = c(NA, 1L, 1L, 1L, NA, NA, 2L, 2L, NA, 
NA, 3L, 3L, 3L), column2 = c(NA, "A", "A", "A", NA, NA, "B", 
"B", NA, NA, "A", "A", "A")), .Names = c("column1", "column2"
), row.names = c(NA, -13L), class = "data.frame")

If the row in one column has an NA, the other column has an NA. The numerical value in column1 describes a unique group, e.g. rows 2-4 have the group 1. The column column2 describes the identity of this grouping. In this data frame, the identity is either A, B, C, or D.

My goal is to tally the number of identities by group within the entire data frame: how many A groups there are, how many B groups, etc.

The correct output for this file (so far) is there are 2 A groups and 1 B group.

How would I calculate this?

At the moment, I would try something like this:

length(df[df$column2 == "B"]) ## outputs 2 

but this is incorrect. If I combined column1 and column2, took only unique values 1A, 2B, 3A, I guess I could count how many times each label from column2 occurs?

(If it's easier, I'm happy to use data.table for this task.)

Upvotes: 2

Views: 4030

Answers (3)

COLO
COLO

Reputation: 1114

If you want to use data.table:

library(data.table)
setDT(df)

d <- df[!is.na(column1), list(n=.N), by=list(column2,column1)]
d <- d[, list(n=.N), by=list(column2)]
d
   column2 n
1:       A 2
2:       B 1

Or more concisely as a one-liner:

setDT(df)[!is.na(column1), .N, by = .(column2, column1)][, .N, by = column2]

Upvotes: 4

Frank
Frank

Reputation: 66819

You can use rle for runs and table for tabulation:

table(rle(df$column2)$values)

# A B 
# 2 1 

See ?rle and ?table for details.

Or, if you want to take advantage of column1 (which is derived from column2):

table(unique(df)$column2)

Upvotes: 5

rsoren
rsoren

Reputation: 4216

The 'dplyr' package has simple functions for this

library(dplyr)

df %>%
  filter(complete.cases(.) & !duplicated(.)) %>% 
  group_by(column2) %>%
  summarize(count = n())
  1. Filter out rows with NA
  2. Filter out duplicated rows; these represent individuals in the same group
  3. Group by the identity variable (column2)
  4. Count the number of unique groups (column1)

Upvotes: 4

Related Questions