Reputation: 97

Index of non-unique element in data frame

How can I extract the column names (or the row and column indices) of the duplicated elements in the following data frame?

            V1         V2           V3           V4
PC1  0.5863431  0.5863431 3.952237e-01 3.952237e-01
PC2 -0.3952237 -0.3952237 5.863431e-01 5.863431e-01
PC3 -0.7071068  0.7071068 1.665335e-16 3.885781e-16

For example, 0.5863431 appears in both V1 and V2 (row PC1), so "V1" and "V2" are among the column names I want.

For that data frame I want to get:

[1] "V1" "V2" "V3" "V4"

As you can see, the first row alone is enough to give this result.

Second example:

            V1         V2          V3         V4
PC1 -0.5987139 -0.5987139 -0.03790446  0.5307039
PC2 -0.0189601 -0.0189601 -0.99315168 -0.1137136
PC3  0.3986891  0.3523926 -0.11045319  0.8394442

Result:

[1] "V1" "V2"

Upvotes: 6

Answers (3)

Retired Data Munger

Reputation: 1445

Whatever approach you use, be aware of R FAQ 7.31 when working with floating-point numbers. You may want to create a new matrix in which the values are rounded to the same number of digits: although they may look the same in the printout, they can differ in trailing digits that you don't see.
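
As a minimal sketch of that pitfall (the toy matrix and the choice of 7 significant digits are assumptions for illustration, not taken from the question), two values can print identically and still not count as duplicates until they are rounded:

```r
# Toy matrix: 0.1 + 0.2 and 0.3 print the same but differ in the trailing bits
m <- matrix(c(0.1 + 0.2, 1, 0.3, 2), nrow = 2,
            dimnames = list(c("PC1", "PC2"), c("V1", "V2")))
print(m)                      # both entries of row PC1 print as 0.3
m["PC1", "V1"] == m["PC1", "V2"]   # FALSE: exact comparison fails

# No duplicates are detected on the raw values
any(duplicated(c(m)))         # FALSE

# Round to a fixed number of significant digits before comparing.
# digits = 7 is an assumed tolerance; choose one that suits your data.
m_rounded <- signif(m, digits = 7)
any(duplicated(c(m_rounded))) # TRUE
```

`signif()` keeps significant digits, so very small values such as 1.665335e-16 and 3.885781e-16 stay distinct; `round()` with a number of decimal places would collapse both to 0 instead.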

Upvotes: 1

Rich Scriven

Reputation: 99341

There may be a better way, but here's my take on it.

## coerce to matrix (if not already)
m <- as.matrix(df)
## find duplicates across both margins
d <- duplicated(m, MARGIN = 0) | duplicated(m, MARGIN = 0, fromLast = TRUE)
## grab the unique col names
colnames(m)[unique(col(d)[d])]

Examples: On your first data frame -

df1 <- read.table(text = "V1         V2           V3           V4
PC1  0.5863431  0.5863431 3.952237e-01 3.952237e-01
PC2 -0.3952237 -0.3952237 5.863431e-01 5.863431e-01
PC3 -0.7071068  0.7071068 1.665335e-16 3.885781e-16", header = TRUE)

m1 <- as.matrix(df1)
d1 <- duplicated(m1, MARGIN = 0) | duplicated(m1, MARGIN = 0, fromLast = TRUE)
colnames(m1)[unique(col(d1)[d1])]
# [1] "V1" "V2" "V3" "V4"

And on the second -

df2 <- read.table(text = "V1         V2          V3         V4
PC1 -0.5987139 -0.5987139 -0.03790446  0.5307039
PC2 -0.0189601 -0.0189601 -0.99315168 -0.1137136
PC3  0.3986891  0.3523926 -0.11045319  0.8394442", header = TRUE)

m2 <- as.matrix(df2)
d2 <- duplicated(m2, MARGIN = 0) | duplicated(m2, MARGIN = 0, fromLast = TRUE)
colnames(m2)[unique(col(d2)[d2])]
# [1] "V1" "V2"
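
The question also asks about row and column indices. As a hedged sketch (not part of the original answer), the same kind of logical matrix can be passed to `which(..., arr.ind = TRUE)`. The matrix below retypes the second example by hand, and the names `idx` and `dup_table` are only illustrative:

```r
# Second example from the question, entered column by column
m <- matrix(c(-0.5987139, -0.0189601,  0.3986891,
              -0.5987139, -0.0189601,  0.3523926,
              -0.03790446, -0.99315168, -0.11045319,
               0.5307039, -0.1137136,  0.8394442),
            nrow = 3,
            dimnames = list(c("PC1", "PC2", "PC3"),
                            c("V1", "V2", "V3", "V4")))

# TRUE wherever a value occurs more than once anywhere in the matrix
d <- duplicated(m, MARGIN = 0) | duplicated(m, MARGIN = 0, fromLast = TRUE)

# One row per duplicated cell, with its row and column index
idx <- which(d, arr.ind = TRUE)

# Optional: attach the row names, column names and values
dup_table <- data.frame(row    = rownames(m)[idx[, "row"]],
                        column = colnames(m)[idx[, "col"]],
                        value  = m[idx])
dup_table
```

Here `idx` should list the four cells in rows PC1 and PC2 of columns V1 and V2.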

Side note: Since your data are entirely numeric, I would recommend starting with a matrix instead of a data frame.

Upvotes: 8

Jota

Reputation: 17611

A slightly different approach, using which and apply:

# convert to matrix
mat1 <- as.matrix(df1)
# find duplicates and store them
dups <- mat1[which(duplicated(c(mat1)))]
# identify columns containing a value in dups
names(which(apply(mat1, 2, function(x) any(x %in% dups))))
#[1] "V1" "V2" "V3" "V4"

mat2 <- as.matrix(df2)
dups <- mat2[which(duplicated(c(mat2)))]
names(which(apply(mat2, 2, function(x) any(x %in% dups))))
#[1] "V1" "V2"
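
If this is needed for several data frames, the steps above could be wrapped in a small function. This is only a sketch: `dup_colnames()` is a hypothetical name, not a base R function, and it compares values exactly, so the floating-point caveat from R FAQ 7.31 still applies.

```r
# dup_colnames() is a hypothetical helper, not part of base R
dup_colnames <- function(x) {
  m <- as.matrix(x)
  # values that occur more than once anywhere in the matrix
  dups <- unique(m[duplicated(c(m))])
  # flag the columns that contain at least one of those values
  keep <- apply(m, 2, function(col) any(col %in% dups))
  colnames(m)[keep]
}

# Second example from the question, rebuilt as a data frame
df2 <- data.frame(
  V1 = c(-0.5987139, -0.0189601,  0.3986891),
  V2 = c(-0.5987139, -0.0189601,  0.3523926),
  V3 = c(-0.03790446, -0.99315168, -0.11045319),
  V4 = c( 0.5307039, -0.1137136,  0.8394442),
  row.names = c("PC1", "PC2", "PC3")
)

dup_colnames(df2)
# [1] "V1" "V2"
```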

Upvotes: 3
