Reputation: 523

dplyr "Select" - Error: found duplicated column name

I am trying to extract columns from one DT into a new DT using dplyr's select():

extract_Data <- select(.data = master_merge, subjectID, activity_ID,
                           contains("mean\\(\\)"), contains("std\\(\\)"))

There are 563 columns so I am asking to extract the first and second column (subject, activity) and all other columns where mean() or std() is present.

There are NO duplicate columns that can be created here, so I am stumped as to why. I have tried every variation of select, but I always get Error: found duplicated column name.

How can I troubleshoot this? I have gone through all 563 column names and there are no duplicates.

Upvotes: 21

Answers (7)

Dan Chaltiel

Reputation: 8523

Based on Lantana's great answer, here is a function that wraps the fix so it can be used in a dplyr pipe:

validate.names = function(df){
  rtn = df
  valid_column_names = make.names(names=names(df), unique=TRUE, allow_ = TRUE)
  names(rtn) = valid_column_names
  rtn
}

You can then use it like this:

extract_Data %>% validate.names

Upvotes: 1

usr0192

Reputation: 491

I was puzzled by the same error. Avoid using select. If meanStdcolumns is the list of columns containing mean or std (which you can get using grep), then master_merge[,meanStdcolumns] seems to work.

Upvotes: -1

laurent

Reputation: 129

Here is the solution I have found:

data <- data[ , !duplicated(colnames(data))]

This keeps the first column for each name and drops any later column whose name has already appeared.
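A minimal sketch of what that line does, using a made-up three-column data frame (the names `a`, `a`, `b` and the values are assumptions for illustration only). Note that the dropped column's data is discarded even if it differs from the column that is kept.

```r
# Toy data frame with a repeated column name (illustrative values only)
data <- data.frame(1, 2, 3)
names(data) <- c("a", "a", "b")

# duplicated() flags the second and later occurrences of each name
keep <- !duplicated(colnames(data))

# Keep only the first column for each name
deduped <- data[, keep]
```

Here the second `a` column (holding 2) is the one that goes away, so it is worth checking that the duplicates really carry the same data before dropping them.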

Hope it helps.

Upvotes: 12

Wil

Reputation: 31

Not a direct answer, but this will help a lot of people.

For all you Coursera students facing this problem with this dataset: there are duplicate column names. For example, 'fBodyAccJerk-bandsEnergy()-1,16' is found twice. Check:

your_merged_data_with_column_names[,400:420]

I'd love to show the output, but my browser won't support the 'code' button nor the ctrl-K shortcut and there's too much data to indent by hand. Try this code for yourself and carefully check the 'Variables not shown'!
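Rather than reading the duplicates off the printed columns, they can also be listed directly. This is a small sketch in base R; the toy data frame and every column name except the repeated one quoted above are assumptions for illustration.

```r
# Toy stand-in for the merged data; only the repeated name comes from
# the dataset discussed above, the other names are made up.
df <- data.frame(1, 2, 3, 4)
names(df) <- c("subjectID",
               "fBodyAccJerk-bandsEnergy()-1,16",
               "tBodyAcc-mean()-X",
               "fBodyAccJerk-bandsEnergy()-1,16")

# Names that occur more than once
dup_names <- unique(names(df)[duplicated(names(df))])

# Positions of every column carrying one of those names
dup_positions <- which(names(df) %in% dup_names)
```

Running the same two lines on your own merged data frame should show which names repeat and where.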

I am working on a solution right now myself, possibly using the above answers, or the course forum.

Upvotes: 1

Vare Vadal

Reputation: 1

Before you assign the column names, filter the columns by getting a vector of indices using

meanStdColumns <- grep("mean|std", features$V2, value = FALSE)

and then get the matching column names to assign using

meanStdColumnsNames <- grep("mean|std", features$V2, value = TRUE)

Upvotes: -3

Lantana

Reputation: 496

The root of the problem is invalid characters in the original column names. The discussion in "Variable Name Restrictions in R" applies to column names, too. Try forcing unique column names with valid characters, using make.names().

valid_column_names <- make.names(names=names(master_merge), unique=TRUE, allow_ = TRUE)
names(master_merge) <- valid_column_names
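To see what this does to the names, here is a small sketch on a hypothetical name vector (the vector `original` is an assumption for illustration; only the `bandsEnergy` name is taken from this thread). Invalid characters such as `-`, `(`, `)` and `,` become dots, and repeated names get a numeric suffix.

```r
# Hypothetical column names, including one repeated name
original <- c("subjectID",
              "tBodyAcc-mean()-X",
              "fBodyAccJerk-bandsEnergy()-1,16",
              "fBodyAccJerk-bandsEnergy()-1,16")

# Replace invalid characters with "." and make repeated names unique
cleaned <- make.names(names = original, unique = TRUE)
```

Because the parentheses are replaced by dots, a later pattern such as `mean\\(\\)` no longer matches the cleaned names; the pattern has to target the dotted form instead.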

Upvotes: 34

bergant

Reputation: 7232

Duplicate names among columns that the filter does not match can still cause the "duplicated name" error. Example:

library(dplyr)
x <- data.frame(1, 2, 3)
names(x) <- c("a", "a", "b")

x %>%
  select(matches("b"))

If you don't need those columns, eliminate them with

x <- x[ !duplicated(names(x)) ]

Upvotes: 9
