scopa

Reputation: 523

dplyr "Select" - Error: found duplicated column name

I am trying to extract columns from one data table into a new one using dplyr's select():

extract_Data <- select(.data = master_merge, subjectID, activity_ID,
                           contains("mean\\(\\)"), contains("std\\(\\)"))

There are 563 columns, so I am asking select() to extract the first and second columns (subject, activity) and every other column whose name contains mean() or std().

No duplicate columns can be created here, so I am stumped as to why this fails. I have tried every variation of select(), but I always get "Error: found duplicated column name".

How can I troubleshoot this? I have gone through all 563 column names and there are no duplicates.

Upvotes: 21

Views: 27188

Answers (7)

Dan Chaltiel

Reputation: 8484

Based on Lantana's great answer, here is a function that wraps the fix so it can be used in a dplyr pipe:

# Return df with syntactically valid, unique column names.
validate.names <- function(df) {
  names(df) <- make.names(names = names(df), unique = TRUE, allow_ = TRUE)
  df
}

You can then use it like this, on the data frame before the select() call:

master_merge %>% validate.names()

Upvotes: 1

usr0192

Reputation: 491

I was puzzled by the same error. One workaround is to avoid select(): if meanStdcolumns is the vector of columns containing mean or std (which you can get using grep), then master_merge[, meanStdcolumns] seems to work.
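A minimal sketch of that workaround, assuming a toy stand-in for master_merge (the six column names below are illustrative, not the real 563). It picks columns by position with base subsetting, so the duplicated names elsewhere in the table never come into play:

```r
# Toy stand-in for master_merge; these column names are assumptions for illustration.
master_merge <- data.frame(1, 2, 3, 4, 5, 6)
names(master_merge) <- c("subjectID", "activity_ID",
                         "tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                         "fBodyAcc-bandsEnergy()-1,8",
                         "fBodyAcc-bandsEnergy()-1,8")

# Positions of the columns whose names contain the literal text "mean()" or "std()".
meanStdcolumns <- grep("mean\\(\\)|std\\(\\)", names(master_merge))

# Base subsetting by position: keep the two ID columns plus the matches.
extract_Data <- master_merge[, c(1, 2, meanStdcolumns)]
names(extract_Data)
```

Note that grep() takes a regular expression, so the parentheses have to be escaped here.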

Upvotes: -1

laurent

Reputation: 129

Here is the solution I have found :

data <- data[ , !duplicated(colnames(data))]

This keeps the first column for each name and drops every later column that repeats a name already seen.

Hope it helps.

Upvotes: 12

Wil

Reputation: 31

Not a direct answer, but this will help a lot of people.

For all you Coursera students facing this problem with this dataset: there are duplicate column names. For example, 'fBodyAccJerk-bandsEnergy()-1,16' is found twice. Check:

your_merged_data_with_column_names[,400:420]

I'd love to show the output, but my browser won't support the 'code' button nor the ctrl-K shortcut and there's too much data to indent by hand. Try this code for yourself and carefully check the 'Variables not shown'!

I am working on a solution right now myself, possibly using the above answers, or the course forum.
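As a possible shortcut to scanning the printed output by eye, the duplicated names can be listed directly. This is only a sketch: col_names is a hypothetical short vector standing in for names(your_merged_data_with_column_names).

```r
# Hypothetical stand-in for the real column names.
col_names <- c("subjectID",
               "fBodyAccJerk-bandsEnergy()-1,16",
               "tBodyAcc-mean()-X",
               "fBodyAccJerk-bandsEnergy()-1,16")

# Names that occur more than once.
dup_names <- unique(col_names[duplicated(col_names)])
dup_names

# Positions of every occurrence of those names, including the first one.
dup_positions <- which(col_names %in% dup_names)
dup_positions
```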

Upvotes: 1

Vare Vadal

Reputation: 1

Before you assign the column names, filter the columns by getting a vector of indices using

meanStdColumns <- grep("mean|std", features$V2, value = FALSE)

and then assign the column names using

meanStdColumnsNames <- grep("mean|std", features$V2, value = TRUE)
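Put together, the idea might look like the sketch below. Both features and measurements are toy stand-ins made up for illustration; only the two grep() calls come from the lines above.

```r
# Toy stand-in for the features table (V2 holds the feature names).
features <- data.frame(
  V1 = 1:4,
  V2 = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
         "fBodyAcc-bandsEnergy()-1,8", "fBodyAcc-bandsEnergy()-1,8"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the measurement data, which has no meaningful names yet.
measurements <- data.frame(0.1, 0.2, 0.3, 0.4)

meanStdColumns <- grep("mean|std", features$V2, value = FALSE)      # indices
meanStdColumnsNames <- grep("mean|std", features$V2, value = TRUE)  # names

# Subset first, then name: the duplicated names are never assigned.
subset_data <- measurements[, meanStdColumns, drop = FALSE]
names(subset_data) <- meanStdColumnsNames
names(subset_data)
```

The pattern "mean|std" matches any name containing either substring, so tighten it if that is broader than you want.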

Upvotes: -3

Lantana

Reputation: 496

The root of the problem is invalid characters in the original column names. The discussion in "Variable Name Restrictions in R" applies to column names, too. Try forcing unique column names with valid characters, using make.names():

valid_column_names <- make.names(names=names(master_merge), unique=TRUE, allow_ = TRUE)
names(master_merge) <- valid_column_names
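To see what this does, here is a small sketch on a hypothetical vector of names (not the full dataset). Keep in mind that the parentheses and hyphens become dots, so any pattern used afterwards has to target the cleaned names rather than the literal "mean()".

```r
# Hypothetical sample of column names, including one duplicate.
original <- c("subjectID", "tBodyAcc-mean()-X", "tBodyAcc-std()-X",
              "fBodyAcc-bandsEnergy()-1,8", "fBodyAcc-bandsEnergy()-1,8")

# Invalid characters are replaced with "." and duplicates get a numeric suffix.
cleaned <- make.names(names = original, unique = TRUE, allow_ = TRUE)
cleaned

# "mean()" is now "mean.." in the cleaned names, so match on that form.
is_mean_std <- grepl("\\.(mean|std)\\.\\.", cleaned)
cleaned[is_mean_std]
```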

Upvotes: 34

bergant

Reputation: 7232

Duplicated column names can cause the "duplicated name" error even when they fall outside the columns matched by the filter. Example:

library(dplyr)
x <- data.frame(1, 2, 3)
names(x) <- c("a", "a", "b")

x %>%
  select(matches("b"))

If you don't need those columns, eliminate them with:

x <- x[ !duplicated(names(x)) ]

Upvotes: 9
