Reputation: 523
I am trying to extract columns from a data table into a new one using select() from dplyr:
extract_Data <- select(.data = master_merge, subjectID, activity_ID,
contains("mean\\(\\)"), contains("std\\(\\)"))
There are 563 columns, so I am asking to extract the first and second columns (subject, activity) and every other column whose name contains mean() or std().
No duplicate columns can be created by this selection, so I am stumped as to why it fails. I have tried every variation of select, but I always get Error: Duplicated Column name.
How can I troubleshoot this? I have gone through all 563 column names and found no duplicates.
Upvotes: 21
Views: 27188
Reputation: 8484
Based on Lantana's great answer, here is a function for a pure dplyr solution with pipe integration:
validate.names = function(df){
  rtn = df
  valid_column_names = make.names(names=names(df), unique=TRUE, allow_ = TRUE)
  names(rtn) = valid_column_names
  rtn
}
You can then use it like this:
extract_Data %>% validate.names
Upvotes: 1
Reputation: 491
I was puzzled by the same error. Avoid using select. If meanStdcolumns is the list of columns containing mean or std (which you can get using grep), then master_merge[,meanStdcolumns] seems to work.
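To make the grep approach concrete, here is a minimal base R sketch. The toy data frame and its column names are made up for illustration (they only mimic the style of the real dataset), and keep_columns is a hypothetical name.

```r
# Toy stand-in for master_merge; the names are illustrative assumptions.
master_merge <- data.frame(1, 2, 0.1, 0.2, 0.3, 0.4)
names(master_merge) <- c("subjectID", "activity_ID",
                         "tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                         "fBodyAcc-bandsEnergy()-1,8",
                         "fBodyAcc-bandsEnergy()-1,8")  # deliberate duplicate

# grep() uses regular expressions, so the parentheses must be escaped.
meanStdcolumns <- grep("mean\\(\\)|std\\(\\)", names(master_merge))

# Keep the two ID columns plus every mean()/std() column.
keep_columns <- c(1, 2, meanStdcolumns)
extract_Data <- master_merge[, keep_columns]
```

Because base subsetting only looks at the columns you ask for, the duplicated names elsewhere in the table should not get in the way here.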
Upvotes: -1
Reputation: 129
Here is the solution I have found :
data <- data[ , !duplicated(colnames(data))]
This keeps the first column for each name and drops any later column whose name repeats an earlier one.
Hope it helps.
Upvotes: 12
Reputation: 31
Not a direct answer, but this will help a lot of people.
For all you Coursera students facing this problem with this dataset: there are duplicate column names. For example, 'fBodyAccJerk-bandsEnergy()-1,16' is found twice. Check:
your_merged_data_with_column_names[,400:420]
I'd love to show the output, but my browser won't support the 'code' button nor the ctrl-K shortcut and there's too much data to indent by hand. Try this code for yourself and carefully check the 'Variables not shown'!
I am working on a solution right now myself, possibly using the above answers, or the course forum.
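Rather than reading the names by eye, you can ask R which ones repeat. This is only a sketch on a made-up four-column table; toy, dup_names and dup_positions are hypothetical names, and you would substitute your own merged data.

```r
# Toy table with one name used twice (illustrative, not the real data).
toy <- data.frame(1, 2, 3, 4)
names(toy) <- c("subjectID",
                "fBodyAccJerk-bandsEnergy()-1,16",
                "tBodyAcc-mean()-X",
                "fBodyAccJerk-bandsEnergy()-1,16")

# duplicated() flags the second and later occurrences of each name.
dup_names <- unique(names(toy)[duplicated(names(toy))])

# Positions of every column that shares one of those names.
dup_positions <- which(names(toy) %in% dup_names)

print(dup_names)
print(dup_positions)
```

If dup_names comes back empty, the names really are unique and the problem lies elsewhere.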
Upvotes: 1
Reputation: 1
Before you assign the column names, filter the columns by getting a list of indices using
meanStdColumns <- grep("mean|std", features$V2, value = FALSE)
and then assign the column names using
meanStdColumnsNames <- grep("mean|std", features$V2, value = TRUE)
Upvotes: -3
Reputation: 496
The root of the problem is invalid characters in the original column names. The discussion in Variable Name Restrictions in R applies to column names, too. Try forcing unique column names with valid characters, with make.names().
valid_column_names <- make.names(names=names(master_merge), unique=TRUE, allow_ = TRUE)
names(master_merge) <- valid_column_names
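To see what make.names() does to names of this kind, here is a small sketch on a hand-written vector. The names in raw_names are made up to resemble the dataset and are not taken from it.

```r
# Illustrative names, including one exact duplicate.
raw_names <- c("subjectID",
               "tBodyAcc-mean()-X",
               "fBodyAcc-bandsEnergy()-1,8",
               "fBodyAcc-bandsEnergy()-1,8")

# Invalid characters become dots; unique = TRUE adds a numeric suffix
# to later repeats so that no two names are the same.
clean_names <- make.names(names = raw_names, unique = TRUE, allow_ = TRUE)

print(clean_names)
print(anyDuplicated(clean_names))  # 0 means no duplicates remain
```

Note that the parentheses are gone after this step, so a later selection has to match the dotted form of the name (for example "mean..") rather than the literal "mean()".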
Upvotes: 34
Reputation: 7232
Duplicate names among columns that the filter does not even match can still cause the "duplicated name" error. Example:
library(dplyr)
x <- data.frame(1, 2, 3)
names(x) <- c("a", "a", "b")
x %>%
select(matches("b"))
If you don't need those columns, eliminate them with
x <- x[ !duplicated(names(x)) ]
Upvotes: 9