Reputation: 1

Subset data frame based on first letters of column name

I have a large dataframe with multiple columns representing different variables that were measured for different individuals. The name of the columns always start with a number (e.g. 1:18). I would like to subset the df and create separete dfs for each individual. Here it is an example:

x <- as.data.frame(matrix(nrow=10,ncol=18))
colnames(x) <- paste(1:18, 'col', sep="")

The column names of my real df is a composition of the Individual ID, the variable name, and the number of the measure (I took 3 measures of each variable). So for instance I have the measure b (body) for individual 1, then in the df I would have 3 columns named: 1b1, 1b2, 1b3. In the end I have 10 different regions (body, head, tail, tail base, dorsum, flank, venter, throat, forearm, leg). So for each individual I have 30 columns (10 regions x 3 measures per region). So I have multiple variables starting with the different numbers and I would like to subset then based on their unique numbers. I tried using grep:

partialName <- 1
df2<- x[,grep(partialName, colnames(x))]
colnames(x)
[1] "1col" "2col" "3col" "4col" "5col" "6col" "7col" "8col" "9col" "10col" 
"11col" "12col" "13col" "14col" "15col" "16col" "17col" "18col"

My problem here as you can see it doesn't separate the individuals because 1 and 10 are in the subset. In other words this selects everybody that starts with 1. Ultimately what I would like to do is to loop over all my individuals (1:18), creating new dfs for each individual.

Upvotes: 0

Answers (2)

llrs

Reputation: 3397

It seems that your colnames are the standard ones of a data.frame, so to get just the column 1 you can do this:

df2 <- df[,1] #Where 1 can be changed to the number of column you wish.

There is no need to subset by a partial name. Although it is not recommended you could create a loop to do so:

for (i in ncol(x)){
  assing(paste("df",i), x[,i]) #I use paste to get a different name for each column
}

Although the @paulhiemstra solution avoids the loop.

So with the new information then you can do as you wanted with grep, but specifically telling how many matches you expect:

df2<- x[,grep("1{30}", colnames(x))]

Upvotes: 1

Paul Hiemstra

Reputation: 60984

I think keeping the data in one data.frame is the best option here. Either that, or put it into a list of data.frame's. This makes it easy to extract summary statistics per individual much easier.

First create some example data:

df = as.data.frame(matrix(runif(50 * 100), 100, 50), stringsAsFactors = FALSE)
names_variables = c('spam', 'ham', 'shrub')
individuals = 1:100
column_names = paste(sample(individuals, 50), 
                     sample(names_variables, 50, TRUE), 
                     sep = '')
colnames(df) = column_names

What I would do first is use melt to cast the data from wide format to long format. This essentially stacks all the columns in one big vector, and adds an extra column telling which column it came from:

library(reshape2)
df_melt = melt(df)
head(df_melt)
  variable      value
1    85ham 0.83619111
2    85ham 0.08503596
3    85ham 0.54599402
4    85ham 0.42579376
5    85ham 0.68702319
6    85ham 0.88642715

Then we need to separate the ID number from the variable. The assumption here is that the numeric part of the variable is the individual ID, and the text is the variable name:

library(dplyr)
df_melt = mutate(df_melt, individual_ID = gsub('[A-Za-z]', '', variable),
                          var_name = gsub('[0-9]', '', variable))

essentially removing the part of the string not needed. Now we can do nice things like:

mean_per_indivdual_per_var = summarise(group_by(df_melt, individual_ID, var_name), 
                                       mean(value))
head(mean_per_indivdual_per_var)
  individual_ID var_name mean(value)
1            63     spam   0.4840511
2            46      ham   0.4979884
3            20    shrub   0.5094550
4            90      ham   0.5550148
5            30    shrub   0.4233039
6            21      ham   0.4764298

Upvotes: 1

Subset data frame based on first letters of column name

Answers (2)

Related Questions