Reputation: 1
I have a large dataframe with multiple columns representing different variables that were measured for different individuals. The name of the columns always start with a number (e.g. 1:18). I would like to subset the df and create separete dfs for each individual. Here it is an example:
x <- as.data.frame(matrix(nrow=10,ncol=18))
colnames(x) <- paste(1:18, 'col', sep="")
The column names of my real df is a composition of the Individual ID, the variable name, and the number of the measure (I took 3 measures of each variable). So for instance I have the measure b (body) for individual 1, then in the df I would have 3 columns named: 1b1, 1b2, 1b3. In the end I have 10 different regions (body, head, tail, tail base, dorsum, flank, venter, throat, forearm, leg). So for each individual I have 30 columns (10 regions x 3 measures per region). So I have multiple variables starting with the different numbers and I would like to subset then based on their unique numbers. I tried using grep:
partialName <- 1
df2<- x[,grep(partialName, colnames(x))]
colnames(x)
[1] "1col" "2col" "3col" "4col" "5col" "6col" "7col" "8col" "9col" "10col"
"11col" "12col" "13col" "14col" "15col" "16col" "17col" "18col"
My problem here as you can see it doesn't separate the individuals because 1 and 10 are in the subset. In other words this selects everybody that starts with 1. Ultimately what I would like to do is to loop over all my individuals (1:18), creating new dfs for each individual.
Upvotes: 0
Views: 4399
Reputation: 3397
It seems that your colnames
are the standard ones of a data.frame, so to get just the column 1 you can do this:
df2 <- df[,1] #Where 1 can be changed to the number of column you wish.
There is no need to subset by a partial name. Although it is not recommended you could create a loop to do so:
for (i in ncol(x)){
assing(paste("df",i), x[,i]) #I use paste to get a different name for each column
}
Although the @paulhiemstra solution avoids the loop.
So with the new information then you can do as you wanted with grep, but specifically telling how many matches you expect:
df2<- x[,grep("1{30}", colnames(x))]
Upvotes: 1
Reputation: 60924
I think keeping the data in one data.frame
is the best option here. Either that, or put it into a list of data.frame
's. This makes it easy to extract summary statistics per individual much easier.
First create some example data:
df = as.data.frame(matrix(runif(50 * 100), 100, 50), stringsAsFactors = FALSE)
names_variables = c('spam', 'ham', 'shrub')
individuals = 1:100
column_names = paste(sample(individuals, 50),
sample(names_variables, 50, TRUE),
sep = '')
colnames(df) = column_names
What I would do first is use melt
to cast the data from wide format to long format. This essentially stacks all the columns in one big vector, and adds an extra column telling which column it came from:
library(reshape2)
df_melt = melt(df)
head(df_melt)
variable value
1 85ham 0.83619111
2 85ham 0.08503596
3 85ham 0.54599402
4 85ham 0.42579376
5 85ham 0.68702319
6 85ham 0.88642715
Then we need to separate the ID number from the variable. The assumption here is that the numeric part of the variable is the individual ID, and the text is the variable name:
library(dplyr)
df_melt = mutate(df_melt, individual_ID = gsub('[A-Za-z]', '', variable),
var_name = gsub('[0-9]', '', variable))
essentially removing the part of the string not needed. Now we can do nice things like:
mean_per_indivdual_per_var = summarise(group_by(df_melt, individual_ID, var_name),
mean(value))
head(mean_per_indivdual_per_var)
individual_ID var_name mean(value)
1 63 spam 0.4840511
2 46 ham 0.4979884
3 20 shrub 0.5094550
4 90 ham 0.5550148
5 30 shrub 0.4233039
6 21 ham 0.4764298
Upvotes: 1