How to split a dataframe into a list of dataframes by column name prefix in R?

Question

Suppose I have the following dataframe:

df <- data.frame(BR.a=rnorm(10), BR.b=rnorm(10), BR.c=rnorm(10),
USA.a=rnorm(10), USA.b = rnorm(10), FRA.a=rnorm(10), FRA.b=rnorm(10))

I want to create a list of dataframes, separating them by the first part of the column name, i.e., the columns that start with "BR" would be one element of the list, the columns that start with "USA" would be another, and so on.

I'm able to get the column names and split them using strsplit. However, I'm not sure of the best way to iterate over the result and split the dataframe.

strsplit(names(df), "\\.")

gives me a list whose top-level elements correspond to the column names, each holding the pieces of that name split on ".".

How could I iterate over this list to get the indices of the columns that start with the same substring, and then group those columns as elements of another list?

Matt Parker · Accepted Answer

Dason beat me to it, but here's a different flavor of the same conceptual approach:

library(plyr)

# Use a regex to get the prefixes.
# Captures any word characters (letters, digits, or underscores: "\\w*") from
# the beginning of the string ("^") up to the first period ("\\.") in a group,
# then matches all the remaining characters (".*"), and replaces the whole
# match with the first group ("\\1" = "(\\w*)").
# In other words, it matches the whole string but replaces with only the prefix.

prefixes <- unique(gsub(pattern = "^(\\w*)\\..*",
                        replacement = "\\1",
                        x = names(df)))

# Subset to the variables that match the prefix
# Iterates over the prefixes and subsets based on the variable names that
# match that prefix
llply(prefixes, .fun = function(x){
    y <- subset(df, select = names(df)[grep(names(df),
                                            pattern = paste("^", x, sep = ""))])
})
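
For comparison, here is a minimal base R sketch of the same idea that does not need plyr. It assumes the prefix is everything before the first period, rebuilds the example data from the question, and uses the names `prefix` and `by_prefix` purely for illustration. `split.default()` treats the data frame as a list of columns, so each group comes back as a data frame.

```r
# Example data from the question (seed added only for reproducibility)
set.seed(1)
df <- data.frame(BR.a = rnorm(10), BR.b = rnorm(10), BR.c = rnorm(10),
                 USA.a = rnorm(10), USA.b = rnorm(10),
                 FRA.a = rnorm(10), FRA.b = rnorm(10))

# Prefix = everything before the first period in each column name
prefix <- sub("\\..*$", "", names(df))

# Fix the factor levels to the order of first appearance so the list
# is not reordered alphabetically
prefix <- factor(prefix, levels = unique(prefix))

# Group the columns by prefix; each element is a data frame
by_prefix <- split.default(df, prefix)

names(by_prefix)
names(by_prefix$BR)
```

Unlike the llply() call above, the resulting list is named by prefix, so a group can be pulled out with `by_prefix$USA`.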

I think these regexes should still give you the right results even if there are "." later in variable names:

unique(gsub(pattern = "^(\\w*)\\..*",
            replacement = "\\1",
            x = c(names(df), "FRA.c.blahblah")))

Or if a prefix appears later in a variable name:

# Add a USA variable with "FRA" in it
df2 <- data.frame(df, USA.FRANKLINS = rnorm(10))

prefixes2 <- unique(gsub(pattern = "^(\\w*)\\..*",
                         replacement = "\\1",
                         x = names(df2)))

llply(prefixes2, .fun = function(x){
    y <- subset(df2, select = names(df2)[grep(names(df2),
                                            pattern = paste("^", x, sep = ""))])
})
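
One case this does not cover is a prefix that is itself the start of another prefix: the pattern "^FRA" would also match a column named "FRANCE.a". A hedged sketch of one way around that, assuming every column name has the form prefix-period-rest, is to match the prefix together with its trailing period. The data frame `df3` and the names `prefixes3`, `loose`, and `out` are made up for illustration.

```r
# Hypothetical data where one prefix ("FRA") is the start of another ("FRANCE")
df3 <- data.frame(FRA.a = 1:3, FRANCE.a = 4:6, FRA.b = 7:9)

prefixes3 <- unique(sub("\\..*$", "", names(df3)))

# Matching on "^FRA" alone picks up the FRANCE column as well
loose <- grep("^FRA", names(df3), value = TRUE)

# Requiring the literal "." right after the prefix keeps the groups apart
out <- lapply(prefixes3, function(p) {
    df3[startsWith(names(df3), paste0(p, "."))]
})
names(out) <- prefixes3

loose
lapply(out, names)
```

`startsWith()` does a literal comparison rather than a regex match, so the period needs no escaping here.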
