Reputation: 43
I am trying to split a large data frame into smaller data frames based on a number of conditions in R. I would like each of these data frames to be named according to the variables on which they were split, but as there are over 1660 new "sub" data frames, I cannot do this manually.
An example of the whole data frame:
ID LENGTH GRADE CODE DURATION STATUS
1 1 A1 ABC Less than 10 Y
2 2 A1 ABC More than 10 Y
3 1 A1 DEF Less than 10 Y
4 2 A2 ABC Less than 10 Y
5 1 B1 ABC More than 10 Y
6 3 B2 DEF Less than 10 Y
There are over 900,000 entries to be split by 7 variables into about 1660 non-empty groups - I have found this by creating a new grouped dataframe
> Grouped_DF<- DF %>% group_by(LENGTH,GRADE,CODE,DURATION,STATUS,...)
> nrow(Grouped_Data)
[1] 1660
which consists of the groups I desire, but now I want to create a new dataframe for each of these groups, with all of the entries that fall into each group. I have tried using the split function:
SplitGroups<-split(DF, with(DF, interaction(LENGTH,GRADE,CODE,DURATION,STATUS,..)))
Which generates the following list:
> class(SplitGroups)
[1] "list"
> length(SplitGroups)
[1] 24480
An example of the output:
> SplitGroups
$1.A1.ABC.Less Than 10.N`
# A tibble: 10 x 65
# Groups: ID [10]
# ... with 65 variables:
Now I want to take the non empty dataframes, rename them as, for example, '1.A1.ABC.Less Than 10.N' (or something similar) and store this into the global environment.
I am aware this could be done using subset, for example:
1.A1.ABC.LessThan10.N <- subset(DF, LENGTH==1 & GRADE=="A1" & CODE=="ABC" & .....)
and so on, but this is not practical for the number of subsets needed.
Any help would be appreciated, thanks.
Upvotes: 3
Views: 2993
Reputation: 13581
I would also recommend keeping your data frames in a list, but here's a solution to what you wanted
Your data:
df <- data.frame(ID=1:6, LENGTH=c(1,2,1,2,1,3),
GRADE=c(rep("A1",3),"A2","B1","B2"),
CODE=c("ABC","ABC","DEF","ABC","ABC","DEF"),
DURATION=c("Less than 10", "More than 10", "Less than 10",
"Less than 10", "More than 10", "Less than 10"),
STATUS=rep("Y",6), stringsAsFactors=F)
Grouping your data by all columns and adding a column (newnames
)to create a unique variable name. *Note that I removed spaces from DURATION
using gsub
and prepended newnames
with a Z
because R
does not like variable names that start with a number:
grp.df <- df %>%
group_by_all() %>%
mutate(newnames=paste0("Z",ID,LENGTH,GRADE,CODE,gsub(" ", "", DURATION),STATUS))
Splitting your data frame into a list (as you've already done...)
split.df <- split(grp.df, grp.df$newnames)
Saving split data frames by new variable names using assign
for (I in 1:length(split.df)) { assign(unique(split.df[[I]]$newnames), split.df[[I]]) }
Upvotes: 1