Reputation: 13
I have a large database (90,000 * 1500) sorted by child observations - which includes their mom's info. I want to sort the database according to mom's data.
The problem is that each kid only appears once in DB mom bs. It may appear up to 10 times.
In addition, I want the number of rows to be a number of different mothers (approx. 40,000) and a bit of data for each child - between 0-10.
For example, the DB I have and the DB I want to create:
Upvotes: 0
Views: 539
Reputation: 102625
Here is a solution similar to @Metariat, but with base R
, where ave()
is used
df$seq <- with(df,ave(id_num,mom,FUN = seq_along))
dfout <- reshape(df, idvar = "mom", timevar = "seq", direction = "wide")
such that
> dfout
mom c.1 id_num.1 age.1 c.2 id_num.2 age.2
1 1 c1 1 12 c3 3 5
2 3 c2 2 15 c5 5 19
4 2 c4 4 8 <NA> NA NA
EDIT: If you have very big data frame, you can try the divide and conquer policy to see if it works
library(plyr)
dfs <- split(df,df$mom)
lst <- lapply(dfs, function(x) {
x <- within(x,seqnum <- ave(id_num,mom,FUN = seq_along))
reshape(x, idvar = "mom", timevar = "seqnum", direction = "wide")
}
)
dfout <- rbind.fill(lst)
Upvotes: 0
Reputation: 532
You could use reshape
library(data.table)
df = data.frame(
'c' = c('c1', 'c2', 'c3', 'c4', 'c5'),
'id_num' = seq(1,5),
'age' = c(12, 15, 5, 8, 19),
'mom'= c(1,3,1,2,3)
)
df
c id_num age mom
1 c1 1 12 1
2 c2 2 15 3
3 c3 3 5 1
4 c4 4 8 2
5 c5 5 19 3
df = setDT(df)[order(mom)]
df[, id_child := seq(.N), mom]
reshape(df, idvar = "mom", timevar = "id_child", direction = "wide")
mom c.1 id_num.1 age.1 c.2 id_num.2 age.2
1: 1 c1 1 12 c3 3 5
2: 2 c4 4 8 <NA> NA NA
3: 3 c2 2 15 c5 5 19
Upvotes: 2
Reputation: 56
You can do this using the tidyr
package, with group_by
.
group_by(data, mom)
Then each mom contains a list of children. You can then sort the database as follows.
arrange(data, id_num, .by_group = TRUE)
To filter children between 0 and 10:
filter(data, age <= 10)
Upvotes: 0