Reputation: 15
I want to transform an aggregated dataset into a new derived one with the individual instances that correspond to the initial aggregation. I loaded the dataset Titanic from R and looked at its data frame. I see that the frequencies of each tuple occurrence are aggregated. (For example, 20 Female Adult Crew Members survived the crash). I wanted to rebuild the dataset by replacing each tuple with the corresponding unaggregated tuples (e.g. 20 times the tuple "Crew, Female, Adult, Yes"). I know how to aggregate a dataset but I can not transform an already aggregated dataset. Any tip would be greatly appreciated.
library(dplyr)
library(purrr)
library(tidyr)
# keep data with frequency > 0
T = data.frame(Titanic, stringsAsFactors = F) %>% filter(Freq > 0)
tbl_df(T) %>% # tbl_df() only used to produce a more readable output (i.e. print only a few rows)
mutate(id = map(Freq, ~ 1:.)) %>% # create a vector of ids from 1 to Freq for each row
unnest(id) # expand the vector
# # A tibble: 2,201 x 6
# Class Sex Age Survived Freq id
# <fctr> <fctr> <fctr> <fctr> <dbl> <int>
# 1 3rd Male Child No 35 1
# 2 3rd Male Child No 35 2
# 3 3rd Male Child No 35 3
# 4 3rd Male Child No 35 4
# 5 3rd Male Child No 35 5
# 6 3rd Male Child No 35 6
# 7 3rd Male Child No 35 7
# 8 3rd Male Child No 35 8
# 9 3rd Male Child No 35 9
# 10 3rd Male Child No 35 10
# # ... with 2,191 more rows
You can remove the id column if you want. I left it there just to make it easier to see how the process works. You can also check that the number of rows of the new dataset is 2,201 which is equal to sum(T$Freq). So, as expected, the sum of frequencies of the original dataset is the number of rows of the new dataset.
Upvotes: 0
Views: 131
Reputation: 16121
library(dplyr)
library(purrr)
library(tidyr)
# keep data with frequency > 0
T = data.frame(Titanic, stringsAsFactors = F) %>% filter(Freq > 0)
tbl_df(T) %>% # tbl_df() only used to produce a more readable output (i.e. print only a few rows)
mutate(id = map(Freq, ~ 1:.)) %>% # create a vector of ids from 1 to Freq for each row
unnest(id) # expand the vector
# # A tibble: 2,201 x 6
# Class Sex Age Survived Freq id
# <fctr> <fctr> <fctr> <fctr> <dbl> <int>
# 1 3rd Male Child No 35 1
# 2 3rd Male Child No 35 2
# 3 3rd Male Child No 35 3
# 4 3rd Male Child No 35 4
# 5 3rd Male Child No 35 5
# 6 3rd Male Child No 35 6
# 7 3rd Male Child No 35 7
# 8 3rd Male Child No 35 8
# 9 3rd Male Child No 35 9
# 10 3rd Male Child No 35 10
# # ... with 2,191 more rows
You can remove the id
column if you want. I left it there just to make it easier to see how the process works.
You can also check that the number of rows of the new dataset is 2,201 which is equal to sum(T$Freq)
. So, as expected, the sum of frequencies of the original dataset is the number of rows of the new dataset.
Upvotes: 0