gtcodes
gtcodes

Reputation: 15

Transform a dataset

I want to transform an aggregated dataset into a new derived one with the individual instances that correspond to the initial aggregation. I loaded the dataset Titanic from R and looked at its data frame. I see that the frequencies of each tuple occurrence are aggregated. (For example, 20 Female Adult Crew Members survived the crash). I wanted to rebuild the dataset by replacing each tuple with the corresponding unaggregated tuples (e.g. 20 times the tuple "Crew, Female, Adult, Yes"). I know how to aggregate a dataset but I can not transform an already aggregated dataset. Any tip would be greatly appreciated.

library(dplyr)
library(purrr)
library(tidyr)

# keep data with frequency > 0
T = data.frame(Titanic, stringsAsFactors = F) %>% filter(Freq > 0)

tbl_df(T) %>%                           # tbl_df() only used to produce a more readable output (i.e. print only a few rows)
  mutate(id = map(Freq, ~ 1:.)) %>%     # create a vector of ids from 1 to Freq for each row
  unnest(id)                            # expand the vector

# # A tibble: 2,201 x 6
#    Class    Sex    Age Survived  Freq    id
#   <fctr> <fctr> <fctr>   <fctr> <dbl> <int>
# 1    3rd   Male  Child       No    35     1
# 2    3rd   Male  Child       No    35     2
# 3    3rd   Male  Child       No    35     3
# 4    3rd   Male  Child       No    35     4
# 5    3rd   Male  Child       No    35     5
# 6    3rd   Male  Child       No    35     6
# 7    3rd   Male  Child       No    35     7
# 8    3rd   Male  Child       No    35     8
# 9    3rd   Male  Child       No    35     9
# 10   3rd   Male  Child       No    35    10
# # ... with 2,191 more rows

You can remove the id column if you want. I left it there just to make it easier to see how the process works. You can also check that the number of rows of the new dataset is 2,201 which is equal to sum(T$Freq). So, as expected, the sum of frequencies of the original dataset is the number of rows of the new dataset.

Upvotes: 0

Views: 131

Answers (1)

AntoniosK
AntoniosK

Reputation: 16121

library(dplyr)
library(purrr)
library(tidyr)

# keep data with frequency > 0
T = data.frame(Titanic, stringsAsFactors = F) %>% filter(Freq > 0)

tbl_df(T) %>%                           # tbl_df() only used to produce a more readable output (i.e. print only a few rows)
  mutate(id = map(Freq, ~ 1:.)) %>%     # create a vector of ids from 1 to Freq for each row
  unnest(id)                            # expand the vector

# # A tibble: 2,201 x 6
#    Class    Sex    Age Survived  Freq    id
#   <fctr> <fctr> <fctr>   <fctr> <dbl> <int>
# 1    3rd   Male  Child       No    35     1
# 2    3rd   Male  Child       No    35     2
# 3    3rd   Male  Child       No    35     3
# 4    3rd   Male  Child       No    35     4
# 5    3rd   Male  Child       No    35     5
# 6    3rd   Male  Child       No    35     6
# 7    3rd   Male  Child       No    35     7
# 8    3rd   Male  Child       No    35     8
# 9    3rd   Male  Child       No    35     9
# 10   3rd   Male  Child       No    35    10
# # ... with 2,191 more rows

You can remove the id column if you want. I left it there just to make it easier to see how the process works.

You can also check that the number of rows of the new dataset is 2,201 which is equal to sum(T$Freq). So, as expected, the sum of frequencies of the original dataset is the number of rows of the new dataset.

Upvotes: 0

Related Questions