user183974

Reputation: 183

Aggregating a dataset

I have a dataset of criminal offence history that is set out in the following way:

ID Charge   Chargedate VictimID ...
1  Robbery  2013-04-05 1
1  Theft    2013-04-06 2
1  Theft    2013-04-07 2
2  Homicide 2013-04-08 3
2  Theft    2013-04-09 3
2  Burglary 2013-04-10 3
...

I would like to reshape the dataset in two ways. First, I would like each row to correspond to a unique ID value, dropping VictimID. I would also like to summarise the presence of charges as counts: e.g. rather than having 15 separate theft variables in the dataset, I would like a single Theftcount variable with the value 15.

e.g.

ID Robberycount Robberydate1 Theftcount Theftdate1 Theftdate2 ...
1  1             2013-04-05  2          2013-04-06 2013-04-07
2  0             NA          1          2013-04-09 NA      
...

The second dataset I would like to create reshapes the data so that each row corresponds to a unique (ID, VictimID) pair, e.g.

ID VictimID Robberycount Robberydate1 Theftcount Theftdate1 Theftdate2 ...
1  1        1             2013-04-05  0          NA         NA
1  2        0             NA          2          2013-04-06 2013-04-07
2  3        0             NA          1          2013-04-09 NA
...

I have tried using melt and dcast from the reshape2 package, but I cannot seem to get the result that I want. In particular, I do not know how to make these functions aggregate the offence data and make the dates specific to each charge. Is there a way to achieve what I want without resorting to manually sorting the dataset?

Upvotes: 1

Views: 63

Answers (1)

Tino

Reputation: 2101

You need to do this in two steps, i.e. transform to wide format twice: once for the counts and once for the dates. To do so, prepare the two key columns first. The ugly part is that you end up with extra rows per group, which can be fixed with dplyr::summarise and unique (na.rm would be a nice feature in unique here ;-)). Try this:

df <- read.table(text = "ID Charge  Chargedate VictimID
                 1  Robbery 2013-04-05  1  
                 1  Theft   2013-04-06  2 
                 1  Theft   2013-04-07  2
                 2  Homicide 2013-04-08  3
                 2  Theft   2013-04-09  3
                 2  Burglary 2013-04-10  3
                 ", header = TRUE, stringsAsFactors = FALSE)

library(dplyr)
library(tidyr)
# first data frame: one row per unique ID
df %>%
  group_by(ID, Charge) %>% 
  # build one key per date occurrence (Theftdate1, Theftdate2, ...)
  # and one key per charge type for the count (Theftcount, ...)
  mutate(key_date = paste0(Charge, "date", seq_len(n())),
         key_count = paste0(Charge, "count"),
         count = n()) %>% 
  ungroup() %>% 
  select(-Charge, -VictimID) %>% 
  # widen twice: once for the counts, once for the dates
  spread(key = key_count, value = count, fill = 0) %>% 
  spread(key = key_date, value = Chargedate) %>% 
  group_by(ID) %>% 
  # the double spread leaves several rows per ID; make the counts
  # consistent across those rows, then collapse each group to one row
  mutate_at(.vars = vars(matches("count$")), sum) %>% 
  summarise_all(.funs = function(x) {
    x <- unique(x[!is.na(x)])
    ifelse(length(x) == 0, NA_character_, x)
  })

# second data frame you asked for: one row per (ID, VictimID) pair;
# same logic, but VictimID is kept and added to the grouping
df %>%
  group_by(ID, Charge, VictimID) %>% 
  mutate(key_date = paste0(Charge, "date", seq_len(n())),
         key_count = paste0(Charge, "count"),
         count = n()) %>% 
  ungroup() %>% 
  select(-Charge) %>% 
  spread(key = key_count, value = count, fill = 0) %>% 
  spread(key = key_date, value = Chargedate) %>% 
  group_by(ID, VictimID) %>% 
  mutate_at(.vars = vars(matches("count$")), sum) %>% 
  summarise_all(.funs = function(x) {
    x <- unique(x[!is.na(x)])
    ifelse(length(x) == 0, NA_character_, x)
  })
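As a side note: in newer versions of tidyr (1.1 and later), spread has been superseded by pivot_wider, which can build both sets of column names directly via its names_glue argument, avoiding the extra-rows problem entirely. A minimal sketch of the first reshape under that assumption (column names occ, counts, and dates are my own, not from the question):

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "ID Charge Chargedate VictimID
1 Robbery 2013-04-05 1
1 Theft 2013-04-06 2
1 Theft 2013-04-07 2
2 Homicide 2013-04-08 3
2 Theft 2013-04-09 3
2 Burglary 2013-04-10 3", header = TRUE, stringsAsFactors = FALSE)

# one column per charge type, filled with the per-ID count (0 if absent)
counts <- df %>%
  count(ID, Charge) %>%
  pivot_wider(names_from = Charge, values_from = n,
              names_glue = "{Charge}count", values_fill = 0)

# number the occurrences within each (ID, Charge), then widen the dates
dates <- df %>%
  group_by(ID, Charge) %>%
  mutate(occ = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = ID, names_from = c(Charge, occ),
              values_from = Chargedate,
              names_glue = "{Charge}date{occ}")

left_join(counts, dates, by = "ID")
```

The second data frame follows the same pattern with VictimID added to id_cols and to the grouping.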

Upvotes: 2
