ccc
ccc

Reputation: 303

Complex data frame and transpose the data

I have a data frame as shown below:

ID  Capital  Instal  Date1 Date2
2    500      25      a     b
2    500      20      a     c
2    450      15      a     a
2    300      10      a     f
2    250       0      a     z
4    100      25      b     a
4    90       20      b     b
4    80       15      b     a
4    75       10      b     f
4    25        0      b     z

I want to create a new data.frame from this, where if Date1=Date2 then my new data frame B would look like the one below:

ID  Date1  Capital  Instal1  Instal2  Instal3  Instal4
2     a     450       15       10       0
4     b      90       20       15       10       0

So I want the new data.frame to only consider the data after Date1 and Date2 are equal.

Upvotes: 4

Views: 61

Answers (3)

moodymudskipper
moodymudskipper

Reputation: 47320

tidyverse

Here is a tidyverse approach (dplyr + tidyr):

library(tidyverse)
df2 <- df %>%
  group_by(ID) %>%
  filter(cumsum(Date1 == Date2) >0) %>%
  transmute(Capital=Capital[1],Instal,Date1,colnames = paste0("Instal",seq(n()))) %>%
  ungroup %>%
  spread(colnames,Instal)

df2[is.na(df2)] <- 0 # omit if you'd rather have NA
# # A tibble: 2 x 7
#      ID Capital Date1 Instal1 Instal2 Instal3 Instal4
# * <int>   <int> <chr>   <int>   <int>   <int>   <int>
# 1     2     450     a      15      10       0       0
# 2     4      90     b      20      15      10       0

The filter call removes rows before Date1 == Date2

With transmute we keep only necessary columns and create the column names that we will spread. We set all values of Capital to the first one as it's the only oe we need. ID is grouped so it's kept by default and is not allowed in transmute.

Then we ungroup and do a textbook spread

base R

In base R we can use split and reshape and follow the same ideas, with a bit of tedious reformatting in the end to pad the narrower sub data.frames.

df_list <- 
lapply(split(df,df$ID),function(x) {
  x <- subset(x,cumsum(Date1==Date2)>0)
  x <- transform(x, Capital=Capital[1], time = seq(nrow(x)))
  reshape(x,idvar=c("ID","Capital","Date1"),direction="wide",sep="",drop="Date2")
})
all_names <- names(df_list[[which.max(lengths(df_list))]])
df_list_full <- lapply(df_list,function(x) {x[setdiff(all_names,names(x))] <- NA;x})
do.call(rbind, df_list_full)

#   ID Capital Date1 Instal1 Instal2 Instal3 Instal4
# 2  2     450     a      15      10       0      NA
# 4  4      90     b      20      15      10       0

data :

df <- read.table(text = "ID  Capital  Instal  Date1 Date2
                2    500      25      a     b
                2    500      20      a     c
                2    450      15      a     a
                2    300      10      a     f
                2    250       0      a     z
                4    100      25      b     a
                4    90       20      b     b
                4    80       15      b     a
                4    75       10      b     f
                4    25        0      b     z",h=T,strin=F)

Upvotes: 2

Andrew Gustar
Andrew Gustar

Reputation: 18425

Here is another tidyverse solution...

library(dplyr)
library(tidyr)

df2 <- df %>% 
  group_by(ID) %>%   #group by ID
  mutate(ind=cumsum(Date1==Date2)) %>%  #mark elements after first Date1==Date2
  filter(ind!=0) %>%  #remove previous elements
  summarise(Date1=first(Date1),
            Capital=first(Capital),
            Instal=list(Instal)) %>%  #capture values for table
  unnest() %>%  #spread Instal, one value per row
  group_by(ID) %>% 
  mutate(Inst=paste0("Instal",row_number())) %>%  #mark names of Instal values
  spread(key=Inst,value=Instal)  #spread into wide format

df2
     ID Date1 Capital Instal1 Instal2 Instal3 Instal4
1     2     a     450      15      10       0      NA
2     4     b      90      20      15      10       0

Upvotes: 1

phiver
phiver

Reputation: 23598

A bit in a roundabout way. I'm sure there is a faster way of doing this, but this gets you exactly what you expect as output.

Steps: when date1 == date2 select the row_number and fill it out after the selection. Filter these records and select only the needed columns. create a column to be used as headers in the spread and spread the Instal data. next join a subset of the data to get the correct value of Capital and join this table with the previous selection.

library(dplyr)
library(tidyr)

df %>%
  group_by(ID) %>% 
  mutate(rowid = row_number(),
         selection = ifelse(Date1 == Date2, rowid, NA)) %>% 
  fill(selection) %>% # fill rowid over the rows after date1 == date2
  filter(!is.na(selection)) %>% 
  select(ID, Date1, Instal) %>% 
  mutate(Installation = paste0("Instal", row_number())) %>% 
  spread(Installation, Instal) %>% 
  inner_join(df %>% filter(Date1 == Date2) %>% select(ID, Date1, Capital), .)

  ID Date1 Capital Instal1 Instal2 Instal3 Instal4
1  2     a     450      15      10       0      NA
2  4     b      90      20      15      10       0

data:

df <- structure(list(ID = c(2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L), 
    Capital = c(500L, 500L, 450L, 300L, 250L, 100L, 90L, 80L, 
    75L, 25L), Instal = c(25L, 20L, 15L, 10L, 0L, 25L, 20L, 15L, 
    10L, 0L), Date1 = c("a", "a", "a", "a", "a", "b", "b", "b", 
    "b", "b"), Date2 = c("b", "c", "a", "f", "z", "a", "b", "a", 
    "f", "z")), class = "data.frame", row.names = c(NA, -10L))

Upvotes: 0

Related Questions