Reputation: 303
I have a data frame as shown below:
ID Capital Instal Date1 Date2
2 500 25 a b
2 500 20 a c
2 450 15 a a
2 300 10 a f
2 250 0 a z
4 100 25 b a
4 90 20 b b
4 80 15 b a
4 75 10 b f
4 25 0 b z
I want to create a new data.frame
from this, where if Date1=Date2
then my new data frame B
would look like the one below:
ID Date1 Capital Instal1 Instal2 Instal3 Instal4
2 a 450 15 10 0
4 b 90 20 15 10 0
So I want the new data.frame
to only consider the data after Date1
and Date2
are equal.
Upvotes: 4
Views: 61
Reputation: 47320
tidyverse
Here is a tidyverse
approach (dplyr
+ tidyr
):
library(tidyverse)
df2 <- df %>%
group_by(ID) %>%
filter(cumsum(Date1 == Date2) >0) %>%
transmute(Capital=Capital[1],Instal,Date1,colnames = paste0("Instal",seq(n()))) %>%
ungroup %>%
spread(colnames,Instal)
df2[is.na(df2)] <- 0 # omit if you'd rather have NA
# # A tibble: 2 x 7
# ID Capital Date1 Instal1 Instal2 Instal3 Instal4
# * <int> <int> <chr> <int> <int> <int> <int>
# 1 2 450 a 15 10 0 0
# 2 4 90 b 20 15 10 0
The filter
call removes rows before Date1 == Date2
With transmute
we keep only necessary columns and create the column names that we will spread. We set all values of Capital
to the first one as it's the only oe we need. ID
is grouped so it's kept by default and is not allowed in transmute
.
Then we ungroup
and do a textbook spread
base R
In base R we can use split
and reshape
and follow the same ideas, with a bit of tedious reformatting in the end to pad the narrower sub data.frames.
df_list <-
lapply(split(df,df$ID),function(x) {
x <- subset(x,cumsum(Date1==Date2)>0)
x <- transform(x, Capital=Capital[1], time = seq(nrow(x)))
reshape(x,idvar=c("ID","Capital","Date1"),direction="wide",sep="",drop="Date2")
})
all_names <- names(df_list[[which.max(lengths(df_list))]])
df_list_full <- lapply(df_list,function(x) {x[setdiff(all_names,names(x))] <- NA;x})
do.call(rbind, df_list_full)
# ID Capital Date1 Instal1 Instal2 Instal3 Instal4
# 2 2 450 a 15 10 0 NA
# 4 4 90 b 20 15 10 0
data :
df <- read.table(text = "ID Capital Instal Date1 Date2
2 500 25 a b
2 500 20 a c
2 450 15 a a
2 300 10 a f
2 250 0 a z
4 100 25 b a
4 90 20 b b
4 80 15 b a
4 75 10 b f
4 25 0 b z",h=T,strin=F)
Upvotes: 2
Reputation: 18425
Here is another tidyverse
solution...
library(dplyr)
library(tidyr)
df2 <- df %>%
group_by(ID) %>% #group by ID
mutate(ind=cumsum(Date1==Date2)) %>% #mark elements after first Date1==Date2
filter(ind!=0) %>% #remove previous elements
summarise(Date1=first(Date1),
Capital=first(Capital),
Instal=list(Instal)) %>% #capture values for table
unnest() %>% #spread Instal, one value per row
group_by(ID) %>%
mutate(Inst=paste0("Instal",row_number())) %>% #mark names of Instal values
spread(key=Inst,value=Instal) #spread into wide format
df2
ID Date1 Capital Instal1 Instal2 Instal3 Instal4
1 2 a 450 15 10 0 NA
2 4 b 90 20 15 10 0
Upvotes: 1
Reputation: 23598
A bit in a roundabout way. I'm sure there is a faster way of doing this, but this gets you exactly what you expect as output.
Steps: when date1 == date2 select the row_number and fill it out after the selection. Filter these records and select only the needed columns. create a column to be used as headers in the spread and spread the Instal data. next join a subset of the data to get the correct value of Capital and join this table with the previous selection.
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
mutate(rowid = row_number(),
selection = ifelse(Date1 == Date2, rowid, NA)) %>%
fill(selection) %>% # fill rowid over the rows after date1 == date2
filter(!is.na(selection)) %>%
select(ID, Date1, Instal) %>%
mutate(Installation = paste0("Instal", row_number())) %>%
spread(Installation, Instal) %>%
inner_join(df %>% filter(Date1 == Date2) %>% select(ID, Date1, Capital), .)
ID Date1 Capital Instal1 Instal2 Instal3 Instal4
1 2 a 450 15 10 0 NA
2 4 b 90 20 15 10 0
data:
df <- structure(list(ID = c(2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L),
Capital = c(500L, 500L, 450L, 300L, 250L, 100L, 90L, 80L,
75L, 25L), Instal = c(25L, 20L, 15L, 10L, 0L, 25L, 20L, 15L,
10L, 0L), Date1 = c("a", "a", "a", "a", "a", "b", "b", "b",
"b", "b"), Date2 = c("b", "c", "a", "f", "z", "a", "b", "a",
"f", "z")), class = "data.frame", row.names = c(NA, -10L))
Upvotes: 0