BuJay
BuJay

Reputation: 127

How do I manipulate a data.frame to arrive at this desired result? Lag somehow?

I am trying to manipulate a data table in a difficult to describe manner. My gut tells me it’s some kind of lag but I'm not sure - so I'll illustrate.

I can do this easily in excel but my data set is too large for excel to handle efficiently.

Initial Data:

Column1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C")
Column2 <- c(201801, 201802, 201803, 201804, 201803, 201804, 201805, 201803, 201804)
Column3 <- c("Active", "Active", "Active", "Closed", "Active", "Active", "CO", "Active", "BK")
Column4 <- c(100, 97, 95, 0, 50, 45, 45, 100, 90)
(dat <- dplyr::tibble(Column1, Column2, Column3, Column4))

# A tibble: 9 x 4
  Column1 Column2 Column3 Column4
  <chr>     <dbl> <chr>     <dbl>
1 A        201801 Active      100
2 A        201802 Active       97
3 A        201803 Active       95
4 A        201804 Closed        0
5 B        201803 Active       50
6 B        201804 Active       45
7 B        201805 CO           45
8 C        201803 Active      100
9 C        201804 BK           90

Desired output:

Col1 <- c("A", "A", "A", "B", "B", "C")
Col2 <- c(201010, 201802, 201803, 201003, 201804, 201803)
Col3 <- c(201802, 201803, 201804, 201804, 201805, 201804)
Col4 <- c("Active", "Active", "Active", "Active", "Active", "Active")
Col5 <- c("Active", "Active", "Closed", "Active", "CO", "BK")
Col6 <- c(100, 97, 95, 50, 45, 100)
Col7 <- c(97, 95, 0, 45, 45, 90)
(dat_desired <- dplyr::tibble(Col1, Col2, Col3, Col4, Col5, Col6, Col7))

# A tibble: 6 x 7
  Col1    Col2   Col3 Col4   Col5    Col6  Col7
  <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl> <dbl>
1 A     201001 201802 Active Active   100    97
2 A     201802 201803 Active Active    97    95
3 A     201803 201804 Active Closed    95     0
4 B     201003 201804 Active Active    50    45
5 B     201804 201805 Active CO        45    45
6 C     201803 201804 Active BK       100    90

Incidentally, based on some of the suggestions below, I have tried the following (but am generating the errors below):

library(zoo)

R <- read_excel("H:/R Programs/R_Data.xlsx")

Column1 = as.vector(R[,1])
Column2 = as.vector(R[,2])
Column3 = as.vector(R[,3])
Column4 = as.vector(R[,4])
Column5 = as.vector(R[,5])

(dat <- dplyr::tibble(Column1, Column2, Column3, Column4, Column5))

# A tibble: 415,533 x 5
Column1$Loan_Key Column2$File_Run_Date Column3$Status Column4$Days Column5$Bal
<dbl>                 <dbl> <chr>                 <dbl>       <dbl>
1         11111111              20180201 ACTIVE                  -19      24472.
2         11111111              20180301 ACTIVE                  -19      24264.
3         11111111              20180401 ACTIVE                  -19      23991.
4         11111111              20180501 BK                      -49      23350.
5         11111111              20180601 BK                      -19      23488.
6         11111111              20180701 BK                      -19      23169.
7         11111111              20180801 BK                      -19      23008.
8         11111111              20180901 BK                      -19      22693.
9         11111111              20181001 BK                      -19      22378.
10        11111111              20181101 BK                      -19      22192.

# ... with 415,523 more rows

(data.frame(rollapply(data = dat, 2, c)) %>% filter(X1 == X2) %>% 
    select(-X2) %>% setNames(paste0("Col", 1:9)))

Error in ncol(xj) : object 'xj' not found

Upvotes: 0

Views: 86

Answers (3)

Paul
Paul

Reputation: 2959

You can do this with the rollapply() function from zoo:

library(plyr)
library(dplyr)
library(zoo)

rollapply(data = dat, 2, c) %>% # returns a character matrix
  data.frame() %>%
  colwise(type.convert, as.is = T)(.) %>% # Guesses column classes
  filter(X1 == X2) %>% # only reports the same Column1 values
  select(-X2) %>%
  setNames(paste0("Col", 1:7)) %>%
  as_tibble() # optional

# A tibble: 6 x 7
  Col1    Col2   Col3 Col4   Col5    Col6  Col7
  <chr>  <int>  <int> <chr>  <chr>  <int> <int>
1 A     201801 201802 Active Active   100    97
2 A     201802 201803 Active Active    97    95
3 A     201803 201804 Active Closed    95     0
4 B     201803 201804 Active Active    50    45
5 B     201804 201805 Active CO        45    45
6 C     201803 201804 Active BK       100    90

This assumes the rows are already in the correct order.

Upvotes: 1

BuJay
BuJay

Reputation: 127

Thanks to Paul (and to you others) I think this gets me to where I need to be. it took me quite a while to implement what ended up being 1 line.... :(

R <- read_excel("H:/R Programs/R_Data.xlsx")
x=data.frame(rollapply(data = R, 2, c)) #THIS WAS THE KEY STEP
Transition = x %>% filter(as.character(X1)==as.character(X2)) %>% select(-X2) #THIS AS.CHARACTER ALSO CAME INTO PLAY DUE TO DIFFERENT FACTOR LEVELS (FOR SOME REASON)
write_xlsx(x=Transition, path="C:/Transition_Matrix_Data.xlsx")

Upvotes: 0

Maurits Evers
Maurits Evers

Reputation: 50668

Isn't this basically just a self-join (i.e. a left_join of dat with dat on "Column1")?

library(tidyverse)
dat %>%
    left_join(dat, by = c("Column1")) %>%
    filter(Column2.y - Column2.x == 1) %>%
    select(
        Col1 = Column1,
        Col2 = Column2.x, Col3 = Column2.y,
        Col4 = Column3.x, Col5 = Column3.y,
        Col6 = Column4.x, Col7 = Column4.y)
## A tibble: 6 x 7
#  Col1    Col2   Col3 Col4   Col5    Col6  Col7
#  <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl> <dbl>
#1 A     201801 201802 Active Active   100    97
#2 A     201802 201803 Active Active    97    95
#3 A     201803 201804 Active Closed    95     0
#4 B     201803 201804 Active Active    50    45
#5 B     201804 201805 Active CO        45    45
#6 C     201803 201804 Active BK       100    90

Upvotes: 2

Related Questions