Santosh M.
Santosh M.

Reputation: 2454

Organizing Multidimensional Data in R

I am trying to organize multidimensional data in R. The data is extracted in R from CSV file. My data in data frame of R is, as following:

 Rank     Arrangers     YearAmt
                         1994
 1         JPM          6,605.00
 2         UBS          7,806.00
 3         RBS          1,167.34

                         1995  
 1         Citi         1,150.00
 2         Scotiabank   483.33
 3         ING          800.56
 4         UniCredit    700.70

This is just a toy data. Original dataset is large. I would like to subset the data by year like 1994, 1995 etc. So that I can conduct some analysis. I have tried to subset the data set by factor/level using sapply and subset. But, I realized R is just treating 1994 and 1995 as a data in a row. I am thinking to format the original csv file by creating Year as a separate column and then putting a corresponding year in a field for all the rows.

I would appreciate any help in suggesting a way to organize data in R. I am expecting an output like this:

Rank       Arrangers    YearAmt    Year 
 1         JPM          6,605.00   1994
 2         UBS          7,806.00   1994
 3         RBS          1,167.34   1994
 1         Citi         1,150.00   1995
 2         Scotiabank   483.33     1995
 3         ING          800.56     1995
 4         UniCredit    700.70     1995

Upvotes: 1

Views: 956

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 270428

1) ave Using cumsum(Rank == "") to create a grouping variable for years, this uses ave to create a Year column creating within each group of year rows a Year consisting of NA followed by the year repeated. Finally use na.omit to remove the rows with NA. No packages are used:

na.year <- function(x) c(NA, rep(x[1], length(x) - 1))  # c(NA, x[1], x[1], ..., x[1])
na.omit( transform(df1, Year = ave(YearAmt, cumsum(Rank == ""), FUN = na.year)) ) 

Using the input df1 reproducibly defined in the answer from @akrun we get:

  Rank  Arrangers  YearAmt Year
2    1        JPM 6,605.00 1994
3    2        UBS 7,806.00 1994
4    3        RBS 1,167.34 1994
6    1       Citi 1,150.00 1995
7    2 Scotiabank   483.33 1995
8    3        ING   800.56 1995
9    4  UniCredit   700.70 1995

2) by Using by split df1 into years applying addYear to each component of the split. Finally put them back together. No packages are used.

addYear <- function(x) cbind(x[-1, ], Year = x[1, "YearAmt"])
do.call("rbind", by(df1, cumsum(df1$Rank == ""), addYear))

3) sqldf Using the sqldf package we can join each row of df1 with all prior rows of itself having a zero length rank Rank taking the maximum YearAmt of those to form the Year. Then keep only those rows having a non-zero length Rank.

library(sqldf)

sqldf("select b.*, max(a.YearAmt) Year 
       from df1 a join df1 b on a.rowid < b.rowid and a.Rank = '' 
       group by b.rowid 
       having b.Rank != ''")

Upvotes: 2

akrun
akrun

Reputation: 887981

We create a logical vector based on the blank elements in 'Rank' ('i1'), then subset the rows of 'df1' by removing all the blank rows using 'i1' (df1[!i1,]) and transform the dataset to create the 'Year' column by replicating the 'YearAmt' (that corresponds to the blank in 'Rank') using the cumulative sum of 'i1'.

i1 <- df1$Rank == ''
res <- transform(df1[!i1,], Year = df1$YearAmt[i1][cumsum(i1)[!i1]])
res
#  Rank  Arrangers  YearAmt Year 
#2    1        JPM 6,605.00 1994
#3    2        UBS 7,806.00 1994
#4    3        RBS 1,167.34 1994
#6    1       Citi 1,150.00 1995
#7    2 Scotiabank   483.33 1995
#8    3        ING   800.56 1995
#9    4  UniCredit   700.70 1995

Or as @G.Grothendieck mentioned in the comments, the transform step can be made compact by

res <- transform(df1, Year = YearAmt[i1][cumsum(i1)])[!i1, ]
row.names(res) <- NULL

NOTE: No external packages are needed. Only baseverse..


Or using dtverse/zooverse

library(data.table)
library(zoo)
setDT(df1)[Rank=='', Year:= YearAmt][, Year := na.locf(Year)][Rank!='']
#   Rank  Arrangers  YearAmt Year
#1:    1        JPM 6,605.00 1994
#2:    2        UBS 7,806.00 1994
#3:    3        RBS 1,167.34 1994
#4:    1       Citi 1,150.00 1995
#5:    2 Scotiabank   483.33 1995 
#6:    3        ING   800.56 1995
#7:    4  UniCredit   700.70 1995

data

df1 <- structure(list(Rank = c("", "1", "2", "3", "", "1", "2", "3", 
"4"), Arrangers = c("", "JPM", "UBS", "RBS", "", "Citi", "Scotiabank",   
"ING", "UniCredit"), YearAmt = c("1994", "6,605.00", "7,806.00", 
"1,167.34", "1995", "1,150.00", "483.33", "800.56", "700.70")), 
.Names = c("Rank", 
"Arrangers", "YearAmt"), row.names = c(NA, -9L), class = "data.frame")

Upvotes: 1

alistaire
alistaire

Reputation: 43364

A tidyverse option:

library(dplyr)
library(tidyr)

       # add Year column, with NAs where no year in row
df %>% mutate(Year = ifelse(Rank == '' & Arrangers == '', YearAmt, NA)) %>% 
  # fill year downwards
  fill(Year) %>% 
  # chop out year rows
  filter(Rank != '', Arrangers != '')

##   Rank  Arrangers  YearAmt Year
## 1    1        JPM 6,605.00 1994
## 2    2        UBS 7,806.00 1994
## 3    3        RBS 1,167.34 1994
## 4    1       Citi 1,150.00 1995
## 5    2 Scotiabank   483.33 1995
## 6    3        ING   800.56 1995
## 7    4  UniCredit   700.70 1995

Upvotes: 1

Related Questions