Reputation: 2454
I am trying to organize multidimensional data
in R
. The data is extracted in R
from CSV
file. My data in data frame of R
is, as following:
Rank Arrangers YearAmt
1994
1 JPM 6,605.00
2 UBS 7,806.00
3 RBS 1,167.34
1995
1 Citi 1,150.00
2 Scotiabank 483.33
3 ING 800.56
4 UniCredit 700.70
This is just a toy data. Original dataset is large. I would like to subset the data by year
like 1994
, 1995
etc. So that I can conduct some analysis. I have tried to subset the data set
by factor/level
using sapply
and subset
. But, I realized R
is just treating 1994
and 1995
as a data in a row. I am thinking to format the original csv
file by creating Year
as a separate column
and then putting a corresponding year
in a field
for all the rows
.
I would appreciate any help in suggesting a way to organize data in R
. I am expecting an output like this:
Rank Arrangers YearAmt Year
1 JPM 6,605.00 1994
2 UBS 7,806.00 1994
3 RBS 1,167.34 1994
1 Citi 1,150.00 1995
2 Scotiabank 483.33 1995
3 ING 800.56 1995
4 UniCredit 700.70 1995
Upvotes: 1
Views: 956
Reputation: 270428
1) ave Using cumsum(Rank == "")
to create a grouping variable for years, this uses ave
to create a Year
column creating within each group of year rows a Year
consisting of NA followed by the year repeated. Finally use na.omit
to remove the rows with NA. No packages are used:
na.year <- function(x) c(NA, rep(x[1], length(x) - 1)) # c(NA, x[1], x[1], ..., x[1])
na.omit( transform(df1, Year = ave(YearAmt, cumsum(Rank == ""), FUN = na.year)) )
Using the input df1
reproducibly defined in the answer from @akrun we get:
Rank Arrangers YearAmt Year
2 1 JPM 6,605.00 1994
3 2 UBS 7,806.00 1994
4 3 RBS 1,167.34 1994
6 1 Citi 1,150.00 1995
7 2 Scotiabank 483.33 1995
8 3 ING 800.56 1995
9 4 UniCredit 700.70 1995
2) by Using by
split df1
into years applying addYear
to each component of the split. Finally put them back together. No packages are used.
addYear <- function(x) cbind(x[-1, ], Year = x[1, "YearAmt"])
do.call("rbind", by(df1, cumsum(df1$Rank == ""), addYear))
3) sqldf Using the sqldf package we can join each row of df1
with all prior rows of itself having a zero length rank Rank
taking the maximum YearAmt
of those to form the Year
. Then keep only those rows having a non-zero length Rank
.
library(sqldf)
sqldf("select b.*, max(a.YearAmt) Year
from df1 a join df1 b on a.rowid < b.rowid and a.Rank = ''
group by b.rowid
having b.Rank != ''")
Upvotes: 2
Reputation: 887981
We create a logical vector based on the blank elements in 'Rank' ('i1'), then subset the rows of 'df1' by removing all the blank rows using 'i1' (df1[!i1,]
) and transform
the dataset to create the 'Year' column by replicating the 'YearAmt' (that corresponds to the blank in 'Rank') using the cumulative sum of 'i1'.
i1 <- df1$Rank == ''
res <- transform(df1[!i1,], Year = df1$YearAmt[i1][cumsum(i1)[!i1]])
res
# Rank Arrangers YearAmt Year
#2 1 JPM 6,605.00 1994
#3 2 UBS 7,806.00 1994
#4 3 RBS 1,167.34 1994
#6 1 Citi 1,150.00 1995
#7 2 Scotiabank 483.33 1995
#8 3 ING 800.56 1995
#9 4 UniCredit 700.70 1995
Or as @G.Grothendieck mentioned in the comments, the transform
step can be made compact by
res <- transform(df1, Year = YearAmt[i1][cumsum(i1)])[!i1, ]
row.names(res) <- NULL
NOTE: No external packages are needed. Only baseverse
..
Or using dtverse/zooverse
library(data.table)
library(zoo)
setDT(df1)[Rank=='', Year:= YearAmt][, Year := na.locf(Year)][Rank!='']
# Rank Arrangers YearAmt Year
#1: 1 JPM 6,605.00 1994
#2: 2 UBS 7,806.00 1994
#3: 3 RBS 1,167.34 1994
#4: 1 Citi 1,150.00 1995
#5: 2 Scotiabank 483.33 1995
#6: 3 ING 800.56 1995
#7: 4 UniCredit 700.70 1995
df1 <- structure(list(Rank = c("", "1", "2", "3", "", "1", "2", "3",
"4"), Arrangers = c("", "JPM", "UBS", "RBS", "", "Citi", "Scotiabank",
"ING", "UniCredit"), YearAmt = c("1994", "6,605.00", "7,806.00",
"1,167.34", "1995", "1,150.00", "483.33", "800.56", "700.70")),
.Names = c("Rank",
"Arrangers", "YearAmt"), row.names = c(NA, -9L), class = "data.frame")
Upvotes: 1
Reputation: 43364
A tidyverse option:
library(dplyr)
library(tidyr)
# add Year column, with NAs where no year in row
df %>% mutate(Year = ifelse(Rank == '' & Arrangers == '', YearAmt, NA)) %>%
# fill year downwards
fill(Year) %>%
# chop out year rows
filter(Rank != '', Arrangers != '')
## Rank Arrangers YearAmt Year
## 1 1 JPM 6,605.00 1994
## 2 2 UBS 7,806.00 1994
## 3 3 RBS 1,167.34 1994
## 4 1 Citi 1,150.00 1995
## 5 2 Scotiabank 483.33 1995
## 6 3 ING 800.56 1995
## 7 4 UniCredit 700.70 1995
Upvotes: 1