Reputation: 539
would like to identify the number of firms that start and end each month. The goal is to say by column how much firms start and end.
My data looks like this, with many more rows and columns.
Firm Return_1990_01 Return_1990_02 Return_1990_03 Return_1990_04 Return_1990_05
#1 fg23 NaN NaN 1.54 2.34 .641
#2 sdf1 1.35 NaN 3.53 NaN .231
#3 sdf1 1.12 2.44 1.51 1.64 NaN
One challenge is that a firm can have NaNs in between. For example, Row 2 the firm begins 1990_01 and ends 1990_05 despite NaNs in between.
I tried the following code
library(dplyr)
library(tidyr)
df %>%
gather(month, value, -Firm) %>%
filter(!is.nan(value)) %>%
arrange(Firm, month) %>%
group_by(Firm) %>%
summarise(start = first(month), end = last(month))
But get the following error message
Error in arrange_impl(.data, dots) :
data frame column with incompatible number of rows (465), expecting : 59378
Any help is appreciated.
Upvotes: 2
Views: 76
Reputation: 886948
With tidyverse
, we can do this without any reshaping with pmap
. Find the names
of the elements that are not NaN with which
, get the first
and last
column names
library(tidyverse)
df %>%
transmute(Firm, start_end = pmap(.[-1], ~
which(!is.nan(c(...))) %>%
names %>%
range %>%
{tibble(start = first(.), end = last(.))})) %>%
unnest
# Firm start end
#1 fg23 Return_1990_03 Return_1990_05
#2 sdf1 Return_1990_01 Return_1990_05
#3 sdf1 Return_1990_01 Return_1990_04
In base R
, we can also do this in a vectorized way with max.col
m1 <- !is.na(df[-1])
start <- colnames(m1)[max.col(m1, "first")]
end <- colnames(m1)[max.col(m1, "last")]
cbind(df1['Firm'], start, end)
# Firm start end
#1 fg23 Return_1990_03 Return_1990_05
#2 sdf1 Return_1990_01 Return_1990_05
#3 sdf1 Return_1990_01 Return_1990_04
Upvotes: 0
Reputation: 388817
Another way to represent this with column names using tidyverse
. We gather
the data into long format and select only first and last value for each row. Create a new column (temp
) which holds "Start"
and "End"
for each group and spread
it to wide format.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
gather(key, value, -Firm, -row, na.rm = TRUE) %>%
group_by(row) %>%
slice(c(1L, n())) %>%
mutate(temp = c("Start", "End")) %>%
select(-value) %>%
spread(temp, key) %>%
ungroup %>%
select(-row) %>%
select(Firm, Start, End)
# Firm Start End
# <fct> <chr> <chr>
#1 fg23 Return_1990_03 Return_1990_05
#2 sdf1 Return_1990_01 Return_1990_05
#3 sdf1 Return_1990_01 Return_1990_04
Upvotes: 1
Reputation: 31452
You can do
apply(df[,-1], 1, function(x) range(which(!is.nan(x))))
# [,1] [,2] [,3]
# [1,] 3 1 1
# [2,] 5 5 4
If you want to add names to rows and columns then we can extend this as:
apply(df[,-1], 1, function(x) range(which(!is.nan(x)))) %>%
t %>%
`colnames<-`(c('First','Last')) %>%
`row.names<-`(df[,1])
# First Last
# fg23 3 5
# sdf1 1 5
# sdf1 1 4
Upvotes: 2