CISCO
CISCO

Reputation: 539

Can I identify by row the first and last observation

would like to identify the number of firms that start and end each month. The goal is to say by column how much firms start and end.

My data looks like this, with many more rows and columns.

  Firm   Return_1990_01  Return_1990_02 Return_1990_03 Return_1990_04 Return_1990_05 
#1 fg23         NaN             NaN             1.54          2.34        .641      
#2 sdf1         1.35            NaN             3.53          NaN         .231     
#3 sdf1         1.12            2.44            1.51          1.64         NaN     

One challenge is that a firm can have NaNs in between. For example, Row 2 the firm begins 1990_01 and ends 1990_05 despite NaNs in between.

I tried the following code

library(dplyr)
library(tidyr)

df %>% 
  gather(month, value, -Firm) %>% 
  filter(!is.nan(value)) %>% 
  arrange(Firm, month) %>% 
  group_by(Firm) %>% 
  summarise(start = first(month), end = last(month))

But get the following error message

Error in arrange_impl(.data, dots) : 
  data frame column with incompatible number of rows (465), expecting : 59378

Any help is appreciated.

Upvotes: 2

Views: 76

Answers (3)

akrun
akrun

Reputation: 886948

With tidyverse, we can do this without any reshaping with pmap. Find the names of the elements that are not NaN with which, get the first and last column names

library(tidyverse)
df  %>% 
   transmute(Firm, start_end = pmap(.[-1], ~ 
       which(!is.nan(c(...))) %>%
       names %>%
       range %>%
       {tibble(start = first(.), end = last(.))})) %>%
   unnest
#  Firm          start            end
#1 fg23 Return_1990_03 Return_1990_05
#2 sdf1 Return_1990_01 Return_1990_05
#3 sdf1 Return_1990_01 Return_1990_04

In base R, we can also do this in a vectorized way with max.col

m1 <- !is.na(df[-1])
start <- colnames(m1)[max.col(m1, "first")]
end <- colnames(m1)[max.col(m1, "last")]
cbind(df1['Firm'], start, end)
#  Firm          start            end
#1 fg23 Return_1990_03 Return_1990_05
#2 sdf1 Return_1990_01 Return_1990_05
#3 sdf1 Return_1990_01 Return_1990_04

Upvotes: 0

Ronak Shah
Ronak Shah

Reputation: 388817

Another way to represent this with column names using tidyverse. We gather the data into long format and select only first and last value for each row. Create a new column (temp) which holds "Start" and "End" for each group and spread it to wide format.

library(dplyr)
library(tidyr)

df %>%
  mutate(row = row_number()) %>%
  gather(key, value, -Firm, -row, na.rm = TRUE) %>%
  group_by(row) %>%
  slice(c(1L, n())) %>%
  mutate(temp = c("Start", "End")) %>%
  select(-value) %>%
  spread(temp, key) %>%
  ungroup %>%
  select(-row) %>%
  select(Firm, Start, End)

#  Firm  Start          End           
#  <fct> <chr>          <chr>         
#1 fg23  Return_1990_03 Return_1990_05
#2 sdf1  Return_1990_01 Return_1990_05
#3 sdf1  Return_1990_01 Return_1990_04

Upvotes: 1

dww
dww

Reputation: 31452

You can do

apply(df[,-1], 1, function(x) range(which(!is.nan(x))))
#      [,1] [,2] [,3]
# [1,]    3    1    1
# [2,]    5    5    4

If you want to add names to rows and columns then we can extend this as:

apply(df[,-1], 1, function(x) range(which(!is.nan(x)))) %>%
  t %>%
  `colnames<-`(c('First','Last')) %>%
  `row.names<-`(df[,1])
#      First Last
# fg23     3    5
# sdf1     1    5
# sdf1     1    4

Upvotes: 2

Related Questions