Valeria Arango
Valeria Arango

Reputation: 327

Function to eliminate rows from a dataframe with certain condition in R

everyone!

I will try to explain my problem. It is very difficult for me. I Hope you can help me:

I have a data frame, lets call it DF1, that looks like the next one:

|Symbol |   Date    | Volume |  Price|
|----------------------------|-------|
|A      |2014-01-01 | 0      |    4  |
|A      |2014-01-02 | 7      |    7  |
|A      |2014-01-03 | 8      |    9  |
|A      |2014-01-04 | 1      |    5  |
|B      |2014-01-01 |45      |    6  |
|B      |2014-01-02 |0       |   11  |
|B      |2014-01-03 |34      |    8  |
|B      |2014-01-04 |45      |    5  |
|C      |2014-01-01 |4       |    6  |
|C      |2014-01-02 |0       |    5  |
|C      |2014-01-03 |14      |   25  |
|D      |2014-01-01 |31      |    4  |
|D      |2014-01-02 |7       |    6  |
|D      |2014-01-03 |18      |    3  |
|D      |2014-01-04 |15      |    7  |
|E      |2014-01-01 |13      |    8  |
|E      |2014-01-02 |0       |    9  |

Having this dataframe I create a new dataframe, let's call it DF2, through the following lines of code:

RM <- DF1 %>% group_by(Date) %>%
  mutate(weight = Volume/sum(Volume),      
         R_i = weight*(log(Price)-log(lag(Price)))) %>%          
  summarise(RM = sum(R_i, na.rm = TRUE))

And from RM, I select only the dates that are of my interest :

RM_reg <- subset(RM, date >= "2014-03-05" & date<="2014-09-03")

Finally, RM_reg looks like this:

|   Date    | RM     |
|2014-03-05 | 0      |
|2014-03-06 | 7      | 
|2014-03-07 | 8      |    
|2014-03-08 | 1      |    
|2014-03-09 | 45     |    
|2014-03-10 | 0      |   
|2014-03-11 | 34     |    
|2014-03-12 | 45     |   
|2014-03-13 | 4      |    
|2014-03-14 | 0      |    
|2014-03-15 | 14     |  
|2014-03-16 | 31     |    

It should be noted that the values in the RM_reg column are not the actual values, but only examples. Starting from my original dataframe, RM_reg has 125 rows.

Then, from dataframe DF1, I extract the rows for which the Company column is equal to A through the following code:

DF_A <- DF_1%>%
  filter(Symbol=="A")

And I add a column of returns to the dataframe DF_A, through the following code:

RA <- DF_A %>% group_by(Symbol)%>%
  mutate(Ret_i = log(Price) - lag(log(Price)))

I eliminate the first row, which is NA:

AR <- na.omit(RA)

And from AR, I select only the dates that are of my interest :

AR_reg <- subset(AR, date >= "2014-03-05" & date<="2014-09-03")

AR_reg looks like this:

|Symbol |   Date    | volume |price | Ret_i  |
|--------------------------------------------|
|A      |2014-03-05 | 1      |   5  |   2    |
|A      |2014-03-06 | 3      |   8  |   3    |
|A      |2014-03-07 | 7      |   4  |   4    |
|A      |2014-03-08 |3       |   6  |   5    |
|A      |2014-03-09 |34      |   7  |   1    |
|A      |2014-03-10 |45      |  34  |   4    |
|A      |2014-03-11 |4       |   5  |   3    |
|A      |2014-03-12 |9       |   7  |   5    |
|A      |2014-03-13 |8       |   6  |   6    |
|A      |2014-03-14 |4       |   4  |   1    |
|A      |2014-03-15 |0       |   7  |   4    |
|A      |2014-03-16 |4       |   7  |   7    |

It should be noted that the values in the AR_reg column are not the actual values, but only examples. Starting from my original dataframe, AR_reg also has 125 rows.

Finally, because RM_reg and AR_reg I can regress the Ret_i column of AR_reg on the RM column of RM_reg through the following code:

mod <- lm(AR_reg$Ret_i ~ RM_reg$RM) 

What I need to do is to do the same as described above for all the Symbols in the dataframe DF1, in this case for, "B", "C", "D", "E". The problem is that we do not have the same amount of entries, or the same amount of rows corresponding to all Symbols, and this is a necessary condition to be able to do the regression. To do the regression I need to have 125 observations of returns for each Symbol.

What I have thought is to eliminate the Symbols for which the dataframe similar to AR_reg that is generated does not have 125 entries or rows; but the truth is that I do not know how to do this, I suppose that a function must be raised but this is a subject that I still do not dominate.

Thank you very much for reading me, I hope you have understood me. Any help or suggestion will be very appreciated

Translated with www.DeepL.com/Translator (free version)

Upvotes: 1

Views: 41

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389225

Join DF1 with RM by Date, keep only data between specific dates, for each Symbol calculate Ret_i and drop NA values and create list of models.

The complete code would look like :

library(dplyr)

DF1$Date <- as.Date(DF1$Date)

RM <- DF1 %>% 
  group_by(Date) %>%
  mutate(weight = Volume/sum(Volume),      
         R_i = weight*(log(Price)-log(lag(Price)))) %>%          
  summarise(RM = sum(R_i, na.rm = TRUE))


result <- DF1 %>%
  left_join(RM, by = 'Date') %>%
  filter(between(Date, as.Date("2014-03-05"), as.Date("2014-09-03")))
  group_by(Symbol) %>%
  mutate(Ret_i = log(Price) - lag(log(Price))) %>%
  na.omit() %>%
  summarise(model = list(lm(Ret_i~RM)))

result

Upvotes: 1

Related Questions