Achak
Achak

Reputation: 1296

Subsetting dataset for weekday and weekend and calculating sum of columns

I am working with a hourly dataset for 1 year and trying to calculate sum of weekday (Mon - Fri) and weekend (sat, sun) values and add to a new row at the bottom of the data set. my data frame looks like this:

enter image description here

I have created the sum for all week with the following command:

## calculate column sum
df[366,(3:27)] <- colSums(df[,3:27], na.rm = T)

Then calculated the number of missing values in each column with this:

## calculate number of missing values in a column

NA.find <- function(x) length(which(is.na(x))) #function for finding missing values
myNumCols <- which(unlist(lapply(df, is.numeric))) 
df[(nrow(df) + 1), myNumCols] <- sapply(df[, myNumCols],NA.find )

Now I am trying to calculate the weekday and weekend sum between (Column v1:total) and also number of missing values and add the the bottom rows like I did for the all days.

I know that I need to somehow define my date so that R can read it and then use some how to subset weekday and weekend, I figure out how to subset individual like this:

e = colsplit(df1$date,split=" ",names=c("day1","day2","month"))

df2 = cbind(df1[,c("type","date")],day=e[1],cdate=e[2],month=e[3],df1[,3:ncol(df1)])
df3 = subset(df2, day1 == "Mon")

But I am not sure how can subset it say Mon:Friday and again Saturday:Sunday?

What I am trying to achieve is to create 4 new rows from (368:371) and add weekday total, weekday number of missing values, weekend total, weekend number of missing values.

Many thanks, Ayan

Upvotes: 1

Views: 2672

Answers (1)

csgillespie
csgillespie

Reputation: 60492

A few comments.

  1. Don't add column sums to your data frame. For example, in your code, you have:

    df[366,(3:27)] <- colSums(df[,3:27], na.rm = T)
    

    Suppose we subset your data frame, to every weekend. You column sums are wrong. Store them as a separate vector:

    year_sums =colSums(df[,3:27], na.rm = TRUE)
    
  2. Since you no longer have column sums in your data frame, it's easier to query. So to determine the number of missing values, do something like:

    colSums(is.na(dd[,3:27])
    

    R treats TRUE as 1 and FALSE as 0.

  3. Now to answer your question. The clever way is to use R's date object or use the lubridate package. However, I would suggest you start with something a bit easier. Just create a new column and subset. In your example, the first day of the week is Friday. So

    ##M for mid week
    ##W for week end 
    df$type = c("M", "W", "W", "M", "M", "M", "M")
    

    creates a column identifying if the data is weekend or mid-week. So to subset, just:

    df[df$type == "M",]
    

Upvotes: 2

Related Questions