Indrute

Reputation: 135

Count events with a specific condition for event meaning in R

I am a beginner in R, but until now I have been able to find answers to all of my R coding questions. This time I don't even know where to start, so I would be grateful for your help. I have a .csv file; a fragment of my data is below:

   year                 1k    2k    3k    4k    5k
 1 1981-01-01 00:00:00  NA    NA    NA    NA    NA
 2 1981-01-02 00:00:00  NA    NA    NA    NA    NA
 3 1981-01-03 00:00:00  NA    NA    NA    NA    NA
 4 1981-01-04 00:00:00  NA    NA    COLD  COLD  COLD
 5 1981-01-05 00:00:00  NA    NA    NA    COLD  NA
 6 1981-01-06 00:00:00  COLD  NA    NA    COLD  NA
 7 1981-01-07 00:00:00  COLD  NA    NA    COLD  NA
 8 1981-01-08 00:00:00  COLD  NA    NA    COLD  COLD
 9 1981-01-09 00:00:00  COLD  NA    NA    NA    NA
10 1981-01-10 00:00:00  NA    NA    NA    NA    NA
11 1981-01-11 00:00:00  NA    COLD  NA    NA    NA
12 1981-01-12 00:00:00  NA    COLD  NA    COLD  COLD
13 1981-01-13 00:00:00  NA    NA    NA    NA    NA
14 1981-01-14 00:00:00  COLD  NA    NA    NA    NA
15 1981-01-15 00:00:00  NA    NA    NA    NA    NA
16 1981-01-16 00:00:00  COLD  NA    NA    NA    NA
17 1981-01-17 00:00:00  NA    NA    NA    NA    NA
18 1981-01-18 00:00:00  NA    NA    NA    COLD  NA
19 1981-01-19 00:00:00  NA    NA    NA    COLD  NA
20 1981-01-20 00:00:00  NA    NA    NA    COLD  NA

I know how to count the number of COLD events, but I need a specific condition: count every event that has 3 or more COLD records in a row, and then sum those events within each column (k1, k2, k3, etc.) separately. For the example data above, the result would be k1=1, k2=0, k3=0, k4=2, k5=0. I was thinking of using repeat loops, but I really don't know where to start.

Upvotes: 2

Views: 192

Answers (4)

Rodrigo Lustosa

Reputation: 76

I think the best option is to use loops as you said. I rewrote your example below, so it's reproducible.

col_year <- c("1981-01-01 00:00:00","1981-01-02 00:00:00","1981-01-03 00:00:00","1981-01-04 00:00:00",
              "1981-01-05 00:00:00","1981-01-06 00:00:00","1981-01-07 00:00:00","1981-01-08 00:00:00",
              "1981-01-09 00:00:00","1981-01-10 00:00:00","1981-01-11 00:00:00","1981-01-12 00:00:00",
              "1981-01-13 00:00:00","1981-01-14 00:00:00","1981-01-15 00:00:00","1981-01-16 00:00:00",
              "1981-01-17 00:00:00","1981-01-18 00:00:00","1981-01-19 00:00:00","1981-01-20 00:00:00")


k1 <- c(rep(NA,5),rep("COLD",4),rep(NA,4),"COLD",NA,"COLD",rep(NA,4))
k2 <- c(rep(NA,10),rep("COLD",2),rep(NA,8))
k3 <- c(rep(NA,3),"COLD",rep(NA,16))
k4 <- c(rep(NA,3),rep("COLD",5),rep(NA,3),"COLD",rep(NA,5),rep("COLD",3))
k5 <- c(rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,8))

df <- data.frame(year=col_year,k1,k2,k3,k4,k5)

I will subset the data frame so that it is easier to index, as done by @denisafonin (answer excluded). Then I will initialize two counters: one for the length of the current run of consecutive "COLD" records, and one for the number of runs that reach three or more records. An inner for loop walks down the rows and updates these counters, and an outer for loop repeats this for each column. Here is the code:

# subset columns
n_total_columns <- ncol(df)
df2 <- df[2:n_total_columns]

# initializing variables
n_rows <- nrow(df2)
n_cols <- ncol(df2)
total_counts <- rep(0,n_cols)
names(total_counts) <- names(df2)

# for each k#
for (j in 1:n_cols){
  cold_counts  <- 0
  for (i in 1:n_rows){
    # case for NA's
    if(is.na(df2[i,j])){
      # reset to zero
      cold_counts <- 0
    }else{
      if(df2[i,j] == "COLD"){
        #sum 1
        cold_counts <- cold_counts + 1
      }else{
        # reset to zero
        cold_counts <- 0
      }
    }
      
    # count each run exactly once, at the moment it reaches 3 COLD records
    if(cold_counts == 3){
      total_counts[j] <- total_counts[j] + 1
    }
  }
}

# your result
print(total_counts)

# or in a data.frame
df_final <- data.frame(k_j = names(total_counts),total_counts)
print(df_final)

If there were no NAs in your data, you could remove the if(is.na(df2[i,j])) check and keep only what is inside its else branch.
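The counter logic of the loop above can also be sketched as a small function that handles one column at a time. This is only an illustrative variant, not the answer's original code: the name count_cold_runs and the min_len argument are hypothetical, and it assumes the column is a character vector containing only "COLD", other strings, or NA.

```r
# Hypothetical helper (not part of the original answer): counts runs of at
# least `min_len` consecutive "COLD" values in a single vector.
count_cold_runs <- function(x, min_len = 3) {
  run_len <- 0  # length of the current COLD run
  n_runs  <- 0  # number of runs that have reached min_len
  for (value in x) {
    if (!is.na(value) && value == "COLD") {
      run_len <- run_len + 1
      # count each run once, at the moment it reaches min_len
      if (run_len == min_len) n_runs <- n_runs + 1
    } else {
      run_len <- 0  # NA or any other value breaks the run
    }
  }
  n_runs
}

# Two columns from the reproducible example above
k1 <- c(rep(NA, 5), rep("COLD", 4), rep(NA, 4), "COLD", NA, "COLD", rep(NA, 4))
k4 <- c(rep(NA, 3), rep("COLD", 5), rep(NA, 3), "COLD", rep(NA, 5), rep("COLD", 3))

result <- sapply(list(k1 = k1, k4 = k4), count_cold_runs)
print(result)  # k1 = 1, k4 = 2
```

With the full data frame, sapply(df[-1], count_cold_runs) should give the same totals as the nested loops, since it applies the same rule column by column.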

Upvotes: 1

denisafonin

Reputation: 1136

Using the example by @rodrigo-lustosa:

col_year <- c("1981-01-01 00:00:00","1981-01-02 00:00:00","1981-01-03 00:00:00","1981-01-04 00:00:00",
              "1981-01-05 00:00:00","1981-01-06 00:00:00","1981-01-07 00:00:00","1981-01-08 00:00:00",
              "1981-01-09 00:00:00","1981-01-10 00:00:00","1981-01-11 00:00:00","1981-01-12 00:00:00",
              "1981-01-13 00:00:00","1981-01-14 00:00:00","1981-01-15 00:00:00","1981-01-16 00:00:00",
              "1981-01-17 00:00:00","1981-01-18 00:00:00","1981-01-19 00:00:00","1981-01-20 00:00:00")


k1 <- c(rep(NA,5),rep("COLD",4),rep(NA,4),"COLD",NA,"COLD",rep(NA,4))
k2 <- c(rep(NA,10),rep("COLD",2),rep(NA,8))
k3 <- c(rep(NA,3),"COLD",rep(NA,16))
k4 <- c(rep(NA,3),rep("COLD",5),rep(NA,3),"COLD",rep(NA,5),rep("COLD",3))
k5 <- c(rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,8))

df <- data.frame(year=col_year,k1,k2,k3,k4,k5)

This for loop does the job:

for (col in colnames(df)[-1]) {
  vec <- df[[col]]
  n <- length(vec)
  total <- 0
  i <- 1
  while (i <= n) {
    if (is.na(vec[i]) || vec[i] != "COLD") {
      # not a COLD record: move to the next row
      i <- i + 1
    } else {
      # start of a COLD run: measure its length, including the first record
      run_len <- 0
      while (i <= n && !is.na(vec[i]) && vec[i] == "COLD") {
        run_len <- run_len + 1
        i <- i + 1
      }

      # only runs of 3 or more COLD records count as an event
      if (run_len >= 3) {
        total <- total + 1
      }
    }
  }
  print(paste0(col, ": ", total))
}

Output:

[1] "k1: 1"
[1] "k2: 0"
[1] "k3: 0"
[1] "k4: 2"
[1] "k5: 0"

Upvotes: 1

richarddmorey

Reputation: 1160

Use "run length encoding" (rle) on each column. Here's a function to do it (assuming that your data frame is df):

rle_col = function(k_col, num = 3){
  k_col[is.na(k_col)] = "NA" # convert NAs
  r = rle(k_col) # run length encoding
  which_cold = r$values == "COLD"
  sum(r$lengths[which_cold] >= num)
}

sapply(df[-1], rle_col)
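To see what run length encoding produces, here is a small worked example on the k4 column from the reproducible data in the earlier answer. The variable names k4_filled, cold_runs and n_events are only illustrative. The NA recoding step matters because rle() treats every NA as a separate run of length 1.

```r
# k4 from the reproducible example: COLD runs of length 5, 1 and 3
k4 <- c(rep(NA, 3), rep("COLD", 5), rep(NA, 3), "COLD", rep(NA, 5), rep("COLD", 3))

# Recode NAs to a placeholder string so consecutive NAs form a single run
k4_filled <- ifelse(is.na(k4), "NA", k4)

r <- rle(k4_filled)
print(r$values)   # "NA" "COLD" "NA" "COLD" "NA" "COLD"
print(r$lengths)  # 3 5 3 1 5 3

# Keep the lengths of the COLD runs only, then count those with 3 or more records
cold_runs <- r$lengths[r$values == "COLD"]
n_events  <- sum(cold_runs >= 3)
print(n_events)   # 2
```

The two qualifying runs (lengths 5 and 3) match the k4=2 given in the question; the isolated COLD on row 12 is dropped by the >= 3 filter.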

Upvotes: 2

Basti

Reputation: 1763

We can use the rle function, assuming df is your data frame:

apply(df[, 2:6], 2, function(x) {
  r <- rle(x)  # run length encoding of one column
  sum(r$lengths[which(r$values == "COLD")] >= 3)
})
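As a quick sanity check, the rle approach can be run on the example data and compared with the counts given in the question. This is a sketch under a few assumptions: the year column is left out for brevity, vapply is used instead of apply so the result type is fixed, and as.character() is added as a guard because rle() does not accept factor columns (the default for strings in data.frame() before R 4.0.0).

```r
# Rebuild the k columns of the example data (year column omitted here)
k1 <- c(rep(NA, 5), rep("COLD", 4), rep(NA, 4), "COLD", NA, "COLD", rep(NA, 4))
k2 <- c(rep(NA, 10), rep("COLD", 2), rep(NA, 8))
k3 <- c(rep(NA, 3), "COLD", rep(NA, 16))
k4 <- c(rep(NA, 3), rep("COLD", 5), rep(NA, 3), "COLD", rep(NA, 5), rep("COLD", 3))
k5 <- c(rep(NA, 3), "COLD", rep(NA, 3), "COLD", rep(NA, 3), "COLD", rep(NA, 8))
df_k <- data.frame(k1, k2, k3, k4, k5, stringsAsFactors = FALSE)

# Count runs of 3 or more "COLD" records per column.
# which() drops the NA comparisons that come from NA runs.
counts <- vapply(df_k, function(x) {
  r <- rle(as.character(x))
  sum(r$lengths[which(r$values == "COLD")] >= 3)
}, integer(1))

# Counts stated in the question
expected <- c(k1 = 1L, k2 = 0L, k3 = 0L, k4 = 2L, k5 = 0L)
stopifnot(all(counts == expected))
print(counts)
```

If stopifnot() passes silently, the computed counts agree with the expected k1=1, k2=0, k3=0, k4=2, k5=0.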

Upvotes: 1
