Lyndz

Reputation: 423

Comparing two CSV files in R with some conditions

I have two csv files:

File 1:

Year,Month,Day,Stn1,Stn2,Stn3
1979,01,01,10,0,5
1979,01,02,10,1,5
1979,01,03,0,0,0
1979,01,04,5,10,30
1979,01,05,0,1,3

File 2:

Year,Month,Day
1979,01,02
1979,01,04
1979,01,05

File 1 contains daily data from 1979 to 2000, while File 2 contains an arbitrary, non-continuous set of dates.

What I want:

[1] Get the dates common to File 1 and File 2 for which ANY of the station columns (Stn1 to Stn3) has a value greater than or equal to 20. Then, save the output to a file.

In the above example, the output file should contain the following date:

Year,Month,Day,Stn1,Stn2,Stn3
1979,01,04,5,10,30

This row qualifies because Stn3 has a value of 30.

What I have so far:

I can get the common dates, even with a simple bash command. Unfortunately, I don't know how to keep only the common dates that satisfy the condition. I was wondering how to do this in R.

I'll appreciate any help on this matter.

-- Lyndz

Upvotes: 1

Views: 1931

Answers (2)

Len Greski

Reputation: 10855

There are many ways to do this in R. Using Base R, we can merge() the two files, using the [ form of the extract operator to include only rows from df1 where Stn3 >= 20.

textFile1 <- "Year,Month,Day,Stn1,Stn2,Stn3
1979,01,01,10,0,5
1979,01,02,10,1,5
1979,01,03,0,0,0
1979,01,04,5,10,30
1979,01,05,0,1,3"

textFile2 <- "Year,Month,Day
1979,01,02
1979,01,04
1979,01,05"

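# Read both inputs; Month and Day are parsed as integers, so "01" becomes 1
df1 <- read.csv(text = textFile1, header = TRUE)
df2 <- read.csv(text = textFile2, header = TRUE)

# Keep rows of df1 where Stn3 >= 20, then inner-join with df2 on the date columns
merge(x = df1[df1$Stn3 >= 20, ], y = df2, by = c("Year", "Month", "Day"))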

...and the output:

> merge(x = df1[df1$Stn3 >=20,],y = df2, by = c("Year","Month","Day"))
  Year Month Day Stn1 Stn2 Stn3
1 1979     1   4    5   10   30
> 

By slightly editing the code, we can check for any value in Stn1 to Stn3 at or above 20.

merge(x = df1[rowSums((df1[4:6] >=20))>0,],y = df2, by = c("Year","Month","Day"))

The key change is rowSums((df1[4:6] >=20))>0 in the row dimension of the extract operator for df1. Evaluating this step by step, df1[4:6] >= 20 is evaluated first and returns a matrix of logical values.

> df1[4:6] >=20
      Stn1  Stn2  Stn3
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE  TRUE
[5,] FALSE FALSE FALSE

Next, rowSums() counts the TRUE values in each row, producing a vector with one element per row of df1. Comparing that vector with 0 gives the logical index used to subset df1 before merging with df2.

> rowSums((df1[4:6] >=20))
[1] 0 0 0 1 0
>  

Since only one observation has a value >= 20 in Stn1 to Stn3, we obtain the same result as shown above.
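The positional index 4:6 works for this three-station example, but it breaks if the column order changes or more stations are added. A minimal sketch of the same rowSums() filter that selects the station columns by name instead; it assumes every station column name starts with "Stn", and the names stn_cols, keep and result are only illustrative:

```r
# Same toy data as in the example above
df1 <- read.csv(text = "Year,Month,Day,Stn1,Stn2,Stn3
1979,01,01,10,0,5
1979,01,02,10,1,5
1979,01,03,0,0,0
1979,01,04,5,10,30
1979,01,05,0,1,3")
df2 <- read.csv(text = "Year,Month,Day
1979,01,02
1979,01,04
1979,01,05")

# Assumption: every station column name starts with "Stn"
stn_cols <- grep("^Stn", names(df1), value = TRUE)

# TRUE for rows where at least one station value is >= 20;
# na.rm = TRUE makes missing values count as "not >= 20"
keep <- rowSums(df1[stn_cols] >= 20, na.rm = TRUE) > 0

# Inner join of the filtered rows with the dates in df2
result <- merge(df1[keep, ], df2, by = c("Year", "Month", "Day"))
```

Whether missing station values should be ignored, as na.rm = TRUE does here, depends on the data; drop that argument if a row with an NA should instead be flagged.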

If we alter the data so that a second row also has a station value >= 20 (here, Stn2 on 1979-01-02 is set to 50) and merge again, merge() returns two rows.

# now update the data so that a second row has a value >= 20
df1[2,5] <- 50
merge(x = df1[rowSums((df1[4:6] >=20))>0,],y = df2, by = c("Year","Month","Day"))

...and the output:

> merge(x = df1[rowSums((df1[4:6] >=20))>0,],y = df2, by = c("Year","Month","Day"))
  Year Month Day Stn1 Stn2 Stn3
1 1979     1   2   10   50    5
2 1979     1   4    5   10   30
> 
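The question also asks to read the data from CSV files and save the result to a file, which the snippets above do not show. A rough sketch of that round trip with read.csv() and write.csv(), using the original, unmodified example data. The file names file1.csv, file2.csv and common_dates.csv are hypothetical, and the files are created in tempdir() only so that the sketch runs on its own:

```r
# Hypothetical file names; in practice these would be the real paths
file1    <- file.path(tempdir(), "file1.csv")
file2    <- file.path(tempdir(), "file2.csv")
out_file <- file.path(tempdir(), "common_dates.csv")

# Write the example inputs so the sketch is self-contained
writeLines(c("Year,Month,Day,Stn1,Stn2,Stn3",
             "1979,01,01,10,0,5",
             "1979,01,02,10,1,5",
             "1979,01,03,0,0,0",
             "1979,01,04,5,10,30",
             "1979,01,05,0,1,3"), file1)
writeLines(c("Year,Month,Day",
             "1979,01,02",
             "1979,01,04",
             "1979,01,05"), file2)

df1 <- read.csv(file1)
df2 <- read.csv(file2)

# Rows of df1 with any station >= 20 whose date also appears in df2
common <- merge(df1[rowSums(df1[4:6] >= 20) > 0, ], df2,
                by = c("Year", "Month", "Day"))

# row.names = FALSE keeps R's row index out of the output file
write.csv(common, out_file, row.names = FALSE)
```

Because read.csv() parses Month and Day as integers, the output contains 1 and 4 rather than 01 and 04. If the leading zeros matter, reading the date columns as character via the colClasses argument of read.csv() is one way to keep them.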

Upvotes: 2

Alexis

Reputation: 2294

Try this code:

library(tidyverse)
dataset <- data.frame(Year = c("1979","1979","1979","1979","1979"),
                      Month = c("01","01","01","01","01"),
                      Day = c("01","02","03","04","05"),
                      Stn1 = c(10,10,0,5,0),
                      Stn2 = c(0,1,0,10,1),
                      Stn3 = c(5,5,0,30,3),
                      stringsAsFactors = FALSE)

dataset <- dataset %>% mutate(date = paste0(Year,Month,Day))   
filterdata <- data.frame(Year = c("1979","1979","1979"),
                         Month = c("01","01","01"),
                         Day = c("02","04","05"),
                         stringsAsFactors = FALSE)
filterdata <- filterdata %>% mutate(date = paste0(Year,Month,Day))

dataset %>%
  semi_join(filterdata, by = "date") %>%
  filter(Stn1 >= 20 | Stn2 >= 20 | Stn3 >= 20) %>%
  select(-date)

The filter() call combines the three station tests with the OR operator (|), so a row is kept when any station is at or above 20.
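With many stations, writing one condition per column gets tedious. A possible variant, assuming dplyr 1.0.4 or later (where if_any() is available) and station columns whose names start with "Stn"; it also joins on the three date columns directly, so the helper date column is not needed. The name result is only illustrative:

```r
library(dplyr)

dataset <- data.frame(Year = rep("1979", 5),
                      Month = rep("01", 5),
                      Day = c("01", "02", "03", "04", "05"),
                      Stn1 = c(10, 10, 0, 5, 0),
                      Stn2 = c(0, 1, 0, 10, 1),
                      Stn3 = c(5, 5, 0, 30, 3),
                      stringsAsFactors = FALSE)
filterdata <- data.frame(Year = rep("1979", 3),
                         Month = rep("01", 3),
                         Day = c("02", "04", "05"),
                         stringsAsFactors = FALSE)

result <- dataset %>%
  # keep only the dates that also appear in filterdata
  semi_join(filterdata, by = c("Year", "Month", "Day")) %>%
  # keep rows where any column starting with "Stn" is >= 20
  filter(if_any(starts_with("Stn"), ~ .x >= 20))
```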

Regards.

Upvotes: 1
