Reputation: 1471
Assuming we have a data frame (the third column is a date column) that contains observations of irregular events starting from Jan 2000 to Oct 2011. The goal is to choose those rows of the data frame that contain observations between two dates
start<-"2005/09/30"
end<-"2011/01/31"
The original data frame contains about 21 000 rows. We can verify this using
length(df_original$date_column)
.
We now create a new data frame that contains dates newer than the start date:
df_new<-df_original[df_original$date_column>start,]
If I check the length using length(df_new$date_column)
it shows about 13 000 for the length.
Now we create another data frame applying the second criterion (smaller than end date):
df_new2<-df_new[df_new$date_column<end,]
If I check again the length using length(df_new2$date_column)
it shows about 19 000 counts for the length.
How is it possible that by applying a second criterion on the new data frame df_new
the number of rows increases? The df_new
should have a number of rows equal or below the 13 000.
The data frame is quite large such that I cannot post it here. Maybe someone can provide a reason under which circumstances this behavior occurs.
Upvotes: 0
Views: 1299
Reputation: 60746
can you give us dput(head(df_original))
? which shares with us the first 5 records and their data structure. I am suspicious something is up with the format of your date_column
.
If you are storing start
and end
both as strings (which your example seems to indicate) and the date column is also a string, then you will not be able to use <
or >
to compare values of dates. So somewhere you need to validate that everything being compared is known by R to be dates.
Upvotes: 1
Reputation: 22588
The following example works fine for me:
df_original = data.frame(date_column = seq(as.Date('2000/01/01'), Sys.Date(), by=1), value = 1)
start = as.Date('2005/09/30')
end = as.Date('2011/01/31')
df_new = df_original[df_original$date_column>start,]
df_new2 = df_new[df_new$date_column<end,]
> dim(df_original)
[1] 4316 2
> dim(df_new)
[1] 2216 2
> dim(df_new2)
[1] 1948 2
Without seeing an example of your actual data, I would suggest 2 things to look out for:
Upvotes: 2
Reputation: 176668
Can you get the results you want via one subset command?
df_new <- df_original[with(df_original, date_column>start & date_column<end),]
# or
df_new <- subset(df_original, date_column>start & date_column<end)
Upvotes: 1