John
John

Reputation: 1471

R data frame - row number increases

Assuming we have a data frame (the third column is a date column) that contains observations of irregular events starting from Jan 2000 to Oct 2011. The goal is to choose those rows of the data frame that contain observations between two dates

start<-"2005/09/30"
end<-"2011/01/31"  

The original data frame contains about 21 000 rows. We can verify this using length(df_original$date_column).

We now create a new data frame that contains dates newer than the start date:

df_new<-df_original[df_original$date_column>start,]

If I check the length using length(df_new$date_column) it shows about 13 000 for the length.

Now we create another data frame applying the second criterion (smaller than end date):

df_new2<-df_new[df_new$date_column<end,]

If I check again the length using length(df_new2$date_column) it shows about 19 000 counts for the length.

How is it possible that by applying a second criterion on the new data frame df_new the number of rows increases? The df_new should have a number of rows equal or below the 13 000.

The data frame is quite large such that I cannot post it here. Maybe someone can provide a reason under which circumstances this behavior occurs.

Upvotes: 0

Views: 1299

Answers (3)

JD Long
JD Long

Reputation: 60746

can you give us dput(head(df_original))? which shares with us the first 5 records and their data structure. I am suspicious something is up with the format of your date_column.

If you are storing start and end both as strings (which your example seems to indicate) and the date column is also a string, then you will not be able to use < or > to compare values of dates. So somewhere you need to validate that everything being compared is known by R to be dates.

Upvotes: 1

John Colby
John Colby

Reputation: 22588

The following example works fine for me:

df_original = data.frame(date_column = seq(as.Date('2000/01/01'), Sys.Date(), by=1), value = 1)

start = as.Date('2005/09/30')
end   = as.Date('2011/01/31')

df_new = df_original[df_original$date_column>start,]

df_new2 = df_new[df_new$date_column<end,]
> dim(df_original)
[1] 4316    2
> dim(df_new)
[1] 2216    2
> dim(df_new2)
[1] 1948    2

Without seeing an example of your actual data, I would suggest 2 things to look out for:

  1. Make sure your dates are coded as dates.
  2. Make sure you aren't accidentally indexing by row name. This is a common culprit for the behavior you're talking about.

Upvotes: 2

Joshua Ulrich
Joshua Ulrich

Reputation: 176668

Can you get the results you want via one subset command?

df_new <- df_original[with(df_original, date_column>start & date_column<end),]
# or
df_new <- subset(df_original, date_column>start & date_column<end)

Upvotes: 1

Related Questions