Javier_Caceres

Reputation: 48

R returns fewer rows than expected after subsetting a data frame by date

I am working on a data frame of past baseball players with playing records from 1871 to 2016. The data frame is named Master.new. I am trying to keep only the players who debuted after 1903 (the column is named "debut"). After running the code to drop the unwanted rows, the new data frame has lost more rows than it should have, which I can tell from the output of the summary command.

What I have done is

  1. Make sure the class for the "debut" column is "Date"

    sapply(Master.new, class)
    

and this is the result

   playerID   birthYear       debut   finalGame 
"character"   "integer"      "Date"      "Date" 

  2. Ran the summary command to check the structure. Here I can see the range of dates and verified that the earliest date is in the year 1871 (19105 observations).

    summary(Master.new)

and this is the result

 playerID           birthYear        debut              finalGame         
 Length:19105       Min.   :1820   Min.   :1871-05-04   Min.   :1871-05-05  
 Class :character   1st Qu.:1895   1st Qu.:1919-04-24   1st Qu.:1923-04-29  
 Mode  :character   Median :1937   Median :1961-06-09   Median :1966-10-02  
                    Mean   :1931   Mean   :1956-02-23   Mean   :1960-12-20  
                    3rd Qu.:1969   3rd Qu.:1995-04-26   3rd Qu.:2000-09-29  
                    Max.   :1996   Max.   :2016-10-02   Max.   :2016-10-02  
                    NA's   :132    NA's   :195          NA's   :195  

  3. Ran the command below to keep only the records whose "debut" value is on or after 1903-01-01, creating a new data frame called Master.new.debut. It has 7899 observations, fewer than the 19105 in Master.new, which seemed logical because I am removing rows from before 1903.

    Master.new.debut <- Master.new[Master.new $debut >= 1903-01-01,]

  4. Ran the summary command on the new data frame Master.new.debut.

    summary(Master.new.debut)
    

Below is what I received. I expected the earliest records to be in the year 1903. Instead, the Min. value is in the year 1975. What I need help with is figuring out why my first record is not in 1903, and what happened to all the records between 1903 and 1975.

Thank you, Javier

 
 playerID           birthYear        debut              finalGame         
 Length:7899        Min.   :1946   Min.   :1975-04-07   Min.   :1975-04-21  
 Class :character   1st Qu.:1964   1st Qu.:1988-09-02   1st Qu.:1994-07-31  
 Mode  :character   Median :1974   Median :1999-05-14   Median :2005-04-21  
                    Mean   :1973   Mean   :1998-04-21   Mean   :2003-04-21  
                    3rd Qu.:1983   3rd Qu.:2008-07-13   3rd Qu.:2013-09-29  
                    Max.   :1996   Max.   :2016-10-02   Max.   :2016-10-02  
                    NA's   :195    NA's   :195          NA's   :195  

Upvotes: 1

Views: 60

Answers (1)

MKR

Reputation: 20095

The problem is in the line below:

    Master.new.debut <- Master.new[Master.new $debut >= 1903-01-01,]

The reason is that the unquoted 1903-01-01 is evaluated as arithmetic: 1903 - 1 - 1 = 1901. A Date is stored internally as the number of days since 1970-01-01, so the >= operation compares each date's day offset against the number 1901. 1901 days is roughly 5.2 years after 1970-01-01, which is why only dates from 1975 onward are kept.
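
A quick way to check this in a console, using only base R (the variable name cutoff is just for illustration):

    # Unquoted, 1903-01-01 is plain subtraction, not a date.
    cutoff <- 1903-01-01
    cutoff                                    # 1901

    # A Date is stored as the number of days since 1970-01-01.
    unclass(as.Date("1970-01-01"))            # 0

    # So this is the date the original comparison actually used as its threshold.
    as.Date(cutoff, origin = "1970-01-01")    # "1975-03-17"

That threshold lines up with the Min. of 1975-04-07 shown in the question's second summary.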

Please change the line to:

    Master.new.debut <- Master.new[Master.new $debut >= '1903-01-01',]

to force a date comparison. You can also convert the string literal to a date explicitly with the as.Date() function, for example as.Date('1903-01-01').
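
As a minimal sketch of the as.Date() variant, here is a small made-up data frame standing in for Master.new (the player IDs and dates are invented for illustration). It also shows one thing to be aware of: with logical indexing, rows whose debut is NA come back as all-NA rows, which would be consistent with the 195 NA's in the question's second summary. Wrapping the condition in which() is one way to drop them, if that is what you want.

    # Toy stand-in for Master.new; the IDs and dates are made up.
    toy <- data.frame(
      playerID = c("a01", "b02", "c03", "d04"),
      debut    = as.Date(c("1871-05-04", "1903-04-15", "1980-06-01", NA)),
      stringsAsFactors = FALSE
    )

    cutoff <- as.Date("1903-01-01")

    # Logical indexing returns a row of NAs wherever the comparison is NA.
    with_na <- toy[toy$debut >= cutoff, ]
    nrow(with_na)      # 3: two real matches plus one all-NA row

    # which() keeps only the positions where the comparison is TRUE.
    no_na <- toy[which(toy$debut >= cutoff), ]
    no_na$playerID     # "b02" "c03"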

Upvotes: 0
