Reputation: 48
I am working on a data frame of past baseball players with playing records from 1871 to 2016. The data frame is named Master.new. I want to keep only the players who debuted after 1903 (the column is named "debut"). After running the code to delete the unwanted rows, the new data frame lost more rows than it should have; I can tell from the output of the summary command.
What I have done:
1. Made sure the class of the "debut" column is "Date"
sapply(Master.new, class)
and this is the result
playerID birthYear debut finalGame
"character" "integer" "Date" "Date"
2. Ran the summary command to check the structure. Here I can see the range of dates, and I verified that the first date is in 1871 (19105 observations).
summary(Master.new)
and this is the result
playerID birthYear debut finalGame
Length:19105 Min. :1820 Min. :1871-05-04 Min. :1871-05-05
Class :character 1st Qu.:1895 1st Qu.:1919-04-24 1st Qu.:1923-04-29
Mode :character Median :1937 Median :1961-06-09 Median :1966-10-02
Mean :1931 Mean :1956-02-23 Mean :1960-12-20
3rd Qu.:1969 3rd Qu.:1995-04-26 3rd Qu.:2000-09-29
Max. :1996 Max. :2016-10-02 Max. :2016-10-02
NA's :132 NA's :195 NA's :195
3. Ran the command to select only the rows whose "debut" is on or after 1903-01-01, creating a new data frame called Master.new.debut. It has 7899 observations, fewer than the 19105 in Master.new, which seems logical because I am eliminating rows from years before 1903.
Master.new.debut <- Master.new[Master.new $debut >= 1903-01-01,]
I then ran the summary command on the new Data Frame Master.new.debut
summary(Master.new.debut)
Below is what I received. I expected the earliest records to be from 1903. Instead, the Min. value of debut is in 1975. I need help figuring out why my first record isn't in 1903, and what happened to all the records between 1903 and 1975.
Thank you Javier
playerID birthYear debut finalGame
Length:7899 Min. :1946 Min. :1975-04-07 Min. :1975-04-21
Class :character 1st Qu.:1964 1st Qu.:1988-09-02 1st Qu.:1994-07-31
Mode :character Median :1974 Median :1999-05-14 Median :2005-04-21
Mean :1973 Mean :1998-04-21 Mean :2003-04-21
3rd Qu.:1983 3rd Qu.:2008-07-13 3rd Qu.:2013-09-29
Max. :1996 Max. :2016-10-02 Max. :2016-10-02
NA's :195 NA's :195 NA's :195
Upvotes: 1
Views: 60
Reputation: 20095
The problem is in the line below:
Master.new.debut <- Master.new[Master.new $debut >= 1903-01-01,]
The reason is that the unquoted 1903-01-01 is evaluated as arithmetic, 1903 - 1 - 1, which gives the number 1901. A Date is stored internally as the number of days since 1970-01-01, so the >= operation compares that day count against 1901. 1901 days is a little over 5 years, which corresponds to 1975-03-17. Hence the result only contains debuts from 1975 onward.
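As a quick check of the explanation above, this small sketch evaluates the unquoted expression and turns the result back into a date. The variable names threshold and cutoff are made up for illustration.

```r
# Unquoted, 1903-01-01 is plain arithmetic: 1903 - 1 - 1
threshold <- 1903-01-01
threshold                            # 1901

# A Date is stored as the number of days since 1970-01-01
as.numeric(as.Date("1970-01-01"))    # 0

# Reading 1901 as a day count gives the cutoff that was actually applied
cutoff <- as.Date(threshold, origin = "1970-01-01")
cutoff                               # "1975-03-17"
```

This lines up with the Min. of 1975-04-07 in the question's second summary: it is the first debut on or after 1975-03-17.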
Please change the line to:
Master.new.debut <- Master.new[Master.new $debut >= '1903-01-01',]
This forces a date comparison, because R converts the character string to a Date before comparing. You can also use the as.Date() function to convert the string literal to a Date explicitly.
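Here is a minimal sketch of the as.Date() variant on a toy data frame. The data frame toy and its rows are invented stand-ins for Master.new, not the real data. The !is.na() check is an optional extra that is not part of the fix itself: subsetting with a logical vector that contains NA returns all-NA rows, which is probably why the NA's still show up in the second summary.

```r
# Toy stand-in for Master.new (made-up rows, not the real data)
toy <- data.frame(
  playerID = c("a01", "b02", "c03", "d04"),
  debut    = as.Date(c("1871-05-04", "1903-04-20", "1975-04-07", NA)),
  stringsAsFactors = FALSE
)

# Build the cutoff as a real Date instead of an arithmetic expression
cutoff <- as.Date("1903-01-01")

# Date-to-Date comparison; !is.na() drops rows with a missing debut,
# which would otherwise come back as all-NA rows
toy.debut <- toy[!is.na(toy$debut) & toy$debut >= cutoff, ]

toy.debut$playerID     # "b02" "c03"
min(toy.debut$debut)   # "1903-04-20"
```

The same pattern should carry over to the real data by replacing toy with Master.new.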
Upvotes: 0