Reputation:
I have a data set with about 53,000 rows, and one of the columns measures the date and time the observation was taken at. During the data collection, somehow, the format of the date switched from mm/dd/year to dd/mm/year, and this happened several times. However, I know the exact row where each of these errors occurred, so I was wondering if there is an efficient way to change them all to a single format (I want them in mm/dd/year since that is what most of the observations are in).
Here is an example of the problem I am talking about, going from dd/mm/year to mm/dd/year:
data.test = data.frame(Date = c("16/11/2017 8:45 AM", "16/11/2017 9:00 AM",
"11/16/2017 9:15 AM", "11/16/2017 9:30 AM"), Observed = c(100, 23, 291, 30))
data.test
Date Observed
16/11/2017 8:45 AM 100
16/11/2017 9:00 AM 23
11/16/2017 9:15 AM 291
11/16/2017 9:30 AM 30
In my data set, the dates are characters, while the rest of the rows are numbers.
Upvotes: 0
Views: 855
Reputation: 3000
Here is an interesting attempt at this. Assuming that your first observation is in the correct format, a bit of code is worked in to check for the most plausible date format based on what month the date is supposed to be in. How it deals with months changing I am not entirely sure yet.
The thought is as long as the first observation is correct it will carry forward in the right format until an ambiguous date is hit. At that point it is checked against a previous observation's month that is supposed to be right. If they match then the current ambiguous observation was predicted correctly, if not it selects the other format in place. Again when months switch and an ambiguous hits needs some work but I am too lazy (it's Friday)
data.test = data.frame(Date = c("9/8/2017 8:30 AM","8/9/2017 8:45 AM", "16/11/2017 9:00 AM", "11/16/2017 9:15 AM", "11/16/2017 9:30 AM"), Observed = c(100, 23, 291, 30, 45))
Date1<-
as.POSIXct(data.test$Date, format="%m/%d/%Y %H:%M %p") # search for format1
Date2<-
as.POSIXct(data.test$Date, format="%d/%m/%Y %H:%M %p") # search for format2
# Replace data.test Date Column with Date1, leaving NAs
data.test$Date<-Date1
#Check most plausible date format.
for(i in 1:length(Date1)){
if(is.na(Date1[i])==F && is.na(Date2[i])==F && i!= 1){
print(paste("row",i , "ambigious format"))
if(month(Date1[i-1])==month(Date1[i])){
print("Date Estimated from mm/dd/YYYY format based on previous")
}else{
Date1[i]<-Date2[i]
print("Date Estimated from dd/mm/YYYY format based on previous")
}
}else{}
}
# Replace NAs in data.test with index from Date2, line up
data.test$Date[is.na(data.test$Date)] <- Date2[is.na(data.test$Date)]
> data.test
Date Observed
1 2017-09-08 08:30:00 100
2 2017-09-08 08:45:00 23
3 2017-11-16 09:00:00 291
4 2017-11-16 09:15:00 30
5 2017-11-16 09:30:00 45
Upvotes: 0
Reputation: 20095
I'll prefer to use lubridate::parse_date_time
function in such case to handle heterogeneous format of date/time in same column
# Sample data.frame
# I have modified one date from sample used by OP to include both AM and PM
data.test = data.frame(Date = c("16/11/2017 8:45 AM", "16/11/2017 9:00 PM",
"11/16/2017 9:15 AM", "11/16/2017 9:30 AM"), Observed = c(100, 23, 291, 30))
#modified column added in consistent format.
library(lubridate)
data.test$modifeddatetime <- parse_date_time(data.test$Date, c("dmY HM p", "mdY HM p"))
#Change the modfieddatetime column back in character in desired format
data.test$modifeddatetime <- format(data.test$modifeddatetime, format="%m/%d/%Y %H-%M")
data.test
# Date Observed modifeddatetime
# 1 16/11/2017 8:45 AM 100 11/16/2017 08-45
# 2 16/11/2017 9:00 PM 23 11/16/2017 21-00
# 3 11/16/2017 9:15 AM 291 11/16/2017 09-15
# 4 11/16/2017 9:30 AM 30 11/16/2017 09-30
Upvotes: 1
Reputation: 263441
You say you know the location (presumably a range of row numbers) where the changes need to occur, but you didn't give those in your example, possibly because it was obvious. But it would not be obvious in cases where both mm and dd were less than 13.
data.test$Date <- as.character(data.test$Date) # prevent factor issues
dd_mm <- 1:2 # the rows to be changed
repl <-format( as.POSIXct( data.test$Date[dd_mm], format="%d/%m/%Y %H:%M %p"), format="%m/%d/%Y %H:%M %p" )
data.test$Date[dd_mm] <- repl
data.test
#-------------------------------------------
Date Observed
1 11/16/2017 08:45 AM 100
2 11/16/2017 09:00 AM 23
3 11/16/2017 9:15 AM 291
4 11/16/2017 9:30 AM 30
Upvotes: 0