user10394751
user10394751

Reputation:

Create Uniform Date across Varying Formats

I have a data set with about 53,000 rows, and one of the columns measures the date and time the observation was taken at. During the data collection, somehow, the format of the date switched from mm/dd/year to dd/mm/year, and this happened several times. However, I know the exact row where each of these errors occurred, so I was wondering if there is an efficient way to change them all to a single format (I want them in mm/dd/year since that is what most of the observations are in).

Here is an example of the problem I am talking about, going from dd/mm/year to mm/dd/year:

data.test = data.frame(Date = c("16/11/2017 8:45 AM", "16/11/2017 9:00 AM", 
"11/16/2017 9:15 AM", "11/16/2017 9:30 AM"), Observed = c(100, 23, 291, 30))
data.test

Date                Observed
16/11/2017 8:45 AM  100         
16/11/2017 9:00 AM  23          
11/16/2017 9:15 AM  291         
11/16/2017 9:30 AM  30  

In my data set, the dates are characters, while the rest of the rows are numbers.

Upvotes: 0

Views: 855

Answers (3)

Chabo
Chabo

Reputation: 3000

Here is an interesting attempt at this. Assuming that your first observation is in the correct format, a bit of code is worked in to check for the most plausible date format based on what month the date is supposed to be in. How it deals with months changing I am not entirely sure yet.

The thought is as long as the first observation is correct it will carry forward in the right format until an ambiguous date is hit. At that point it is checked against a previous observation's month that is supposed to be right. If they match then the current ambiguous observation was predicted correctly, if not it selects the other format in place. Again when months switch and an ambiguous hits needs some work but I am too lazy (it's Friday)

data.test = data.frame(Date = c("9/8/2017 8:30 AM","8/9/2017 8:45 AM", "16/11/2017 9:00 AM", "11/16/2017 9:15 AM", "11/16/2017 9:30 AM"), Observed = c(100, 23, 291, 30, 45))

Date1<-
as.POSIXct(data.test$Date, format="%m/%d/%Y %H:%M %p") # search for format1
Date2<-
as.POSIXct(data.test$Date, format="%d/%m/%Y %H:%M %p") # search for format2

# Replace data.test Date Column with Date1, leaving NAs
data.test$Date<-Date1


#Check most plausible date format.
for(i in 1:length(Date1)){
  if(is.na(Date1[i])==F  && is.na(Date2[i])==F && i!= 1){
    print(paste("row",i , "ambigious format"))
      if(month(Date1[i-1])==month(Date1[i])){
         print("Date Estimated from mm/dd/YYYY format based on previous")
      }else{
         Date1[i]<-Date2[i]
         print("Date Estimated from dd/mm/YYYY format based on previous")
}
}else{}
}



# Replace NAs in data.test with index from Date2, line up
data.test$Date[is.na(data.test$Date)] <- Date2[is.na(data.test$Date)]

> data.test
                 Date Observed
1 2017-09-08 08:30:00      100
2 2017-09-08 08:45:00       23
3 2017-11-16 09:00:00      291
4 2017-11-16 09:15:00       30
5 2017-11-16 09:30:00       45

Upvotes: 0

MKR
MKR

Reputation: 20095

I'll prefer to use lubridate::parse_date_time function in such case to handle heterogeneous format of date/time in same column

# Sample data.frame
# I have modified one date from sample used by OP to include both AM and PM 
data.test = data.frame(Date = c("16/11/2017 8:45 AM", "16/11/2017 9:00 PM", 
               "11/16/2017 9:15 AM", "11/16/2017 9:30 AM"), Observed = c(100, 23, 291, 30))

#modified column added in consistent format. 
library(lubridate)
data.test$modifeddatetime <- parse_date_time(data.test$Date, c("dmY HM p", "mdY HM p"))

#Change the modfieddatetime column back in character in desired format
data.test$modifeddatetime <- format(data.test$modifeddatetime, format="%m/%d/%Y %H-%M")


data.test
#                 Date Observed  modifeddatetime
# 1 16/11/2017 8:45 AM      100 11/16/2017 08-45
# 2 16/11/2017 9:00 PM       23 11/16/2017 21-00
# 3 11/16/2017 9:15 AM      291 11/16/2017 09-15
# 4 11/16/2017 9:30 AM       30 11/16/2017 09-30

Upvotes: 1

IRTFM
IRTFM

Reputation: 263441

You say you know the location (presumably a range of row numbers) where the changes need to occur, but you didn't give those in your example, possibly because it was obvious. But it would not be obvious in cases where both mm and dd were less than 13.

data.test$Date <- as.character(data.test$Date) # prevent factor issues
dd_mm <- 1:2 # the rows to be changed
repl <-format( as.POSIXct( data.test$Date[dd_mm], format="%d/%m/%Y %H:%M %p"), format="%m/%d/%Y %H:%M %p" )
data.test$Date[dd_mm] <- repl
data.test
#-------------------------------------------
                 Date Observed
1 11/16/2017 08:45 AM      100
2 11/16/2017 09:00 AM       23
3  11/16/2017 9:15 AM      291
4  11/16/2017 9:30 AM       30

Upvotes: 0

Related Questions