Vinc
Vinc

Reputation: 41

[r]Change objects into date format: Character (0) is the outcome + How to deal with missing values

I have searched stack overflow for this problem and there are a couple of solutions but none of them work for me as most solutions only work for short and complete data sets. I have two columns that are both 481029 rows long. They both contain dates in the form of dd.mm.YYYY. One is the order date and one is the delivery date. The aim is to calculate the difference (=deliverydate - orderdate). The problem is that some of the dates are missing in both columns. So sometimes there is a delivery date but no corresponding order date and vice versa. 1. How do I handle this? 2. How can I convert the data into the date form? My results are:

ddate <- data.frame(data$deliveryDate)
> summary(ddate)
  data.deliveryDate 
?         : 39419  
2013-04-04:  5285  
2012-07-03:  5079  
1990-12-31:  4660  
2013-01-01:  4585  
2013-04-09:  4565  
(Other)   :417499  
> class(ddate)
[1] "data.frame"
> ddate <- factor()
> as.Date(ddate, format= "%d.%m.%Y")
character(0)
> mean(ddate-odate)
[1] NaN
Warning message:
In Ops.factor(ddate, odate) : ‘-’ ist nicht sinnvoll für Faktoren
> class(ddate)
[1] "factor"
> ddate[1]
[1] <NA>
as.POSIXct(ddate)
character(0)
> ddate[1]
[1] <NA>
Levels: 
> as.Date(ddate, format= "%d.%m.%Y")
character(0)
>class(odate)
[1] "factor"
> as.Date(ddate, format = "%d.%m.%Y")
character(0)
> class(ddate)
[1] "factor"
> ddate <- as.Date(ddate, format = "%d.%m.%Y")
> class(ddate)
[1] "Date"
> odate <- as.Date(odate, format = "%d.%m.%Y")
> summary(odate)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 NA      NA      NA      NA      NA      NA 
> head(odate)
character(0)
> mean(ddate-odate)
Time difference of NaN days
> ddate
character(0)
> str(ddate)
Class 'Date'  num(0) 
> str(odate)
Class 'Date'  num(0) 
> difftime(ddate,odate, units = ("days"))
Time difference of  days

I hope that this is not too messy and confusing. I just tried many things but I don't get it. The outcome is alway NA or something like this... Thank you all!

Upvotes: 0

Views: 248

Answers (2)

Vinc
Vinc

Reputation: 41

Thank you all for your help. This is what the final result looks like:

 > data<-read.table("datatxt.txt", header=TRUE, sep = ";", stringsAsFactors = F, na.strings='?') # read the file and treat all "?" as NA 
> dates <- data[,c('orderDate','deliveryDate')] # only use the orderDate and deliveryDate columns from the data set "data" 

> final_dates <- dates[complete.cases(dates),] # only use the rows which are complete in both columns
> final_dates$deliveryDate <- as.Date(final_dates$deliveryDate) #define as Date
> final_dates$orderDate <- as.Date(final_dates$orderDate) # define as Date
> deliverytime <- difftime(final_dates$deliveryDate, final_dates$orderDate, units = c("days")) # calculate the difference of the dates in days
> deliverytime_wo_1990 <- subset(deliverytime, deliverytime>0) # only show positive days. To get rid of wrong data where the deliverydate is before the orderdate. 
> mean(deliverytime_wo_1990) # finally it is possible to calculate for example the mean!

Thanks everybody. I finally made it.

Upvotes: 0

Kunal Puri
Kunal Puri

Reputation: 3427

First of all, follow lmo 's advice to read all columns as characters.

Coming to your first question, you can get rid of ? using this:

data$deliveryDate <- ifelse(data$deliveryDate == '?',NA,data$deliveryDate)

data$orderDate <- ifelse(data$orderDate == '?',NA,data$orderDate)

indices <- complete.cases(data)

final.data <- data[indices,]

If you apply this thing, only those rows will appear in the dataset final.data which have both delivery and order date.

You need not specify the format in as.Date here.

Plus if you have shown the deliveryDate values correctly, it is not of the format dd.mm.YYYY

At the end, use Phann's advice to find the difference

Update:

That's the code you are required to write:

## now, no need to check for '?' as it will be treated as NA .
data<-read.table("datatxt.txt", header=TRUE, sep = ";", stringsAsFactors = F, na.strings='?') 

final_data <- data[complete.cases(data),]

final_data$deliveryDate <- as.Date(final_data$deliveryDate)

final_data$orderDate <- as.Date(final_data$orderDate)

After this, you can apply the remaining operations.

If efficiency is required, you can use this code:

library(data.table)
## now, no need to check for '?' as it will be treated as NA .
data<-fread("datatxt.txt", na.strings='?') 

final_data <- data[complete.cases(data),]

final_data[,deliveryDate := as.Date(deliveryDate)]

final_data[,orderDate := as.Date(orderDate)]

Upvotes: 1

Related Questions