watINwatOUT
watINwatOUT

Reputation: 25

R: Trouble converting string to proper date time format after exporting data from SQL

I will admit from the start that I am new to R with my only other "programming" experience being in the MATLAB environment.

I have been running through many posts on stackoverflow related to my problem, but have not yet found a post that matches my exact problem so I chose to post it here.

Problem Definition

After exporting data (that captures information related to a measurement device) from SQL into a csv file, I imported the data into R using the read.table command as follows:

tbl = read.csv("myfile.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE);

This provides a data frame with over 17, 000 observations on 8 variables. Of these 8 variables, only the last 2 (columns) of are interest (ReadingTime and Reading) so I reduce these data frames to df as follows:

df = tbl[,c(7,8)];

For visualization purposes, here are what the first 25 elements of df look like:

    df[c(1:25),]
               ReadingTime Reading
1  2015-Dec-31 11:00:00 PM    3.52
2  2015-Dec-31 10:00:00 PM    3.97
3   2015-Dec-31 9:00:00 PM    3.85
4   2015-Dec-31 8:00:00 PM    3.94
5   2015-Dec-31 7:00:00 PM    4.47
6   2015-Dec-31 6:00:00 PM    4.75
7   2015-Dec-31 5:00:00 PM    6.58
8   2015-Dec-31 4:00:00 PM    6.99
9   2015-Dec-31 3:00:00 PM    7.50
10  2015-Dec-31 2:00:00 PM    6.28
11  2015-Dec-31 1:00:00 PM    6.16
12 2015-Dec-31 12:00:00 PM    4.49
13 2015-Dec-31 11:00:00 AM    4.30
14 2015-Dec-31 10:00:00 AM    4.27
15  2015-Dec-31 9:00:00 AM    4.54
16  2015-Dec-31 8:00:00 AM    4.30
17  2015-Dec-31 7:00:00 AM    4.52
18  2015-Dec-31 6:00:00 AM    4.65
19  2015-Dec-31 5:00:00 AM    4.25
20  2015-Dec-31 4:00:00 AM    4.45
21  2015-Dec-31 3:00:00 AM    4.26
22  2015-Dec-31 2:00:00 AM    5.02
23  2015-Dec-31 1:00:00 AM    5.17
24             2015-Dec-31    5.44
25 2015-Dec-30 11:00:00 PM    5.53

Objective

I would like to now convert df into an xts object with proper date time format so that I may create summary statistics and perform transformation on my data (e.g. convert the hourly time series into daily, weekly, etc. time series), and ultimately use the xts object in a forecasting exercise.

Difficulties Encountered

When attempting to convert the ReadingTime from df (i.e. date time in character format) into a date time format recognized by xts I run into an issue with the date times that occur at midnight. An example is as follows:

strptime(df[,1], "%Y-%b-%d %H:%M:%S %p",tz="GMT");
  df[c(1:25),1]
 [1] "2015-12-31 11:00:00 GMT" "2015-12-31 10:00:00 GMT" "2015-12-31 09:00:00 GMT"
 [4] "2015-12-31 08:00:00 GMT" "2015-12-31 07:00:00 GMT" "2015-12-31 06:00:00 GMT"
 [7] "2015-12-31 05:00:00 GMT" "2015-12-31 04:00:00 GMT" "2015-12-31 03:00:00 GMT"
[10] "2015-12-31 02:00:00 GMT" "2015-12-31 01:00:00 GMT" "2015-12-31 12:00:00 GMT"
[13] "2015-12-31 11:00:00 GMT" "2015-12-31 10:00:00 GMT" "2015-12-31 09:00:00 GMT"
[16] "2015-12-31 08:00:00 GMT" "2015-12-31 07:00:00 GMT" "2015-12-31 06:00:00 GMT"
[19] "2015-12-31 05:00:00 GMT" "2015-12-31 04:00:00 GMT" "2015-12-31 03:00:00 GMT"
[22] "2015-12-31 02:00:00 GMT" "2015-12-31 01:00:00 GMT" NA                       
[25] "2015-12-30 11:00:00 GMT"

Questions:

My three questions are as follows: 1) how come AM/PM is not being maintained - how can it be fixed (this has been addressed by @HubertL below)?; 2) how do I overcome the NA at [24], and convert it to its proper format?; AND 3) how can I turn df into an xts object?

Solutions Proposed by @HubertL So far @HubertL has addressed Q1. The first part of Answer 2 (A2) as proposed by @HubertL separates the various components of the ReadingTime and adds another column to df which looks like:

> df[c(1:25),]
               ReadingTime Reading                 dateSplit
1  2015-Dec-31 11:00:00 PM    3.52 2015-Dec-31, 11:00:00, PM
2  2015-Dec-31 10:00:00 PM    3.97 2015-Dec-31, 10:00:00, PM
3   2015-Dec-31 9:00:00 PM    3.85  2015-Dec-31, 9:00:00, PM
4   2015-Dec-31 8:00:00 PM    3.94  2015-Dec-31, 8:00:00, PM
5   2015-Dec-31 7:00:00 PM    4.47  2015-Dec-31, 7:00:00, PM
6   2015-Dec-31 6:00:00 PM    4.75  2015-Dec-31, 6:00:00, PM
7   2015-Dec-31 5:00:00 PM    6.58  2015-Dec-31, 5:00:00, PM
8   2015-Dec-31 4:00:00 PM    6.99  2015-Dec-31, 4:00:00, PM
9   2015-Dec-31 3:00:00 PM    7.50  2015-Dec-31, 3:00:00, PM
10  2015-Dec-31 2:00:00 PM    6.28  2015-Dec-31, 2:00:00, PM
11  2015-Dec-31 1:00:00 PM    6.16  2015-Dec-31, 1:00:00, PM
12 2015-Dec-31 12:00:00 PM    4.49 2015-Dec-31, 12:00:00, PM
13 2015-Dec-31 11:00:00 AM    4.30 2015-Dec-31, 11:00:00, AM
14 2015-Dec-31 10:00:00 AM    4.27 2015-Dec-31, 10:00:00, AM
15  2015-Dec-31 9:00:00 AM    4.54  2015-Dec-31, 9:00:00, AM
16  2015-Dec-31 8:00:00 AM    4.30  2015-Dec-31, 8:00:00, AM
17  2015-Dec-31 7:00:00 AM    4.52  2015-Dec-31, 7:00:00, AM
18  2015-Dec-31 6:00:00 AM    4.65  2015-Dec-31, 6:00:00, AM
19  2015-Dec-31 5:00:00 AM    4.25  2015-Dec-31, 5:00:00, AM
20  2015-Dec-31 4:00:00 AM    4.45  2015-Dec-31, 4:00:00, AM
21  2015-Dec-31 3:00:00 AM    4.26  2015-Dec-31, 3:00:00, AM
22  2015-Dec-31 2:00:00 AM    5.02  2015-Dec-31, 2:00:00, AM
23  2015-Dec-31 1:00:00 AM    5.17  2015-Dec-31, 1:00:00, AM
24             2015-Dec-31    5.44               2015-Dec-31
25 2015-Dec-30 11:00:00 PM    5.53 2015-Dec-30, 11:00:00, PM

Now when running the second line of code for A2 I run into the problem that the suggested lengths function does not exist for my version of R (3.1.1), so I replaced by the length function, is this okay to do? Anyway the result of running the second and third lines of code for A2 is as follows:

> df[c(1:25),]
               ReadingTime Reading                 dateSplit                date
1  2015-Dec-31 11:00:00 PM    3.52 2015-Dec-31, 11:00:00, PM 2015-12-31 23:00:00
2  2015-Dec-31 10:00:00 PM    3.97 2015-Dec-31, 10:00:00, PM 2015-12-31 22:00:00
3   2015-Dec-31 9:00:00 PM    3.85  2015-Dec-31, 9:00:00, PM 2015-12-31 21:00:00
4   2015-Dec-31 8:00:00 PM    3.94  2015-Dec-31, 8:00:00, PM 2015-12-31 20:00:00
5   2015-Dec-31 7:00:00 PM    4.47  2015-Dec-31, 7:00:00, PM 2015-12-31 19:00:00
6   2015-Dec-31 6:00:00 PM    4.75  2015-Dec-31, 6:00:00, PM 2015-12-31 18:00:00
7   2015-Dec-31 5:00:00 PM    6.58  2015-Dec-31, 5:00:00, PM 2015-12-31 17:00:00
8   2015-Dec-31 4:00:00 PM    6.99  2015-Dec-31, 4:00:00, PM 2015-12-31 16:00:00
9   2015-Dec-31 3:00:00 PM    7.50  2015-Dec-31, 3:00:00, PM 2015-12-31 15:00:00
10  2015-Dec-31 2:00:00 PM    6.28  2015-Dec-31, 2:00:00, PM 2015-12-31 14:00:00
11  2015-Dec-31 1:00:00 PM    6.16  2015-Dec-31, 1:00:00, PM 2015-12-31 13:00:00
12 2015-Dec-31 12:00:00 PM    4.49 2015-Dec-31, 12:00:00, PM 2015-12-31 12:00:00
13 2015-Dec-31 11:00:00 AM    4.30 2015-Dec-31, 11:00:00, AM 2015-12-31 11:00:00
14 2015-Dec-31 10:00:00 AM    4.27 2015-Dec-31, 10:00:00, AM 2015-12-31 10:00:00
15  2015-Dec-31 9:00:00 AM    4.54  2015-Dec-31, 9:00:00, AM 2015-12-31 09:00:00
16  2015-Dec-31 8:00:00 AM    4.30  2015-Dec-31, 8:00:00, AM 2015-12-31 08:00:00
17  2015-Dec-31 7:00:00 AM    4.52  2015-Dec-31, 7:00:00, AM 2015-12-31 07:00:00
18  2015-Dec-31 6:00:00 AM    4.65  2015-Dec-31, 6:00:00, AM 2015-12-31 06:00:00
19  2015-Dec-31 5:00:00 AM    4.25  2015-Dec-31, 5:00:00, AM 2015-12-31 05:00:00
20  2015-Dec-31 4:00:00 AM    4.45  2015-Dec-31, 4:00:00, AM 2015-12-31 04:00:00
21  2015-Dec-31 3:00:00 AM    4.26  2015-Dec-31, 3:00:00, AM 2015-12-31 03:00:00
22  2015-Dec-31 2:00:00 AM    5.02  2015-Dec-31, 2:00:00, AM 2015-12-31 02:00:00
23  2015-Dec-31 1:00:00 AM    5.17  2015-Dec-31, 1:00:00, AM 2015-12-31 01:00:00
24             2015-Dec-31    5.44               2015-Dec-31                <NA>
25 2015-Dec-30 11:00:00 PM    5.53 2015-Dec-30, 11:00:00, PM 2015-12-30 23:00:00

You can see the NA still exists for [24]. This issue results in all measurements that occur at midnight being disclosed from the xts object when I apply the code suggested for A3. I.e.:

> df[c(1:25),]
                    [,1]
2014-01-01 01:00:00 4.67
2014-01-01 02:00:00 4.78
2014-01-01 03:00:00 4.87
2014-01-01 04:00:00 4.61
2014-01-01 05:00:00 4.58
2014-01-01 06:00:00 4.47
2014-01-01 07:00:00 4.66
2014-01-01 08:00:00 4.46
2014-01-01 09:00:00 4.57
2014-01-01 10:00:00 4.87
2014-01-01 11:00:00 4.57
2014-01-01 12:00:00 4.67
2014-01-01 13:00:00 5.52
2014-01-01 14:00:00 6.42
2014-01-01 15:00:00 6.79
2014-01-01 16:00:00 6.50
2014-01-01 17:00:00 5.81
2014-01-01 18:00:00 5.65
2014-01-01 19:00:00 6.25
2014-01-01 20:00:00 5.79
2014-01-01 21:00:00 5.84
2014-01-01 22:00:00 6.06
2014-01-01 23:00:00 4.74
2014-01-02 01:00:00 4.66
2014-01-02 02:00:00 5.59

Any help clearing up these last couple of issues would be more than appreciated!

Upvotes: 1

Views: 250

Answers (1)

HubertL
HubertL

Reputation: 19544

Answer 1: use %I instead of %H

test = strptime(..., "%Y-%b-%d %I:%M:%S %p",tz="GMT")

Answer 2:

df$dateSplit <- strsplit( df$ReadingTime, " ")
df[lengths(df$dateSplit)<3,"ReadingTime"] <- 
    format(
        strptime(df$date[lengths(df$dateSplit)<3], "%Y-%b-%d", tz="GMT"),
        "%Y-%b-%d %I:%M:%S %p", tz="GMT")
df$date <- strptime(df$ReadingTime, "%Y-%b-%d %I:%M:%S %p", tz="GMT")

Answer 3:

require(xts)
xts(df$Reading,df$date)

Upvotes: 2

Related Questions