glor

Reputation: 109

Extract time stamps from string and convert to R POSIXct object

Currently, my dataset has a time variable (factor) in the following format:

weekday month day hour min seconds +0000 year

I don't know what the "+0000" field is but all observations have this. For example:

"Tues Feb 02 11:05:21 +0000 2018" 
"Mon Jun 12 06:21:50 +0000 2017"
"Wed Aug 01 11:24:08 +0000 2018"

I want to convert these values to POSIXlt or POSIXct objects (year-month-day hour:min:sec) and then make them numeric. Currently, as.numeric(as.character(time-variable)) outputs incorrect values.

Thank you for the great responses! I really appreciate it.

Upvotes: 0

Views: 438

Answers (2)

Gautam

Reputation: 2753

For this problem you can get by without using lubridate. First, to extract individual dates we can use regmatches and gregexpr:

date_char <- 'Tue Feb 02 11:05:21 +0000 2018 Mon Jun 12 06:21:50 +0000 2017'
ptrn <- '([[:alpha:]]{3} [[:alpha:]]{3} [[:digit:]]{2} [[:digit:]]{2}\\:[[:digit:]]{2}\\:[[:digit:]]{2} \\+[[:digit:]]{4} [[:digit:]]{4})'
date_vec <- unlist( regmatches(date_char, gregexpr(ptrn, date_char)))

> date_vec
[1] "Tue Feb 02 11:05:21 +0000 2018" "Mon Jun 12 06:21:50 +0000 2017"

You can learn more about regular expressions in R's ?regex help page.

In the above example, the +0000 field is the UTC offset in hours and minutes (±hhmm), e.g. it would be -0500 for the EST time zone. To convert to an R date-time object:

> as.POSIXct(date_vec, format = '%a %b %d %H:%M:%S %z %Y', tz = 'UTC')
[1] "2018-02-02 11:05:21 UTC" "2017-06-12 06:21:50 UTC"

which is the desired output. The format codes are documented in ?strptime, or you can use lubridate::guess_formats(). The tz argument only controls the time zone the result is displayed in; if you don't specify it, you'll get the output in your system's time zone (e.g. for me that would be EST). Since the offset is specified in the format via %z, R correctly carries out the conversion.
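
To see that %z really is applied, here is a minimal sketch with a made-up pair of inputs (not from the OP's data) that share the same clock time but carry different offsets. It assumes an English locale, since %a and %b match locale-specific names:

```r
# Hypothetical inputs: same wall-clock time, two different UTC offsets.
x <- c("Fri Feb 02 11:05:21 +0000 2018",
       "Fri Feb 02 11:05:21 -0500 2018")

# %z consumes the signed hhmm offset; tz sets the zone used for display.
parsed <- as.POSIXct(x, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

# Print both instants in UTC.
format(parsed, "%Y-%m-%d %H:%M:%S")

# Difference between the two instants, in seconds.
diff(as.numeric(parsed))
```

The second value should land five hours after the first once both are expressed in UTC.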

To get numeric values, the following works:

> as.numeric(as.POSIXct(date_vec, format = '%a %b %d %H:%M:%S %z %Y', tz = 'UTC'))
[1] 1517569521 1497248510

Note: this relies on a uniform string structure. The OP had Tues instead of Tue, which wouldn't work with %a. The above example is based on the three-letter abbreviation, which is the standard reporting format.
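
If the data really does contain longer weekday spellings such as Tues, one possible workaround (a sketch, not the only option) is to trim the leading weekday token to three letters before parsing. The variable names here are made up for illustration, and an English locale is assumed:

```r
x <- c("Tues Feb 02 11:05:21 +0000 2018",
       "Mon Jun 12 06:21:50 +0000 2017")

# Keep only the first three letters of the leading weekday token,
# so "Tues" becomes "Tue" and "Mon" is left unchanged.
x_norm <- sub("^([[:alpha:]]{3})[[:alpha:]]*", "\\1", x)

parsed <- as.POSIXct(x_norm, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
parsed
```

Since the weekday is redundant once the full date is known, dropping that token entirely and removing %a from the format would be another way to go.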

If, however, your data is a mix of different formats, you'd have to extract the individual time strings (with customized regexes, of course), then use lubridate::guess_formats() to get the formats, and then use those to carry out the conversion.
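
As a rough base-R alternative to lubridate::guess_formats() (a swapped-in technique, not what is described above), you could try a list of candidate formats in order and fill in whatever is still unparsed. The second input layout and the candidate list below are invented for illustration, and an English locale is assumed:

```r
# Hypothetical mixed input; the second layout is made up for this sketch.
x <- c("Mon Jun 12 06:21:50 +0000 2017",
       "2018-08-01 11:24:08")

# Candidate formats, tried in order.
fmts <- c("%a %b %d %H:%M:%S %z %Y",
          "%Y-%m-%d %H:%M:%S")

# Start with all-NA date-times, then fill the gaps format by format.
parsed <- .POSIXct(rep(NA_real_, length(x)), tz = "UTC")
for (f in fmts) {
  todo <- is.na(parsed)
  if (!any(todo)) break
  parsed[todo] <- as.POSIXct(x[todo], format = f, tz = "UTC")
}
parsed
```

Anything that matches none of the candidates stays NA, which makes it easy to spot the strings that still need a format of their own.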

Hope this is helpful!!

Upvotes: 0

Pavel Paltsev

Reputation: 168

I'm not sure how to reproduce the factor-to-character conversion, but starting from a character string, this code should work:

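# Split on spaces: weekday, month, day, time, offset, year
parts <- unlist(strsplit(as.character("Tues Feb 02 11:05:21 +0000 2018"), " "))
# Rebuild as "year month day time"; the offset is +0000, so parse as UTC
strptime(paste(parts[6], parts[2], parts[3], parts[4]), format = '%Y %b %d %H:%M:%S', tz = 'UTC')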

PS: More on date formats and conversion: https://www.stat.berkeley.edu/~s133/dates.html
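
For the factor-to-character step, here is a minimal sketch, assuming the column is a plain factor of timestamp strings (time_var is a made-up stand-in for the OP's variable) and an English locale:

```r
# Hypothetical factor column standing in for the question's time variable.
time_var <- factor(c("Mon Jun 12 06:21:50 +0000 2017",
                     "Wed Aug 01 11:24:08 +0000 2018"))

# as.numeric() on a factor returns the integer level codes, not timestamps.
as.numeric(time_var)

# Convert factor -> character -> POSIXct -> seconds since 1970-01-01 UTC.
time_ct  <- as.POSIXct(as.character(time_var),
                       format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
time_num <- as.numeric(time_ct)
time_num
```

Calling as.numeric() on the character strings themselves would give NA, because the text is not a number, so the date-time parse has to happen in between.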

Upvotes: 1
