bikingman

Reputation: 191

How to extract section of string from each string in factor vector in R?

I have a column of timestamps (currently classed as factors) that I'd like to extract from. There are about 220,000 rows; about half are formatted as "%d.%m.%Y %H:%M:%S" and the other half as "%d.%m.%Y %H:%M". So about half have three more characters than the other half.

From each row I'd like to extract the "%d.%m.%Y %H:%M" part, leaving out the ":%S" from those that contain it.

Since they're classed as factors, my assumption is that they could be extracted by position, i.e. extract the 1st to 16th characters.

This is what the data looks like.

Start.Timestamp
01.01.2015 12:32
01.01.2015 08:22
01.01.2015 14:10
31.12.2014 21:43
01.01.2015 00:21
01.01.2015 12:38
01.01.2015 01:00
01.01.2015 14:13
01.01.2016 04:11:34
01.01.2016 10:13:30
01.01.2016 04:30:08
01.01.2016 08:49:40
01.01.2016 07:44:45

Alternatively, adding ":00" to the strings that are missing the "%S" part would also be acceptable.

I hope this is clear.

Thank you all in advance

Upvotes: 0

Views: 983

Answers (4)

Abdou

Reputation: 13274

Since the column is a factor, I would recommend converting it to a character vector and then using strptime() and strftime() to parse the timestamps and re-format them without the seconds:

Your vector in a data.frame

your_df <- structure(list(Start.Timestamp = structure(c(4L, 3L, 6L, 13L, 
1L, 5L, 2L, 7L, 8L, 12L, 9L, 11L, 10L), .Label = c("01.01.2015 00:21", 
"01.01.2015 01:00", "01.01.2015 08:22", "01.01.2015 12:32", "01.01.2015 12:38", 
"01.01.2015 14:10", "01.01.2015 14:13", "01.01.2016 04:11:34", 
"01.01.2016 04:30:08", "01.01.2016 07:44:45", "01.01.2016 08:49:40", 
"01.01.2016 10:13:30", "31.12.2014 21:43"), class = "factor")), .Names = "Start.Timestamp", class = "data.frame", row.names = c(NA, 
-13L))

Getting your desired output:

strftime(strptime(as.character(your_df$Start.Timestamp), format = "%d.%m.%Y %H:%M"), "%d.%m.%Y %H:%M")

Output

 [1] "01.01.2015 12:32" "01.01.2015 08:22" "01.01.2015 14:10" "31.12.2014 21:43" "01.01.2015 00:21"
 [6] "01.01.2015 12:38" "01.01.2015 01:00" "01.01.2015 14:13" "01.01.2016 04:11" "01.01.2016 10:13"
[11] "01.01.2016 04:30" "01.01.2016 08:49" "01.01.2016 07:44"
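
A minimal sketch of why one format string covers both layouts: strptime() reads each string only as far as the format requires, so a trailing ":SS" is ignored. The vector `x` is a hypothetical two-element sample rather than the asker's data, and `tz = "UTC"` is an assumption made only to keep the result independent of the local time zone.

```r
# Hypothetical sample: one value with seconds, one without
x <- factor(c("01.01.2016 04:11:34", "01.01.2015 12:32"))

# Parse only up to the minutes; the trailing ":34" is ignored
parsed <- strptime(as.character(x), format = "%d.%m.%Y %H:%M", tz = "UTC")

# Format back to text without seconds
result <- format(parsed, "%d.%m.%Y %H:%M")
result
# [1] "01.01.2016 04:11" "01.01.2015 12:32"
```

Keeping `parsed` instead of `result` gives date-time values that can be sorted or subtracted, if that is what is needed later.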

Upvotes: 1

RHertel

Reputation: 23808

We can use lubridate's dmy_hms() function with the option truncated = 1 to generate a POSIXct object. This option is helpful when the time data has incomplete entries, like missing seconds in this case (which are then set to 00).

The output of dmy_hms() can then be wrapped into format() to obtain the desired form:

format(lubridate::dmy_hms(df1$Start.Timestamp, truncated = 1),"%d.%m.%Y %H:%M")
# [1] "01.01.2015 12:32" "01.01.2015 08:22" "01.01.2015 14:10" "31.12.2014 21:43"
# [5] "01.01.2015 00:21" "01.01.2015 12:38" "01.01.2015 01:00" "01.01.2015 14:13"
# [9] "01.01.2016 04:11" "01.01.2016 10:13" "01.01.2016 04:30" "01.01.2016 08:49"
#[13] "01.01.2016 07:44"

data

df1 <- structure(list(Start.Timestamp = structure(c(4L, 3L, 6L, 13L, 
1L, 5L, 2L, 7L, 8L, 12L, 9L, 11L, 10L), .Label = c("01.01.2015 00:21", 
"01.01.2015 01:00", "01.01.2015 08:22", "01.01.2015 12:32", "01.01.2015 12:38", 
"01.01.2015 14:10", "01.01.2015 14:13", "01.01.2016 04:11:34", 
"01.01.2016 04:30:08", "01.01.2016 07:44:45", "01.01.2016 08:49:40", 
"01.01.2016 10:13:30", "31.12.2014 21:43"), class = "factor")), 
.Names = "Start.Timestamp", class = "data.frame", row.names = c(NA, -13L))

Upvotes: 1

Christoph

Reputation: 7063

It depends on the exact format (please post a sample of your data using dput()). One possibility is

> str <- c("01.01.2016 07:44", "01.01.2016 07:45")
> substr(str, 1,16)
[1] "01.01.2016 07:44" "01.01.2016 07:45"

if truncation is ok.
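
If truncation is not acceptable, the alternative mentioned in the question (padding the shorter strings with ":00") could be sketched roughly as follows. The names `stamps`, `chr` and `needs_pad` are hypothetical, and the sketch assumes clean input with no missing values, where every string without seconds is exactly 16 characters long.

```r
# Hypothetical factor mixing both timestamp layouts
stamps <- factor(c("01.01.2015 12:32", "01.01.2016 04:11:34"))

# Work on the labels as plain character strings
chr <- as.character(stamps)

# Strings without seconds are exactly 16 characters long
needs_pad <- nchar(chr) == 16

# Append ":00" only to those shorter strings
chr[needs_pad] <- paste0(chr[needs_pad], ":00")
chr
# [1] "01.01.2015 12:32:00" "01.01.2016 04:11:34"
```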

Upvotes: 0

Benjamin Mohn

Reputation: 301

Depending on what your initial data is, something like:

lapply(df,substring, first=1, last=16)

might help.

When your data is like this:

df <- data.frame("Start.Timestamp",
             "01.01.2015 12:32",
             "01.01.2015 08:22",
             "01.01.2015 14:10",
             "31.12.2014 21:43",
             "01.01.2015 00:21",
             "01.01.2015 12:38",
             "01.01.2015 01:00",
             "01.01.2015 14:13",
             "01.01.2016 04:11:34",
             "01.01.2016 10:13:30",
             "01.01.2016 04:30:08",
             "01.01.2016 08:49:40",
             "01.01.2016 07:44:45")

lapply(df,substring, first=1, last=16)
#$X.Start.Timestamp.
#[1] "Start.Timestamp"

#$X.01.01.2015.12.32.
#[1] "01.01.2015 12:32"

#$X.01.01.2015.08.22.
#[1] "01.01.2015 08:22"

#$X.01.01.2015.14.10.
#[1] "01.01.2015 14:10"

#$X.31.12.2014.21.43.
#[1] "31.12.2014 21:43"

...

Or use any of the other apply functions, since I do not know how your whole data set is structured.
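
For comparison, here is a sketch of the same idea when the timestamps sit in a single factor column, as the question describes. The data frame `df_col` and the new column `Start.Short` are hypothetical names; `stringsAsFactors = TRUE` is set explicitly so the column is a factor regardless of the R version.

```r
# Hypothetical data frame with one factor column of timestamps
df_col <- data.frame(
  Start.Timestamp = c("01.01.2015 12:32", "01.01.2016 04:11:34"),
  stringsAsFactors = TRUE
)

# substring() is vectorised, so no apply call is needed for one column;
# as.character() makes the factor-to-text conversion explicit
df_col$Start.Short <- substring(as.character(df_col$Start.Timestamp),
                                first = 1, last = 16)
df_col$Start.Short
# [1] "01.01.2015 12:32" "01.01.2016 04:11"
```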

Upvotes: 1
