Yves
Yves

Reputation: 556

Transformation of a vector to a data frame with two columns in R

I have the following dataset, it consists of ID's (integer, although all is stored as a factor) and units of time, which belong to the ID above (see an excerpt of data below, total data has 300'000+ values).

10000007
01:33:28
00:00:05
10000010
00:00:39
00:02:21
00:05:59
00:40:40
00:09:53
10000014
00:09:37
00:00:00
00:00:15
00:00:02
00:40:51
10000022
00:41:44
01:32:32
10000023
01:07:14
01:25:27
1000003
00:00:58
00:03:42
00:05:58
00:00:29
00:00:01
00:04:02
00:00:10
00:01:08
00:07:52
00:10:18
00:12:52
00:09:39
00:07:48
00:04:29

What I want is a data frame, which has in one column the ID and the other column the time. Something like this:

10000007  01:33:28
10000007  00:00:05
10000010  00:00:39
10000010  00:02:21
10000010  00:05:59
10000010  00:40:40
10000010  00:09:53
10000014  00:09:37
10000014  00:00:00
10000014  00:00:15
10000014  00:00:02
10000014  00:40:51
...

I struggle with the fact that the number of time stamps differs between the different ID's and the fact that I can't change the format to numeric efficiently, which might make the operation handier.

Eventually, the idea is to have the times summed up to have only one time per ID. Thank you very much in advance!

Upvotes: 1

Views: 75

Answers (3)

akrun
akrun

Reputation: 886938

In base R, we can use stack

i1 <- !grepl(":", df$V1)
out <- stack(setNames(split(df$V1[!i1], cumsum(i1)[!i1]), df$V1[i1]))[2:1]
head(out)
#    ind   values
#1 10000007 01:33:28
#2 10000007 00:00:05
#3 10000010 00:00:39
#4 10000010 00:02:21
#5 10000010 00:05:59
#6 10000010 00:40:40


dim(out)
#[1] 30  2

Or it can be done with data.frame call

data.frame(V1 = df$V1[i1][cumsum(i1)], V2 = df$V1)[!i1,]

Or with transform

transform(df, V2 = V1[i1][cumsum(i1)])[!i1,]

data

df <- structure(list(V1 = c("10000007", "01:33:28", "00:00:05", "10000010", 
"00:00:39", "00:02:21", "00:05:59", "00:40:40", "00:09:53", "10000014", 
"00:09:37", "00:00:00", "00:00:15", "00:00:02", "00:40:51", "10000022", 
"00:41:44", "01:32:32", "10000023", "01:07:14", "01:25:27", "1000003", 
"00:00:58", "00:03:42", "00:05:58", "00:00:29", "00:00:01", "00:04:02", 
"00:00:10", "00:01:08", "00:07:52", "00:10:18", "00:12:52", "00:09:39", 
"00:07:48", "00:04:29")), class = "data.frame", row.names = c(NA, 
-36L))

Upvotes: 1

jay.sf
jay.sf

Reputation: 72593

In base R split at those values where as.POSIXct yield NA, then rbind the result.

res <- do.call(rbind.data.frame, 
               by(x, cumsum(is.na(as.POSIXct(x, format="%T"))), function(x) 
                 cbind(as.character(x)[1], as.character(x)[-1])))
head(res)
#           V1       V2
# 1.1 10000007 01:33:28
# 1.2 10000007 00:00:05
# 2.1 10000010 00:00:39
# 2.2 10000010 00:02:21
# 2.3 10000010 00:05:59
# 2.4 10000010 00:40:40

Data:

x <- structure(c(31L, 30L, 4L, 32L, 8L, 11L, 16L, 24L, 21L, 33L, 19L, 
1L, 6L, 3L, 25L, 34L, 26L, 29L, 35L, 27L, 28L, 36L, 9L, 12L, 
15L, 7L, 2L, 13L, 5L, 10L, 18L, 22L, 23L, 20L, 17L, 14L), .Label = c("00:00:00", 
"00:00:01", "00:00:02", "00:00:05", "00:00:10", "00:00:15", "00:00:29", 
"00:00:39", "00:00:58", "00:01:08", "00:02:21", "00:03:42", "00:04:02", 
"00:04:29", "00:05:58", "00:05:59", "00:07:48", "00:07:52", "00:09:37", 
"00:09:39", "00:09:53", "00:10:18", "00:12:52", "00:40:40", "00:40:51", 
"00:41:44", "01:07:14", "01:25:27", "01:32:32", "01:33:28", "10000007", 
"10000010", "10000014", "10000022", "10000023", "1000003"), class = "factor")

Upvotes: 1

tmfmnk
tmfmnk

Reputation: 39858

One dplyr option could be:

df %>%
 group_by(grp = cumsum(grepl("^100", V1))) %>%
 mutate(V2 = first(V1)) %>%
 slice(-1) %>%
 ungroup() %>%
 select(-grp) 

   V1       V2      
   <chr>    <chr>   
 1 01:33:28 10000007
 2 00:00:05 10000007
 3 00:00:39 10000010
 4 00:02:21 10000010
 5 00:05:59 10000010
 6 00:40:40 10000010
 7 00:09:53 10000010
 8 00:09:37 10000014
 9 00:00:00 10000014
10 00:00:15 10000014
# … with 20 more rows

Upvotes: 1

Related Questions