R taking list objects into table

Question

I have a list containing many objects. I would like to create a summary table from the attributes of each object in the list.

head(cascade.list)
$`444424960908754944`
screen_name     tweet_id        tweet_created_at        retweet_screen_name
NerSeref        6.028628e+17    2015-05-25 11:44:24     Lasthowen
DURULMA_ZAMANI  6.028631e+17    2015-05-25 11:45:32     Lasthowen
ssari75         6.028647e+17    2015-05-25 11:52:10     Lasthowen   
saintserif2009  6.028672e+17    2015-05-25 12:01:48     Lasthowen
Hejinilim       6.028721e+17    2015-05-25 12:21:13     Lasthowen

$`407136916317171712`
screen_name     tweet_id        tweet_created_at        retweet_screen_name
isa_sakar       6.072663e+17    2015-06-06 15:22:18     cavurizmir
canfeda1923     6.072666e+17    2015-06-06 15:23:34     cavurizmir
Apolloniuss_58  6.072669e+17    2015-06-06 15:24:47     cavurizmir

I need to create a table with the following structure:

table
retweet_screen_name screen_name         length  life(seconds)
Lasthowen           Hejinilim           5       2209
cavurizmir          Apolloniuss_58      3       149

the first column will be the name in retweet_screen_name (it is duplicated within each list object, so one value is enough),
the second column will be the last screen_name of the list object,
the third column will be the length (number of rows) of the list object,
the fourth column will be the time difference in seconds between the first and the last tweet_created_at of the list object.

I used this function, and it solved half of the problem:

get.summary <- function(i){
        curr.frame = cascade.list[[i]]
        return(c(unique(curr.frame$retweet_screen_name),curr.frame$screen_name[nrow(curr.frame)],
                 unique(curr.frame$retweet_created_at), curr.frame$tweet_created_at[nrow(curr.frame)], 
                 nrow(curr.frame)))
}

and this code:

cdf=data.frame(t(sapply(1:length(cascade.list),get.summary)))

It creates a data frame with all variables in the same row:

 V1                                                                                     V2
c("EastanbulTimes", "onuryasercan", "2010-12-20 15:18:22", "2015-05-19 18:28:25", "1")  c("Lasthowen", "Apolloniuss_58", "2013-12-01 08:19:39", "2015-06-06 15:24:47", "3")

I need to fix the data frame structure: it should have 6 columns and as many rows as the list has elements. I also need to add the time variable.

Thanks for all advice in advance.

Jaap · Accepted Answer

Because cascade.list is a list of dataframes with equal columns, you can bind them together into one dataset and then perform the aggregation you need. An implementation with data.table:

# make a list of the dataframes (see below for the used dataframes)
dflist <- list(df1,df2)
# bind the dataframes together into one data.table (which is an enhanced dataframe)
library(data.table)
DT <- rbindlist(dflist)

With the resulting data.table you can now perform the required summarisation as follows:

DT[, .(screen_name = screen_name[.N],
       length = .N,
       life_in_seconds = difftime(tweet_created_at[.N], tweet_created_at[1], units="secs")),
   by = .(retweet_screen_name)]

which results in:

   retweet_screen_name    screen_name length life_in_seconds
1:           Lasthowen      Hejinilim      5       2209 secs
2:          cavurizmir Apolloniuss_58      3        149 secs

Explanation:

.N is a special data.table symbol which holds the total number of rows in a group (or in the whole data.table when no grouping is used).
screen_name[.N] will give you the last screen_name because it is indexed with the total number of rows and thus gives you the last observation of each group. Likewise screen_name[1] would give you the first observation in each group.
difftime more or less speaks for itself: it returns the difference between two date-times. With units you can specify how the time difference is expressed. See ?difftime for the possibilities.
With by = you can specify which columns should be used for determining the grouping of the data.
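
One thing to keep in mind: grouping by retweet_screen_name merges list elements that happen to share the same retweet_screen_name. If every list element should yield its own row regardless, a possible variation is to keep the list names as an id column with the idcol argument of rbindlist and group by that instead. The sketch below uses a made-up toy.list (hypothetical names and values, not the original data) and converts the difftime to a plain number:

```r
library(data.table)

# Toy stand-in for cascade.list: a named list of data frames (made-up values)
toy.list <- list(
  a = data.frame(screen_name = c("u1", "u2", "u3"),
                 tweet_created_at = as.POSIXct(c("2015-05-25 11:44:24",
                                                 "2015-05-25 11:45:32",
                                                 "2015-05-25 11:52:10"), tz = "UTC"),
                 retweet_screen_name = "src",
                 stringsAsFactors = FALSE),
  b = data.frame(screen_name = c("u4", "u5"),
                 tweet_created_at = as.POSIXct(c("2015-06-06 15:22:18",
                                                 "2015-06-06 15:24:47"), tz = "UTC"),
                 retweet_screen_name = "src",
                 stringsAsFactors = FALSE)
)

# idcol stores the list names in a new column, so each list element stays a separate group
DT <- rbindlist(toy.list, idcol = "cascade_id")

res <- DT[, .(retweet_screen_name = retweet_screen_name[1],
              screen_name = screen_name[.N],
              length = .N,
              life_in_seconds = as.numeric(difftime(tweet_created_at[.N],
                                                    tweet_created_at[1],
                                                    units = "secs"))),
          by = cascade_id]
print(res)
```

Here both toy elements share the same retweet_screen_name, yet res still has one row per list element because the grouping is done on cascade_id.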

A similar operation can be done with dplyr:

library(dplyr)

newdf <- bind_rows(dflist)

newdf %>% group_by(retweet_screen_name) %>% 
  summarise(screen_name = last(screen_name),
            length = n(),
            life_in_seconds = difftime(last(tweet_created_at), first(tweet_created_at), units="secs"))
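
For comparison, the same summary can also be sketched in base R without extra packages. Combining values with c(), as in the function from the question, coerces them all to a single type, so this sketch returns a one-row data frame per list element and stacks those with rbind instead. The helper name summarise.cascade and the toy.list data are hypothetical, chosen only for illustration:

```r
# Hypothetical helper: summarise one data frame of a cascade into a single row
summarise.cascade <- function(df) {
  n <- nrow(df)
  data.frame(
    retweet_screen_name = as.character(df$retweet_screen_name[1]),
    screen_name = as.character(df$screen_name[n]),   # last screen_name
    length = n,                                      # number of rows
    life_in_seconds = as.numeric(difftime(df$tweet_created_at[n],
                                          df$tweet_created_at[1],
                                          units = "secs")),
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for the list of data frames (made-up values)
toy.list <- list(
  a = data.frame(screen_name = c("u1", "u2", "u3"),
                 tweet_created_at = as.POSIXct(c("2015-05-25 11:44:24",
                                                 "2015-05-25 11:45:32",
                                                 "2015-05-25 11:52:10"), tz = "UTC"),
                 retweet_screen_name = "src1",
                 stringsAsFactors = FALSE),
  b = data.frame(screen_name = c("u4", "u5"),
                 tweet_created_at = as.POSIXct(c("2015-06-06 15:22:18",
                                                 "2015-06-06 15:24:47"), tz = "UTC"),
                 retweet_screen_name = "src2",
                 stringsAsFactors = FALSE)
)

# One row per list element, each column keeping its own type
cdf <- do.call(rbind, lapply(toy.list, summarise.cascade))
print(cdf)
```

Because each piece is already a data frame, the character, integer and numeric columns keep their types after rbind.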

Used data:

df1 <- structure(list(screen_name = structure(c(3L, 1L, 5L, 4L, 2L), .Label = c("DURULMA_ZAMANI", "Hejinilim", "NerSeref", "saintserif2009", "ssari75"), class = "factor"), tweet_id = c(6.028628e+17, 6.028631e+17, 6.028647e+17, 6.028672e+17, 6.028721e+17), tweet_created_at = structure(c(1432547064, 1432547132, 1432547530, 1432548108, 1432549273), class = c("POSIXct", "POSIXt"), tzone = ""), retweet_screen_name = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Lasthowen", class = "factor")), .Names = c("screen_name", "tweet_id", "tweet_created_at", "retweet_screen_name"), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(screen_name = structure(c(3L, 2L, 1L), .Label = c("Apolloniuss_58", "canfeda1923", "isa_sakar"), class = "factor"), tweet_id = c(6.072663e+17, 6.072666e+17, 6.072669e+17), tweet_created_at = structure(c(1433596938, 1433597014, 1433597087), class = c("POSIXct", "POSIXt"), tzone = ""), retweet_screen_name = structure(c(1L, 1L, 1L), .Label = "cavurizmir", class = "factor")), .Names = c("screen_name", "tweet_id", "tweet_created_at", "retweet_screen_name"), class = "data.frame", row.names = c(NA, -3L))
