eabanoz

Reputation: 351

R taking list objects into table

I have a list that contains many data frames. I would like to create a summary table from the attributes of each element of this list.

head(cascade.list)
$`444424960908754944`
screen_name     tweet_id        tweet_created_at        retweet_screen_name
NerSeref        6.028628e+17    2015-05-25 11:44:24     Lasthowen
DURULMA_ZAMANI  6.028631e+17    2015-05-25 11:45:32     Lasthowen
ssari75         6.028647e+17    2015-05-25 11:52:10     Lasthowen   
saintserif2009  6.028672e+17    2015-05-25 12:01:48     Lasthowen
Hejinilim       6.028721e+17    2015-05-25 12:21:13     Lasthowen

$`407136916317171712`
screen_name     tweet_id        tweet_created_at        retweet_screen_name
isa_sakar       6.072663e+17    2015-06-06 15:22:18     cavurizmir
canfeda1923     6.072666e+17    2015-06-06 15:23:34     cavurizmir
Apolloniuss_58  6.072669e+17    2015-06-06 15:24:47     cavurizmir

I need to create a table with the following columns:

table
retweet_screen_name screen_name         length  life(seconds)
Lasthowen           Hejinilim           5       2209
cavurizmir          Apolloniuss_58      3       149 

I used this function, which solved half of the problem:

get.summary <- function(i){
        curr.frame = cascade.list[[i]]
        return(c(unique(curr.frame$retweet_screen_name),curr.frame$screen_name[nrow(curr.frame)],
                 unique(curr.frame$retweet_created_at), curr.frame$tweet_created_at[nrow(curr.frame)], 
                 nrow(curr.frame)))
}    

and this code:

cdf=data.frame(t(sapply(1:length(cascade.list),get.summary)))

It creates a data frame with all variables in the same row:

 V1                                                                                     V2
c("EastanbulTimes", "onuryasercan", "2010-12-20 15:18:22", "2015-05-19 18:28:25", "1")  c("Lasthowen", "Apolloniuss_58", "2013-12-01 08:19:39", "2015-06-06 15:24:47", "3")

I need to fix the data frame structure: it should have 6 columns and as many rows as the list has elements. I also need to add a time variable.

Thanks in advance for any advice.

Upvotes: 1

Views: 1136

Answers (1)

Jaap

Reputation: 83215

Because cascade.list is a list of data frames with the same columns, you can bind them together into one dataset and then perform the aggregation you need. An implementation with data.table:

# make a list of the dataframes (see below for the used dataframes)
dflist <- list(df1,df2)
# bind the data frames together into one data.table (which is an enhanced data frame)
library(data.table)
DT <- rbindlist(dflist)
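
If the names of the list elements (the cascade ids in the question) should survive the binding, rbindlist can store them in an extra column through its idcol argument. The following is a minimal sketch under assumptions: `cascades` is a hypothetical, shortened stand-in for cascade.list, and the column name `cascade_id` is chosen here for illustration only.

```r
library(data.table)

# Hypothetical stand-in for cascade.list: a named list of data frames
cascades <- list(
  `444424960908754944` = data.frame(
    screen_name = c("NerSeref", "Hejinilim"),
    retweet_screen_name = "Lasthowen"
  ),
  `407136916317171712` = data.frame(
    screen_name = c("isa_sakar", "Apolloniuss_58"),
    retweet_screen_name = "cavurizmir"
  )
)

# idcol adds a column holding the name of the list element each row came from
DT_id <- rbindlist(cascades, idcol = "cascade_id")
```

Grouping by such an id column instead of retweet_screen_name may be preferable if one account can start more than one cascade.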

With the resulting data.table you can now perform the required summarisation as follows:

DT[, .(screen_name = screen_name[.N],
       length = .N,
       life_in_seconds = difftime(tweet_created_at[.N], tweet_created_at[1], units="secs")),
   by = .(retweet_screen_name)]

which results in:

   retweet_screen_name    screen_name length life_in_seconds
1:           Lasthowen      Hejinilim      5       2209 secs
2:          cavurizmir Apolloniuss_58      3        149 secs

Explanation:

  • .N is a special data.table symbol which gives you the total number of rows in a group (or in the whole data.table when no grouping is used).
  • screen_name[.N] will give you the last screen_name because it is indexed with the total number of rows and thus gives you the last observation of each group. Likewise screen_name[1] would give you the first observation in each group.
  • difftime more or less speaks for itself. With units you can specify how the time difference is expressed. See ?difftime for the possibilities.
  • With by = you can specify which columns should be used for determining the grouping of the data.
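
The points above can be tried out on a toy table. This is only an illustrative sketch: the table `toy` and its columns `grp` and `val` are made up for the example and are not part of the question's data.

```r
library(data.table)

# Made-up data: two groups with three and two rows
toy <- data.table(grp = c("a", "a", "a", "b", "b"),
                  val = c(10, 20, 30, 40, 50))

# Per group: the number of rows, the first value and the last value
res <- toy[, .(n_rows = .N,
               first_val = val[1],
               last_val = val[.N]),
           by = grp]
```

Note that val[.N] is the last value in row order, so the data should be sorted by time first if "last" is meant chronologically.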

A similar operation can be done with dplyr:

library(dplyr)

newdf <- bind_rows(dflist)

newdf %>% group_by(retweet_screen_name) %>% 
  summarise(screen_name = last(screen_name),
            length = n(),
            life_in_seconds = difftime(last(tweet_created_at), first(tweet_created_at), units="secs"))
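
For completeness, the same summary can probably also be built in base R, closer to the approach in the question: return a one-row data frame per list element and bind the rows afterwards. This is a sketch under assumptions, not the method used above: `summarise_cascade` is a hypothetical helper, and `cascades` is a shortened stand-in containing only the first and last row of each example cascade.

```r
# Shortened stand-in data: first and last tweet of each example cascade
cascades <- list(
  data.frame(
    screen_name = c("NerSeref", "Hejinilim"),
    tweet_created_at = as.POSIXct(c("2015-05-25 11:44:24",
                                    "2015-05-25 12:21:13"), tz = "UTC"),
    retweet_screen_name = "Lasthowen",
    stringsAsFactors = FALSE
  ),
  data.frame(
    screen_name = c("isa_sakar", "Apolloniuss_58"),
    tweet_created_at = as.POSIXct(c("2015-06-06 15:22:18",
                                    "2015-06-06 15:24:47"), tz = "UTC"),
    retweet_screen_name = "cavurizmir",
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: summarise one cascade as a one-row data frame
summarise_cascade <- function(df) {
  df <- df[order(df$tweet_created_at), ]  # put rows in chronological order
  n <- nrow(df)
  data.frame(
    retweet_screen_name = as.character(df$retweet_screen_name[1]),
    screen_name = as.character(df$screen_name[n]),
    length = n,
    life_in_seconds = as.numeric(difftime(df$tweet_created_at[n],
                                          df$tweet_created_at[1],
                                          units = "secs")),
    stringsAsFactors = FALSE
  )
}

# One row per list element, bound into a single data frame
summary_df <- do.call(rbind, lapply(cascades, summarise_cascade))
```

Returning a data frame from the helper, rather than a vector built with c(), keeps each column's type instead of coercing everything to character.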

Used data:

df1 <- structure(list(screen_name = structure(c(3L, 1L, 5L, 4L, 2L), .Label = c("DURULMA_ZAMANI", "Hejinilim", "NerSeref", "saintserif2009", "ssari75"), class = "factor"), tweet_id = c(6.028628e+17, 6.028631e+17, 6.028647e+17, 6.028672e+17, 6.028721e+17), tweet_created_at = structure(c(1432547064, 1432547132, 1432547530, 1432548108, 1432549273), class = c("POSIXct", "POSIXt"), tzone = ""), retweet_screen_name = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Lasthowen", class = "factor")), .Names = c("screen_name", "tweet_id", "tweet_created_at", "retweet_screen_name"), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(screen_name = structure(c(3L, 2L, 1L), .Label = c("Apolloniuss_58", "canfeda1923", "isa_sakar"), class = "factor"), tweet_id = c(6.072663e+17, 6.072666e+17, 6.072669e+17), tweet_created_at = structure(c(1433596938, 1433597014, 1433597087), class = c("POSIXct", "POSIXt"), tzone = ""), retweet_screen_name = structure(c(1L, 1L, 1L), .Label = "cavurizmir", class = "factor")), .Names = c("screen_name", "tweet_id", "tweet_created_at", "retweet_screen_name"), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 2
