Reputation: 351
I have a list that contains many data frames. I would like to create a summary table from the attributes of each element of this list.
head(cascade.list)
$`444424960908754944`
screen_name tweet_id tweet_created_at retweet_screen_name
NerSeref 6.028628e+17 2015-05-25 11:44:24 Lasthowen
DURULMA_ZAMANI 6.028631e+17 2015-05-25 11:45:32 Lasthowen
ssari75 6.028647e+17 2015-05-25 11:52:10 Lasthowen
saintserif2009 6.028672e+17 2015-05-25 12:01:48 Lasthowen
Hejinilim 6.028721e+17 2015-05-25 12:21:13 Lasthowen
$`407136916317171712`
screen_name tweet_id tweet_created_at retweet_screen_name
isa_sakar 6.072663e+17 2015-06-06 15:22:18 cavurizmir
canfeda1923 6.072666e+17 2015-06-06 15:23:34 cavurizmir
Apolloniuss_58 6.072669e+17 2015-06-06 15:24:47 cavurizmir
I need to create a table like this:
table
retweet_screen_name screen_name length life(seconds)
Lasthowen Hejinilim 5 2209
cavurizmir Apolloniuss_58 3 149
I used this function and it solved half of the problem:
get.summary <- function(i){
  curr.frame = cascade.list[[i]]
  return(c(unique(curr.frame$retweet_screen_name), curr.frame$screen_name[nrow(curr.frame)],
           unique(curr.frame$retweet_created_at), curr.frame$tweet_created_at[nrow(curr.frame)],
           nrow(curr.frame)))
}
and this code:
cdf=data.frame(t(sapply(1:length(cascade.list),get.summary)))
However, it creates a data frame with all the variables in the same row:
V1 V2
c("EastanbulTimes", "onuryasercan", "2010-12-20 15:18:22", "2015-05-19 18:28:25", "1") c("Lasthowen", "Apolloniuss_58", "2013-12-01 08:19:39", "2015-06-06 15:24:47", "3")
I need to fix the data frame structure: it should have 6 columns and as many rows as the list has elements. I also need to add the time variable (life in seconds).
Thanks in advance for any advice.
Upvotes: 1
Views: 1136
Reputation: 83215
Because cascade.list is a list of dataframes with the same columns, you can bind them together into one dataset and then perform the aggregation you need. An implementation with data.table:
# make a list of the dataframes (see below for the used dataframes)
dflist <- list(df1,df2)
# bind the dataframes together into one datatable (which is an enhanced dataframe)
library(data.table)
DT <- rbindlist(dflist)
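Since the list in the question is named by cascade id, it may be useful to keep those names while binding. A minimal sketch, assuming a named list shaped like cascade.list; the names cascades, DT_id and cascade_id are made up for this illustration, and the two small data frames are stand-ins rather than the real data:

```r
library(data.table)

# Stand-in data frames; the list names play the role of the cascade ids
cascades <- list(
  `444424960908754944` = data.frame(screen_name = c("NerSeref", "Hejinilim"),
                                    retweet_screen_name = "Lasthowen"),
  `407136916317171712` = data.frame(screen_name = "isa_sakar",
                                    retweet_screen_name = "cavurizmir")
)

# idcol adds a column holding the name of the list element each row came from
DT_id <- rbindlist(cascades, idcol = "cascade_id")
```

The extra cascade_id column could then be used in by = if two cascades happen to share the same retweet_screen_name.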
With the resulting datatable you can now perform the required summarisation as follows:
DT[, .(screen_name = screen_name[.N],
       length = .N,
       life_in_seconds = difftime(tweet_created_at[.N], tweet_created_at[1], units = "secs")),
   by = .(retweet_screen_name)]
which results in:
retweet_screen_name screen_name length life_in_seconds
1: Lasthowen Hejinilim 5 2209 secs
2: cavurizmir Apolloniuss_58 3 149 secs
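Note that life_in_seconds is a difftime column, which is why it prints with "secs". If a plain number is preferred, as in the table requested in the question, the difftime can be converted with as.numeric. A small base R sketch using the first and last timestamps of the Lasthowen group from the data listed at the end of this answer (the variable names are only for illustration, and tz = "UTC" is an assumption that does not affect the difference):

```r
# First and last tweet_created_at of the Lasthowen group
first_tweet <- as.POSIXct(1432547064, origin = "1970-01-01", tz = "UTC")
last_tweet  <- as.POSIXct(1432549273, origin = "1970-01-01", tz = "UTC")

life     <- difftime(last_tweet, first_tweet, units = "secs")  # difftime object
life_num <- as.numeric(life, units = "secs")                   # plain number of seconds
```

The same as.numeric(...) wrapper can be placed around the difftime call inside the summarisation.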
Explanation:
.N is a special data.table symbol which gives you the total number of rows in a group (or in the whole data.table when no grouping is used).
screen_name[.N] will give you the last screen_name because it is indexed with the total number of rows and thus gives you the last observation of each group. Likewise, screen_name[1] would give you the first observation in each group.
difftime more or less speaks for itself. With units you can specify how the time difference is expressed. See ?difftime for the possibilities.
With by = you can specify which columns should be used for determining the grouping of the data.
A similar operation can be done with dplyr:
library(dplyr)
newdf <- bind_rows(dflist)
newdf %>%
  group_by(retweet_screen_name) %>%
  summarise(screen_name = last(screen_name),
            length = n(),
            life_in_seconds = difftime(last(tweet_created_at), first(tweet_created_at), units = "secs"))
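For completeness, the approach from the question can probably also be repaired in base R. The likely reason it produced a malformed result is that c(...) forces all values into a single vector of one type; returning a one-row data frame per list element and stacking the rows keeps each column's type. A minimal sketch, assuming each element has the columns shown in the question and is already ordered by time; get_summary_row and cascade_list are hypothetical names, and the stand-in data uses only the first and last row of each cascade, so length is 2 here rather than 5 and 3:

```r
# Stand-in for cascade.list, built from a subset of the rows shown in the question
cascade_list <- list(
  data.frame(screen_name = c("NerSeref", "Hejinilim"),
             tweet_created_at = as.POSIXct(c(1432547064, 1432549273),
                                           origin = "1970-01-01", tz = "UTC"),
             retweet_screen_name = "Lasthowen",
             stringsAsFactors = FALSE),
  data.frame(screen_name = c("isa_sakar", "Apolloniuss_58"),
             tweet_created_at = as.POSIXct(c(1433596938, 1433597087),
                                           origin = "1970-01-01", tz = "UTC"),
             retweet_screen_name = "cavurizmir",
             stringsAsFactors = FALSE)
)

# Hypothetical helper: summarise one cascade as a one-row data frame
get_summary_row <- function(df) {
  n <- nrow(df)
  data.frame(
    retweet_screen_name = df$retweet_screen_name[1],
    screen_name         = df$screen_name[n],
    length              = n,
    life_in_seconds     = as.numeric(difftime(df$tweet_created_at[n],
                                              df$tweet_created_at[1],
                                              units = "secs")),
    stringsAsFactors = FALSE
  )
}

# One row per list element, stacked into a single data frame
cdf <- do.call(rbind, lapply(cascade_list, get_summary_row))
```

This yields one row per list element, which is the structure the question asks for; further columns can be added to the data.frame(...) call in the helper.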
Used data:
df1 <- structure(list(screen_name = structure(c(3L, 1L, 5L, 4L, 2L), .Label = c("DURULMA_ZAMANI", "Hejinilim", "NerSeref", "saintserif2009", "ssari75"), class = "factor"), tweet_id = c(6.028628e+17, 6.028631e+17, 6.028647e+17, 6.028672e+17, 6.028721e+17), tweet_created_at = structure(c(1432547064, 1432547132, 1432547530, 1432548108, 1432549273), class = c("POSIXct", "POSIXt"), tzone = ""), retweet_screen_name = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Lasthowen", class = "factor")), .Names = c("screen_name", "tweet_id", "tweet_created_at", "retweet_screen_name"), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(screen_name = structure(c(3L, 2L, 1L), .Label = c("Apolloniuss_58", "canfeda1923", "isa_sakar"), class = "factor"), tweet_id = c(6.072663e+17, 6.072666e+17, 6.072669e+17), tweet_created_at = structure(c(1433596938, 1433597014, 1433597087), class = c("POSIXct", "POSIXt"), tzone = ""), retweet_screen_name = structure(c(1L, 1L, 1L), .Label = "cavurizmir", class = "factor")), .Names = c("screen_name", "tweet_id", "tweet_created_at", "retweet_screen_name"), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 2