Reputation: 8247
I am scraping a website whose pages each return 10 observations. I am writing a function that takes the number of pages to scrape, stores each page in a list, and finally converts the list into a data frame.
library(jsonlite)
forum_data_fetch <- function(no_of_pages) {
  pages <- seq(no_of_pages)
  #print(pages)
  forum_data <- list()
  for(i in 1:length(pages)){
    tmp <- fromJSON(paste("http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=",i,sep=""))
    forum_data[[i]] <- tmp
  }
  dat <- as.data.frame(forum_data)
  dat <- dat[,c("msg_id","border_msg_count","user_id","border_level_text","follower_count", "topic", "tp_sector","tp_msg_count","heading", "flag", "price", "message")]
  return(dat)
}
test <- forum_data_fetch(3)
The function above should return 30 observations, but it returns only 10. I think I am doing something wrong when converting the list to a data.frame.
Upvotes: 1
Views: 3036
Reputation: 7164
Instead of adding new rows to the existing columns, as.data.frame(forum_data) binds the pages side by side, so each page's columns are added as new variables (hence 219 = 3 x 73 variables and still only 10 observations). Use do.call(rbind, forum_data) instead:
dat1 <- as.data.frame(forum_data)
str(dat1)
# 'data.frame': 10 obs. of 219 variables:
# $ TOTAL_MSG_CNT : int 50000 NA NA NA NA NA NA NA NA NA
# $ msg_id : chr "47754017" "47754014" "47751119" "47746189" ...
# $ user_id : chr "rajeshatharv" "bullbuffet" "csr93" "sanjiv3312" ...
# ....
dat2 <- do.call(rbind, forum_data)
str(dat2)
# 'data.frame': 30 obs. of 73 variables:
# $ TOTAL_MSG_CNT : int 50000 NA NA NA NA NA NA NA NA NA ...
# $ msg_id : chr "47754017" "47754014" "47751119" "47746189" ...
# $ user_id : chr "rajeshatharv" "bullbuffet" "csr93" "sanjiv3312" ...
# ....
Then just select the columns you want to work with.
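The difference can be reproduced offline with a small sketch. The pages here are made-up stand-ins (the helper make_page and its three columns are assumptions for illustration, not the real forum fields), but the two conversions behave the same way as on the scraped data:

```r
# Hypothetical stand-in for one scraped page: 10 rows of made-up data
make_page <- function(page) {
  data.frame(
    msg_id  = as.character(page * 100 + 1:10),
    user_id = paste0("user", 1:10),
    message = paste("message", 1:10, "from page", page),
    stringsAsFactors = FALSE
  )
}

# A list of 3 pages, mirroring what the loop builds in forum_data
forum_data <- lapply(1:3, make_page)

# as.data.frame() places the pages side by side: rows stay at 10
wide <- as.data.frame(forum_data)
dim(wide)

# do.call(rbind, ...) stacks the pages on top of each other
long <- do.call(rbind, forum_data)
dim(long)

# Afterwards, keep only the columns of interest
selected <- long[, c("msg_id", "user_id")]
```

Here wide has 10 rows and 9 columns, while long has 30 rows and 3 columns.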
Upvotes: 1
Reputation: 2448
Here is a version that works. It converts each page to a data frame and stacks the pages with rbindlist from data.table:
library(jsonlite)    # fromJSON
library(data.table)  # rbindlist
library(dplyr)       # %>%

forum_data_fetch <- function(no_of_pages) {
  forum_data <- list()
  # Fetch each page and keep it as one list element
  for (i in seq_len(no_of_pages)) {
    tmp <- fromJSON(paste("http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=", i, sep = ""))
    forum_data[[i]] <- tmp
  }
  cat("the length of forum_data is", length(forum_data), "\n")
  # Convert every page to a data frame, then stack the pages row-wise
  dat <- lapply(forum_data, as.data.frame) %>% rbindlist
  dat <- dat[, c("msg_id", "border_msg_count", "user_id", "border_level_text", "follower_count", "topic", "tp_sector", "tp_msg_count", "heading", "flag", "price", "message")]
  return(dat)
}
test <- forum_data_fetch(3)
dim(test)
The console output looks like this:
> test <- forum_data_fetch(3)
the length of forum_data is 3
> dim(test)
[1] 30 12
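To see what rbindlist does without hitting the site, here is a minimal sketch on made-up pages (the toy columns are assumptions for illustration). It also uses the idcol argument of rbindlist, which is not part of the function above, to record which page each row came from:

```r
library(data.table)

# Hypothetical stand-in for 3 scraped pages of 10 rows each
pages <- lapply(1:3, function(page) {
  data.frame(
    msg_id  = as.character(page * 100 + 1:10),
    message = paste("message", 1:10),
    stringsAsFactors = FALSE
  )
})

# Stack the pages row-wise; idcol adds a column holding the list index
stacked <- rbindlist(pages, idcol = "page")
dim(stacked)

# Count the rows contributed by each page
rows_per_page <- stacked[, .N, by = page]
rows_per_page
```

The result has 30 rows, and the per-page count is a quick check that no page was dropped or duplicated.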
Upvotes: 1