Reputation: 109
I am doing text mining for my recent research.
This is my R code:
library(XML)    # htmlParse, xpathSApply, xmlAttrs, xmlValue
library(RCurl)  # getURL

# Collect the topic links from list pages 0 to 8
data <- list()
for (i in 0:8) {
  tmp <- paste('&page=', i, sep = '')
  url <- paste('http://bbs.cyut.edu.tw/TopicClassList.aspx?ClassID=5', tmp, sep = '')
  html <- htmlParse(getURL(url))
  url.list <- xpathSApply(html, "//table/tr[@style='height: 30px; font-size: small']/td/a[@href]", xmlAttrs)
  url.list <- url.list[-2, ]
  data <- rbind(data, paste('http://bbs.cyut.edu.tw/', url.list, sep = ''))
}
data <- unlist(data)
getwd()
setwd("C:/Users/user/Documents/doc4")
content_list <- list()
url_temp <- strsplit(data, '=')
id_list <- list()
for (i in 1:length(url_temp)) {
  id_list[[i]] <- url_temp[[i]][2]
}
getdoc <- function(line) {
  for (i in 1:length(id_list)) {
    start <- regexpr('bbs', line)[1]
    end <- regexpr(id_list[i], line)[1]
    if (start != -1 & end != -1) {
      url <- substr(line, start, end + 3)
      html <- htmlParse(getURL(url), encoding = 'UTF-8')
      doc <- xpathSApply(html, "//span", xmlValue)
      name <- strsplit(url, '/')[[1]][3]
      content_list[[i]] <- doc
      write(doc, paste0(name, ".txt"))
    }
  }
}
sapply(data, getdoc)
The url_temp variable holds all the URLs, and I try to put the ID from each URL into the variable id_list.
But content_list does not contain all the content. Where is the error?
How do I fix it?
Upvotes: 1
Views: 37
Reputation: 109
I have resolved this.
For everyone's reference, here is my code:
content_list <- list()
url_temp <- strsplit(data, '=')
id_list <- list()
for (i in 1:length(url_temp)) {
  id_list[[i]] <- url_temp[[i]][2]
}
getdoc <- function(line) {
  for (i in 1:length(id_list)) {
    start <- regexpr('bbs', line)[1]
    end <- regexpr(id_list[i], line)[1]
    if (start != -1 & end != -1) {
      url <- substr(line, start, end + 3)
      html <- htmlParse(getURL(url), encoding = 'UTF-8')
      doc <- xpathSApply(html, "//span", xmlValue)
      name <- strsplit(url, '/')[[1]][3]
      content_list[[i]] <- doc
      lapply(content_list, write, "corpus.txt", append = TRUE, ncolumns = 10000)
    }
  }
}
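A likely reason the global content_list stayed empty in the question is R's scoping: an assignment such as content_list[[i]] <- doc inside a function changes only a local copy, so the version above works by writing to corpus.txt rather than by filling the list. The sketch below illustrates this under stated assumptions. fetch_doc and the example URLs are hypothetical stand-ins for the htmlParse(getURL(url)) step so that it runs offline, and returning the value through lapply() is only one possible way to keep the content in memory.

```r
# Hypothetical stand-in for the download/parse step, so the sketch runs offline.
fetch_doc <- function(url) paste("content of", url)

# Made-up URLs, for illustration only.
urls <- c("bbs.example/a?id=1", "bbs.example/b?id=2")

# 1) Assigning inside a function modifies a local copy of content_list.
content_list <- list()
store_local <- function(i) {
  content_list[[i]] <- fetch_doc(urls[i])  # local copy only
  invisible(NULL)
}
for (i in seq_along(urls)) store_local(i)
n_after_local <- length(content_list)  # the global list is unchanged

# 2) Returning the value lets lapply() build the list in the calling scope.
content_list <- lapply(urls, fetch_doc)
n_after_return <- length(content_list)
```

With real pages, fetch_doc would wrap the htmlParse and xpathSApply calls from the code above, and the resulting list could then be written to a file in one pass after the loop.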
Upvotes: 1