Remi Castonguay

Reputation: 69

chunking txt files in R

all,

I'm working from Matthew Jockers's code in his book "Text Analysis with R for Students of Literature".

In it he provides code to pull all <p> tags from XML documents, chop that content into 1,000-word chunks, and apply a bunch of data-massaging tricks. Once that's done, he wraps that chunking function in a loop that produces a data matrix ready to be used in mallet. Please see the code below.

My question is: how do I do the same thing with .txt files? Obviously, text files don't have tags like <p> to work from. I'm not an experienced programmer, so go easy on me please!


library(XML) # for xmlTreeParse(), getNodeSet(), xmlValue()

chunk.size <- 1000 # number of words per chunk

makeFlexTextChunks <- function(doc.object, chunk.size = 1000, percentage = TRUE){
  # gather the text of every <p> node in the TEI body
  paras <- getNodeSet(doc.object,
                      "/d:TEI/d:text/d:body//d:p",
                      c(d = "http://www.tei-c.org/ns/1.0"))
  words <- paste(sapply(paras, xmlValue), collapse = " ")
  words.lower <- tolower(words)
  words.lower <- gsub("[^[:alnum:][:space:]']", " ", words.lower)
  words.l <- strsplit(words.lower, "\\s+")
  word.v <- unlist(words.l)
  x <- seq_along(word.v)
  if(percentage){
    max.length <- length(word.v)/chunk.size
    chunks.l <- split(word.v, ceiling(x/max.length))
  } else {
    chunks.l <- split(word.v, ceiling(x/chunk.size))
    # deal with a small chunk at the end: if the last chunk is no more
    # than half a chunk, fold it into the previous one
    if(length(chunks.l[[length(chunks.l)]]) <= chunk.size/2){
      chunks.l[[length(chunks.l)-1]] <-
        c(chunks.l[[length(chunks.l)-1]],
          chunks.l[[length(chunks.l)]])
      chunks.l[[length(chunks.l)]] <- NULL
    }
  }
  chunks.l <- lapply(chunks.l, paste, collapse = " ")
  chunks.df <- do.call(rbind, chunks.l)
  return(chunks.df)
}


topic.m <- NULL
for(i in 1:length(files.v)){
  doc.object <- xmlTreeParse(file.path(input.dir, files.v[i]),
                             useInternalNodes = TRUE)
  chunk.m <- makeFlexTextChunks(doc.object, chunk.size,
                                percentage = FALSE)
  textname <- gsub("\\..*", "", files.v[i])
  segments.m <- cbind(paste(textname,
                            segment = 1:nrow(chunk.m), sep = "_"), chunk.m)
  topic.m <- rbind(topic.m, segments.m)
}
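
To check my own understanding of the core idiom: split() combined with ceiling() over the word positions is what does the actual chunking. A toy example (mine, not from the book):

word.v <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
x <- seq_along(word.v)      # 1, 2, ..., 9
split(word.v, ceiling(x/4)) # groups of 4 words; the last group keeps the remainder
# $`1`: "the" "quick" "brown" "fox"
# $`2`: "jumps" "over" "the" "lazy"
# $`3`: "dog"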

Upvotes: 0

Views: 409

Answers (2)

Remi Castonguay

Reputation: 69

Thank you everybody for your help. I think I found my answer after much trial and error! The key was to read the txt files with scan(paste(input.dir, files.v[i], sep="/"), ...) in the loop rather than in the function. Please see my code here:

input.dir <- "data/plainText"
files.v <- dir(input.dir, ".*txt")
chunk.size <- 100 # number of words per chunk

makeFlexTextChunks <- function(doc.object, chunk.size = 100, percentage = TRUE){
  words.lower <- tolower(paste(doc.object, collapse = " "))
  words.lower <- gsub("[^[:alnum:][:space:]']", " ", words.lower)
  words.l <- strsplit(words.lower, "\\s+")
  word.v <- unlist(words.l)
  x <- seq_along(word.v)
  if(percentage){
    max.length <- length(word.v)/chunk.size
    chunks.l <- split(word.v, ceiling(x/max.length))
  } else {
    chunks.l <- split(word.v, ceiling(x/chunk.size))
    # deal with a small chunk at the end: if the last chunk is no more
    # than half a chunk, fold it into the previous one
    if(length(chunks.l[[length(chunks.l)]]) <= chunk.size/2){
      chunks.l[[length(chunks.l)-1]] <-
        c(chunks.l[[length(chunks.l)-1]],
          chunks.l[[length(chunks.l)]])
      chunks.l[[length(chunks.l)]] <- NULL
    }
  }
  chunks.l <- lapply(chunks.l, paste, collapse = " ")
  chunks.df <- do.call(rbind, chunks.l)
  return(chunks.df)
}

topic.m <- NULL
for(i in 1:length(files.v)){
  doc.object <- scan(paste(input.dir, files.v[i], sep = "/"),
                     what = "character", sep = "\n")
  chunk.m <- makeFlexTextChunks(doc.object, chunk.size, percentage = FALSE)
  textname <- gsub("\\..*", "", files.v[i])
  segments.m <- cbind(paste(textname, segment = 1:nrow(chunk.m), sep = "_"), chunk.m)
  topic.m <- rbind(topic.m, segments.m)
}
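
Since the matrix is destined for mallet, the last step looks roughly like this (a sketch following the book's pattern; "data/stoplist.csv" is a placeholder for your own stopword file):

library(mallet) # R wrapper around the MALLET topic-modeling toolkit
documents <- data.frame(id   = topic.m[, 1],
                        text = topic.m[, 2],
                        stringsAsFactors = FALSE)
mallet.instances <- mallet.import(documents$id, documents$text,
                                  "data/stoplist.csv", # placeholder path
                                  token.regexp = "[\\p{L}']+")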

Upvotes: 1

ekstroem

Reputation: 6191

Maybe this can point you in the right direction. The following code reads in a txt file and splits the words up into elements of a vector.

library(readr)
library(stringr)

url <- "http://www.gutenberg.org/files/98/98-0.txt"
mystring <- read_file(url)          # read the whole file into one long string
res <- str_split(mystring, "\\s+")  # split on whitespace; returns a list holding one character vector

Then you can split it into chunks of 1000 words and do your magic?
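
For that chunking step, one way is the same split()/ceiling() idiom from your question (a sketch, with res from above; res[[1]] is the word vector):

word.v <- res[[1]]                                          # str_split() returns a list
chunks.l <- split(word.v, ceiling(seq_along(word.v)/1000))  # consecutive 1000-word groups
chunk.v <- sapply(chunks.l, paste, collapse = " ")          # one string per chunk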

Upvotes: 0
