Daniel Vargas

Reputation: 1040

Splitting and tokenizing a corpus with R and quanteda

I am working on an NLP project. I need to take some blogs, news and tweets (you have probably heard of this capstone already) from .txt files and create n-gram frequencies.

I experimented with the steps needed to go from the .txt files to a frequency data frame for analysis:

Read > Convert to corpus > Clean corpus > Tokenize > Convert to dfm > Convert to df

The bottlenecks in the process were the tokenize and convert-to-dfm steps (over 5x more time than the other steps).
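For reference, my pipeline looks roughly like this in quanteda (the file path is just an example and the cleaning options are simplified):

library(readr)
library(quanteda)

txt   <- read_lines("./data/en_US.blogs.txt")                     # read
corp  <- corpus(txt)                                              # convert to corpus
toks  <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE) # clean + tokenize
d     <- dfm(toks)                                                # convert to dfm
freqs <- data.frame(feature = featnames(d), count = colSums(d))   # convert to frequency df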

I had two choices:

1. Split the cleaned corpus and tokenize it piece by piece
2. Split-read the .txt files from the beginning

Option 1 seemed best, but so far I have not found a function or package that can do this the way I want. So I will have to write a lot of code to split-read the files from the beginning in 20 chunks (due to my computing constraints).

Is there a way to split a corpus ("corpus" "list") created with the quanteda package into chunks (with the lines defined by me) so I can tokenize it and turn it into a dfm in a "streaming" kind of way?

Upvotes: 1

Views: 1396

Answers (2)

Len Greski

Reputation: 10845

Since this question hasn't been directly answered, I am reposting related content from the article I wrote in 2016 as a Community Mentor for the JHU Capstone, Capstone n-grams: how much processing power is required?

Overview

Students in the Johns Hopkins University Data Science Specialization Capstone course typically struggle with the course project because of the amount of memory consumed by the objects needed to analyze text. The question asks about the best approach for processing the 4+ million documents that are in the raw data files. The short answer to this question is that it depends on the amount of RAM on one's machine. Since R objects must reside in RAM, one must understand the amount of RAM consumed by the objects being processed.

A machine with 16 Gb of RAM is required to process all of the data from the three files without processing it in smaller chunks or processing a random sample of data. My testing indicates that the working memory needed to process the files is approximately 1.5 - 3 times the size of the object output by the quanteda::tokens_ngrams() function from quanteda version 0.99.22, and therefore a 1 Gb tokenized corpus consumes 9 Gb of RAM to generate a 4 Gb n-gram object. Note that quanteda automatically uses multiple threads if your computer has multiple cores / threads.

To help reduce the guesswork about memory utilization, here is a summary of the amount of RAM consumed by the objects required to analyze the files for the SwiftKey-sponsored capstone: predicting text.

Raw data

There are three raw data files used in the Capstone project. Once loaded into memory using a text processing function such as readLines() or readr::read_lines(), the resulting object sizes are as follows.

  1. en_US.blogs.txt: 249 Mb
  2. en_US.news.txt: 250 Mb
  3. en_US.twitter.txt: 301 Mb

These files must be joined into a single object and converted to a corpus. Together they consume about 800 Mb of RAM.

When converted to a corpus with quanteda::corpus(), the resulting object is about 1.1 Gb in size.
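These sizes can be verified with base R's object.size(); a minimal sketch for one file (the path matches the code further below):

library(readr)
library(quanteda)

blogData <- read_lines("./capstone/data/en_US.blogs.txt")
format(object.size(blogData), units = "Mb")    # approximately 249 Mb

blogCorpus <- corpus(blogData)
format(object.size(blogCorpus), units = "Mb")  # the corpus object is somewhat larger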

N-gram object sizes

To maximize the amount of RAM available for n-gram processing, once the corpus is generated, one must remove all objects from memory other than the tokenized corpus used as input to tokens_ngrams() (see the sketch after the list below). The object sizes for the various n-grams are as follows.

  1. 2-grams: 6.3 Gb
  2. 3-grams: 6.5 Gb
  3. 4-grams: 6.5 Gb
  4. 5-grams: 6.3 Gb
  5. 6-grams: 6.1 Gb
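A minimal sketch of that clean-up step, assuming the tokenized corpus is stored in an object named myTokens (a hypothetical name):

library(quanteda)

# keep only the tokens object that feeds tokens_ngrams(), then reclaim memory
rm(list = setdiff(ls(), "myTokens"))
gc()
ngram2 <- tokens_ngrams(myTokens, n = 2)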

Working with less memory

I was able to process a 25% sample of the capstone data on a MacBook Pro with 8 Gb of RAM, and a 5% sample on an HP Chromebook running Ubuntu Linux with 4 Gb of RAM.
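Drawing such a sample is straightforward; a minimal sketch, assuming allData is the combined character vector built in the code further below:

# take a 25% random sample of the combined lines before tokenizing
set.seed(123)   # arbitrary seed, for reproducibility
sampleData <- sample(allData, size = round(0.25 * length(allData)))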

Object sizes: 25% sample

  1. 2-grams: 2.0 Gb
  2. 3-grams: 2.9 Gb
  3. 4-grams: 3.6 Gb
  4. 5-grams: 3.9 Gb
  5. 6-grams: 4.0 Gb

Object sizes: 5% sample

  1. 2-grams: 492 Mb
  2. 3-grams: 649 Mb
  3. 4-grams: 740 Mb
  4. 5-grams: 747 Mb
  5. 6-grams: 733 Mb

Processing the data in smaller groups

Adding to Ken Benoit's comment on the original question, one can assign a numeric group (e.g. repeating IDs of 1 - 20 to split into 20 groups) and then use the corpus_segment() function to segment the corpus by group ID. However, this approach results in a corpus that is tagged, not physically split. A general process to generate all of the required n-grams is represented in the following pseudocode.

 split the raw data into a list of <n> groups for processing
 create a list of corpuses 
 for each corpus
     for each size n-gram 
           1. generate n-grams
           2. write to file
           3. rm() n-gram object

Code to split the corpus into a list and process one set of n-grams looks like this, once the data has been downloaded and extracted from the swiftkey.zip file.

library(readr)
library(data.table)
blogFile <- "./capstone/data/en_US.blogs.txt"
newsFile <- "./capstone/data/en_US.news.txt"
twitterFile <- "./capstone/data/en_US.twitter.txt"
blogData <- read_lines(blogFile)
newsData <- read_lines(newsFile)
twitterData <- read_lines(twitterFile) 
allData <- c(blogData,newsData,twitterData) # combined object is about 800 Mb
rm(blogData,newsData,twitterData,blogFile,newsFile,twitterFile)

# create 20 groups, and use to split original data into a list
groupId <- paste0("GROUP",sprintf("%02.0f",(1:length(allData) %% 20)+1))
split_data <- split(allData,groupId)
library(quanteda)
theTexts <- lapply(split_data,corpus)
system.time(ngram2 <- lapply(theTexts,function(x) tokens_ngrams(tokens(x),n=2))) 
head(ngram2[[1]]) 

Note that on a MacBook Pro with an Intel i7 processor it takes about 10 minutes to generate the n-grams, and the resulting output, ngram2, is a list of 20 sets of 2-grams.

...and the output for the first three texts in the first group is:

> head(ngram2[[1]])
Tokens consisting of 6 documents.
text1 :
[1] "Fallout_by"    "by_Ellen"      "Ellen_Hopkins" "Hopkins_("     "(_p"          
[6] "p_."           "._1-140"       "1-140_)"      

text2 :
 [1] "Ed_Switenky"         "Switenky_,"          ",_manager"          
 [4] "manager_of"          "of_traffic"          "traffic_engineering"
 [7] "engineering_and"     "and_operations"      "operations_,"       
[10] ",_couldn't"          "couldn't_comment"    "comment_on"         
[ ... and 21 more ]

text3 :
[1] "the_autumn"       "autumn_rains"     "rains_in"         "in_righteousness"
[5] "righteousness_." 

Additional code to write the files to disk to conserve memory, as well as code to clean the data before processing and consolidate the n-grams into frequency tables, is left as work for the student.
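One possible shape for the write-to-disk step, sketched here only as a starting point (the output path is hypothetical), is to build a frequency table for each group from a dfm and write it out before removing the in-memory objects:

library(quanteda)
library(data.table)

# frequency table for the first group's 2-grams
dfm2_g1  <- dfm(ngram2[[1]])
freqs_g1 <- data.table(ngram = featnames(dfm2_g1), count = colSums(dfm2_g1))
fwrite(freqs_g1, "./capstone/output/ngram2_group01.csv")  # hypothetical output path
rm(dfm2_g1, freqs_g1); gc()                               # free memory before the next group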

Upvotes: 1

Phi

Reputation: 414

I think the package you will find most useful at the moment is the tm package. It is a pretty complex but thorough package, even though it is still in an experimental state at version 0.7.1. Without more detail I can't give you more exact usage information, because it all depends on your sources, how you want to process the corpus, and other factors. The gist of what you'll need to do is first create a reader object that depends on your source material; it can handle web input, plain text, PDF and other formats. Then you can use one of the Corpus creation functions, depending on whether you want to keep the whole thing in memory, etc. You can then use the various 'tidying' functions to operate on the entire corpus as though each document were an element in a vector, and you can do the same with tokenizing. With a few more specifics we can give you a more specific answer.
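A bare-bones version of that workflow, assuming plain-text files in a local directory (the path and cleaning steps are only examples), might look like this:

library(tm)

# build a corpus from plain-text files in a directory
corp <- VCorpus(DirSource("./capstone/data", pattern = "\\.txt$"),
                readerControl = list(reader = readPlain, language = "en"))

# 'tidying' functions applied across the whole corpus
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)

# term counts for the whole corpus
dtm <- DocumentTermMatrix(corp)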

Upvotes: 0
