Reputation: 869
I am trying to generate some frequencies and a single corpus for an NLP project and am running into an issue with the tm package. My sample data came from a blog feed at the following link:
# specify the source and destination of the download
destination_file <- "Coursera-SwiftKey.zip"
source_file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# load the libraries
library(tm)
library(RWeka)
library(dplyr)
library(magrittr)
# load the sample data
load("sample_data.RData")
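# n-gram tokenizers
# (sizes are written as literals: a shared global n is looked up at call time,
# so both functions would use whatever value n holds last)
bigram_token <- function(x) NGramTokenizer(x, Weka_control(min = 2L, max = 2L))
trigram_token <- function(x) NGramTokenizer(x, Weka_control(min = 3L, max = 3L))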
# check length function
length_is <- function(n) function(x) length(x) == n
# construct single corpus from sample data
vc_blogs <-
sample_blogs %>%
data.frame() %>%
DataframeSource() %>%
VCorpus %>%
tm_map( stripWhitespace )
I get the following error:
Error in DataframeSource(.) :
all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
Is there a fix or a work-around to process the piece of code successfully?
Upvotes: 1
Views: 46
Reputation: 887571
According to ?DataframeSource
A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a UTF-8 encoded string representing the document's content. Optional additional columns are used as document level metadata.
In the OP's example, the data frame has only a single column, and that column is not named accordingly, so DataframeSource() fails its check for the "doc_id" and "text" columns.
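One way to satisfy that requirement is to build the data frame with the two expected columns before passing it to DataframeSource(). This is only a minimal sketch: it assumes sample_blogs is a character vector with one document per element (the question does not show its structure), and the two-element vector and the "blog_" ID prefix below are made-up stand-ins.

```r
library(tm)

# Hypothetical stand-in for the OP's sample_blogs (assumed to be a character vector)
sample_blogs <- c("First  blog   post.", "Second blog    post.")

# DataframeSource() expects the first column to be doc_id and the second to be text
blogs_df <- data.frame(
  doc_id = paste0("blog_", seq_along(sample_blogs)),  # unique ID per document
  text = sample_blogs,
  stringsAsFactors = FALSE
)

# Build the corpus and collapse repeated whitespace, as in the original pipeline
vc_blogs <- VCorpus(DataframeSource(blogs_df))
vc_blogs <- tm_map(vc_blogs, stripWhitespace)

# Inspect the cleaned text of the first document
content(vc_blogs[[1]])
```

If the document IDs do not matter, VCorpus(VectorSource(sample_blogs)) skips the data frame step altogether.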
Upvotes: 1