Reputation: 353
I am a beginner in R and text mining, currently using the tm package.
I am trying to combine the texts of two different documents in a corpus. When I use a statement like
c(corpus.doc[[1]],corpus.doc[[2]])
or the paste statement
paste(corpus.doc[[1]],corpus.doc[[2]])
the texts are combined line by line instead of one after the other.
For example: if
> corpus.doc[[1]]
He visits very often
and
sometimes more
> corpus.doc[[2]]
She also
stays
What I get with these statements is something like
He visits very often She also
and stays
sometimes more
How can I prevent that and instead get
He visits very often
and
sometimes more
She also
stays
Or is there an easy way to combine documents in the R tm package? Thank you in advance!
Additional info
When I use
a <- c( corpus.doc[[1]], corpus.doc[[2]], recursive=TRUE)
a becomes a corpus with two documents, so the texts of these documents are still not combined. I would like
a[[1]]
gives me the combined text of corpus.doc[[1]] and corpus.doc[[2]].
str(corpus.doc)
shows something like this:
List of 4270
 $ CREC-2011-01-05-pt1-PgE1-2.htm :Classes 'PlainTextDocument', 'TextDocument', 'character' atomic [1:74] html head titlecongression record volume issue head ...
 .. ..- attr(*, "Author")= chr(0)
 .. ..- attr(*, "DateTimeStamp")= POSIXlt[1:1], format: "2009-01-17 15:45:25"
 .. ..- attr(*, "Description")= chr(0)
 .. ..- attr(*, "Heading")= chr(0)
 .. ..- attr(*, "ID")= chr "CREC-2011-01-05-pt1-PgE1-2.htm"
And it keeps going on...
Upvotes: 5
Views: 3315
Reputation: 42293
Further to my comment, how about combining your plain text documents in R before creating the corpus? For example, if 1.txt, 2.txt and 3.txt are plain text files, you can read them into R like so:
# readLines() accepts a path directly and opens and closes the file itself
a <- readLines("C:/Users/X/Desktop/1.txt")
b <- readLines("C:/Users/X/Desktop/2.txt")
c <- readLines("C:/Users/X/Desktop/3.txt")
and then you could combine them, similar to your example above
abc <- c(a, b, c)
That will stack the documents up in order and preserve line-by-line format in a single data object, as you request. However, if you then make this into a corpus with
abc.corpus <- Corpus(VectorSource(abc)) # not what you want
then you'll get a corpus with as many documents as lines, which doesn't sound like what you want. Instead what you need to do is combine the text objects like this
abc.paste <- paste(c(a, b, c), collapse=' ') # stack first, then collapse: this is what you want
so that the resulting abc.paste object is a single character string. Then when you make a corpus using
abc.corpus <- Corpus(VectorSource(abc.paste))
the result will be "A corpus with 1 text document", which you can then analyse with functions in the tm package.
It should be straightforward to extend this into a function to efficiently concatenate your 7000+ plain text documents and then make a corpus from the resulting data object. Does that get you any closer to what you want to do?
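As a rough illustration of that extension, here is a minimal sketch in base R. The helper name combine_text_files and the demo directory are hypothetical (they are not part of tm), and the sketch assumes all documents are .txt files in a single folder:

```r
# Hypothetical helper (not part of tm): read every .txt file in a directory
# and collapse all of their lines, in file-name order, into one string.
combine_text_files <- function(dir, pattern = "\\.txt$") {
  paths <- sort(list.files(dir, pattern = pattern, full.names = TRUE))
  # One character vector of lines per file
  lines_per_file <- lapply(paths, readLines, warn = FALSE)
  # Stack the files end to end, then join everything with single spaces
  paste(unlist(lines_per_file), collapse = " ")
}

# Small demonstration using temporary files (made-up content)
demo_dir <- file.path(tempdir(), "combine_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("He visits very often", "and", "sometimes more"),
           file.path(demo_dir, "1.txt"))
writeLines(c("She also", "stays"), file.path(demo_dir, "2.txt"))

combined <- combine_text_files(demo_dir)
```

The resulting string could then be passed to Corpus(VectorSource(combined)) as above. Note that file names are sorted as text, so 10.txt comes before 2.txt; if the order matters, the paths would need to be sorted differently.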
Upvotes: 1
Reputation: 263372
The help in pkg:tm says there is a c.Corpus function whose 'recursive' argument defaults to FALSE but, if set to TRUE, may result in an "intelligent" merger. If you think corpus.doc is a list of Corpus-class objects, you might try:
c( corpus.doc[[1]], corpus.doc[[2]], recursive=TRUE)
... but it is not clear that you really do have "Corpus"-class objects.
str(corpus.doc) # see above
So the first element in that very long list is not a Corpus-class object, but rather a PlainTextDocument.
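Since the str output suggests each element is essentially a character vector with extra attributes, the behaviour can be sketched with plain character vectors. This is only an illustration in base R: doc1 and doc2 are made-up stand-ins for the two documents, and whether as.character(corpus.doc[[1]]) yields such a vector in your tm version is an assumption worth checking.

```r
# Made-up stand-ins for two documents, one element per line
doc1 <- c("He visits very often", "and", "sometimes more")
doc2 <- c("She also", "stays")

# paste() works element by element and recycles the shorter vector,
# so lines from the two documents end up interleaved
interleaved <- paste(doc1, doc2)

# c() on plain character vectors appends one to the other,
# keeping each line intact and in order
stacked <- c(doc1, doc2)

# Collapsing the stacked vector gives one string for the whole text
single <- paste(stacked, collapse = " ")
```

Here stacked has five elements in the order asked for in the question, while interleaved has three merged lines.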
Upvotes: 2