appletree

Reputation: 353

R text mining - Combining paragraphs one after the other without sentences mixing up

I am a beginner in R and text mining, currently using the tm package.

I am trying to join the texts of two different documents in a corpus. When I use a statement like

 c(corpus.doc[[1]],corpus.doc[[2]]) 

or the paste statement

  paste(corpus.doc[[1]],corpus.doc[[2]]) 

the texts are combined line by line rather than one after the other.

For example: if

> corpus.doc[[1]] 

He visits very often 
and 
sometimes more

> corpus.doc[[2]]

She also 
stays

What I get with these statements is something like

He visits very often She also
and stays
sometimes more

How can I prevent that and instead get

He visits very often
and 
sometimes more
She also 
stays

Or is there an easy way to combine documents in the R tm package? Thank you in advance!


Additional info


When I use
a <- c( corpus.doc[[1]], corpus.doc[[2]], recursive=TRUE)

the result a is a corpus with two documents, so the texts of the two documents are still not combined. I would like

a[[1]] 

to give me the combined text of corpus.doc[[1]] and corpus.doc[[2]].

str(corpus.doc)

shows something like this:

 List of 4270
 $ CREC-2011-01-05-pt1-PgE1-2.htm   :Classes 'PlainTextDocument', 'TextDocument', 'character'  atomic [1:74] html head titlecongression record volume  issue  head  ...
 .. ..- attr(*, "Author")= chr(0) 
 .. ..- attr(*, "DateTimeStamp")= POSIXlt[1:1], format: "2009-01-17 15:45:25"
 .. ..- attr(*, "Description")= chr(0) 
 .. ..- attr(*, "Heading")= chr(0) 
 .. ..- attr(*, "ID")= chr "CREC-2011-01-05-pt1-PgE1-2.htm"

And it keeps going on...

Upvotes: 5

Views: 3315

Answers (2)

Ben

Reputation: 42293

Further to my comment, how about if you combine your plain text documents in R before creating the corpus? For example, if 1.txt, 2.txt and 3.txt are plain text files, you can read them into R like so

a <- readLines("C:/Users/X/Desktop/1.txt")  # one element per line of the file
b <- readLines("C:/Users/X/Desktop/2.txt")
c <- readLines("C:/Users/X/Desktop/3.txt")

and then you could combine them, similar to your example above

abc <- c(a, b, c)

That will stack the documents up in order and preserve line-by-line format in a single data object, as you request. However, if you then make this into a corpus with

abc.corpus <- Corpus(VectorSource(abc)) # not what you want

then you'll get a corpus with as many documents as lines, which doesn't sound like what you want. Instead what you need to do is combine the text objects like this

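abc.paste <- paste(c(a, b, c), collapse = ' ') # this is what you want

Note that the vectors are stacked with c() first and only then collapsed. Calling paste(a, b, c, collapse = ' ') would paste the lines element by element and mix the documents, which is the problem you describe. The resulting abc.paste object is a single character string. Then when you make a corpus using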

abc.corpus <- Corpus(VectorSource(abc.paste))

the result will be "A corpus with 1 text document", which you can then analyse with functions in the tm package.
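To see why the order of operations matters, here is a small base-R sketch using the example lines from the question. The vectors doc1 and doc2 are stand-ins I made up for the two documents; no tm functions are involved.

```r
# Stand-ins for the two documents in the question (one element per line)
doc1 <- c("He visits very often", "and", "sometimes more")
doc2 <- c("She also", "stays")

# paste() works element by element and recycles the shorter vector,
# so lines from the two documents end up mixed together
mixed <- paste(doc1, doc2)

# c() appends doc2 after doc1 and keeps one element per line
stacked <- c(doc1, doc2)

# Collapsing the stacked vector gives one string with the lines in order
single <- paste(stacked, collapse = " ")

print(mixed)
print(stacked)
print(single)
```

Here stacked keeps the line-by-line layout asked for in the question, while single is the one-element form that VectorSource() would turn into a single document.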

It should be straightforward to extend this into a function to efficiently concatenate your 7000+ plain text documents and then make a corpus from the resulting data object. Does that get you any closer to what you want to do?
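A rough sketch of such a function, in base R only. The name combine_files and the temporary files are hypothetical and only there to make the example self-contained; with real data you would pass your own file paths.

```r
# Hypothetical helper: read every file in `paths` and return a single
# string containing all lines, in file order, separated by `sep`
combine_files <- function(paths, sep = " ") {
  lines <- unlist(lapply(paths, readLines), use.names = FALSE)
  paste(lines, collapse = sep)
}

# Two temporary files standing in for real documents
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("He visits very often", "and", "sometimes more"), f1)
writeLines(c("She also", "stays"), f2)

combined <- combine_files(c(f1, f2))
print(combined)

# With tm loaded, a one-document corpus could then be built with
# Corpus(VectorSource(combined))

# Clean up the temporary files
unlink(c(f1, f2))
```

Using sep = "\n" instead of a space would keep the original line breaks inside the single string, if that matters for later processing.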

Upvotes: 1

IRTFM

Reputation: 263372

The help in pkg:tm says there is a c.Corpus function whose default setting for 'recursive' is FALSE, but which, if set to TRUE, may result in an "intelligent" merger. If you think corpus.doc is a list of Corpus-class objects, you might try:

c( corpus.doc[[1]], corpus.doc[[2]], recursive=TRUE)

... but it is not clear that you really do have "Corpus"-class objects.

str(corpus.doc)   # see above

So the first element in that very long list is not a Corpus-classed object, but rather a PlainTextDocument.
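The class check can also be done directly with inherits(). The sketch below does not use tm: make_doc is a hypothetical helper that builds a stand-in object with the same class vector as in the str() output, on the assumption that each document is a character vector underneath (as the 'character' class there suggests).

```r
# Hypothetical stand-in for one element of corpus.doc: a character vector
# carrying the classes shown in the str() output
make_doc <- function(lines) {
  structure(lines, class = c("PlainTextDocument", "TextDocument", "character"))
}
doc1 <- make_doc(c("He visits very often", "and", "sometimes more"))
doc2 <- make_doc(c("She also", "stays"))

# The elements are documents, not corpora
is_corpus <- inherits(doc1, "Corpus")
is_plain  <- inherits(doc1, "PlainTextDocument")
print(c(is_corpus = is_corpus, is_plain = is_plain))

# Dropping the classes leaves plain character vectors, which c() stacks
# one after the other without mixing lines
combined <- c(as.character(unclass(doc1)), as.character(unclass(doc2)))
print(combined)
```

Whether the same approach works on real tm documents depends on the tm version, since the internal representation of a document is not guaranteed to stay a bare character vector.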

Upvotes: 2
