Reputation: 5856
I am trying to process a text file. Overall, I have a corpus that I would like to analyze. To create a Corpus object with the tm package (a text mining package in R), I need to turn this paragraph into one gigantic vector so that it is read properly.
I have a paragraph:
Commercial exploitation over the past two hundred years drove
the great Mysticete whales to near extinction. Variation in
the sizes of populations prior to exploitation, minimal
population size during exploitation and current population
sizes permit analyses of the effects of differing levels of
exploitation on species with different biogeographical
distributions and life-history characteristics.
I've used both scan and readLines, and each one processes the text line by line, like this:
[28] " commercial exploitation over the past two hundred years drove "
[29] " the great mysticete whales to near extinction variation in "
[30] " the sizes of populations prior to exploitation minimal "
Is there a way to get rid of the line breaks? Or to read the text file as one gigantic vector?
All of the solutions posted so far have been great, thank you.
Upvotes: 4
Views: 6895
Reputation: 99341
If a lot of processing has to be done on the file, it may take a long time to read. You may consider reading it in unchanged and then making the changes. The stringi package has a function for this particular operation, and its authors write in C, so their functions are nice and fast.
So, assuming you've read in the file and named it txt:
library(stringi)
stri_flatten(txt)
# [1] " Commercial exploitation over the past two hundred years drove \n the great Mysticete whales to near extinction. Variation in \n the sizes of populations prior to exploitation, minimal \n population size during exploitation and current population \n sizes permit analyses of the effects of differing levels of \n exploitation on species with different biogeographical \n distributions and life-history characteristics."
The string is still in the same format, only flattened. To check that, we can print it with cat:
cat(stri_flatten(txt))
Commercial exploitation over the past two hundred years drove
the great Mysticete whales to near extinction. Variation in
the sizes of populations prior to exploitation, minimal
population size during exploitation and current population
sizes permit analyses of the effects of differing levels of
exploitation on species with different biogeographical
distributions and life-history characteristics.
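If the goal is to drop the line breaks rather than keep them, one option is to trim each line and pass a separator to stri_flatten through its collapse argument. The following is only a sketch: it assumes stringi is installed, and the two-element txt vector is a made-up stand-in for lines already read with readLines.

```r
library(stringi)  # assumes the stringi package is installed

# Hypothetical stand-in for lines already read in with readLines()
txt <- c(" Commercial exploitation over the past two hundred years drove ",
         " the great Mysticete whales to near extinction. ")

# Trim leading/trailing whitespace on each line, then join with one space
flat <- stri_flatten(stri_trim_both(txt), collapse = " ")

length(flat)  # a single string
```

With collapse = " ", the pieces are joined by a single space, so the result should contain no "\n" characters.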
Upvotes: 4
Reputation: 4767
This will read the entire file into a length-one character vector, with the line breaks kept as embedded "\n" characters:
x <- readChar(file, file.info(file)$size)
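As a self-contained sketch of the same idea, the snippet below writes a small temporary file (the path and the sample text are made up for illustration), reads it back with readChar, and then replaces the embedded line breaks with spaces. The gsub/trimws step is an addition of mine, not part of the one-liner above.

```r
# Hypothetical sample file, created only so the sketch runs on its own
path <- tempfile(fileext = ".txt")
writeLines(c("Commercial exploitation over the past two hundred years drove",
             "the great Mysticete whales to near extinction."), path)

# Read the whole file as one string; file.size() returns the size in bytes
x <- readChar(path, file.size(path))

# Replace each run of line-break characters with a space, then trim the ends
one_line <- trimws(gsub("[\r\n]+", " ", x))
```

Note that readChar counts characters while the file size is in bytes, so for multibyte encodings the byte count is simply an upper bound on the number of characters to read.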
Upvotes: 6
Reputation: 1754
I had the same problem a while ago and found a workaround: read the individual lines and then paste them together, with spaces in place of the "\n" newlines:
filename <- "tmp.txt"
paste0(readLines(filename), collapse = " ")
If you need the newlines, then you can read the file as a character string
readChar(filename,1e5)
specifying a sufficiently large number of characters (100000 in this case).
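The lines in the question carry leading and trailing spaces, so pasting them as-is would leave double spaces at the joins. A minimal sketch of the paste approach with trimming added is shown below; the temporary file and its contents are invented for the example.

```r
# Hypothetical sample file with padded lines, similar to the question's output
path <- tempfile(fileext = ".txt")
writeLines(c(" Commercial exploitation over the past two hundred years drove ",
             " the great Mysticete whales to near extinction. "), path)

# readLines() drops the newline from each line; trimws() strips the padding
lines <- trimws(readLines(path))

# Join everything into one string, separated by single spaces
txt <- paste(lines, collapse = " ")
```

The resulting single string could then presumably be handed to tm, for example through VectorSource, to build the Corpus object mentioned in the question.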
Upvotes: 3