Zaynaib Giwa

Reputation: 5856

How to read a text file in R as one line

I am trying to process a text file that is part of a corpus I would like to analyze. To create a Corpus object with the tm package (a text mining package in R), I need this paragraph to become one gigantic vector so that it is read properly.

I have a paragraph

          Commercial exploitation over the past two hundred years drove                  
          the great Mysticete whales to near extinction.  Variation in                   
          the sizes of populations prior to exploitation, minimal                        
          population size during exploitation and current population                     
          sizes permit analyses of the effects of differing levels of                    
          exploitation on species with different biogeographical                         
          distributions and life-history characteristics.

I've used both scan and readLines, and they process the text like this:

[28] " commercial exploitation over the past two hundred years drove "
[29] " the great mysticete whales to near extinction variation in "
[30] " the sizes of populations prior to exploitation minimal "

Is there a way to get rid of the line breaks? Or to read the text file as one gigantic vector?

All of the solutions posted so far have been great, thank you.

Upvotes: 4

Views: 6895

Answers (3)

Rich Scriven

Reputation: 99341

If a lot of processing is done while the file is being read, reading it may take a long time. Consider reading it in unchanged and making the changes afterwards. The stringi package has a function for this particular operation, and its authors write in C, so their functions are nice and fast.

So, assuming you've read in the file line by line (for example with readLines) and named the result txt,

library(stringi)
# join the lines into one string, keeping a newline between them
stri_flatten(txt, collapse = "\n")
# [1] "          Commercial exploitation over the past two hundred years drove                  \n          the great Mysticete whales to near extinction.  Variation in                   \n          the sizes of populations prior to exploitation, minimal                        \n          population size during exploitation and current population                     \n          sizes permit analyses of the effects of differing levels of                    \n          exploitation on species with different biogeographical                         \n          distributions and life-history characteristics."

The string keeps its original layout; it is just flattened into a single element. To check, we can print it with cat:

cat(stri_flatten(txt, collapse = "\n"))
          Commercial exploitation over the past two hundred years drove                  
          the great Mysticete whales to near extinction.  Variation in                   
          the sizes of populations prior to exploitation, minimal                        
          population size during exploitation and current population                     
          sizes permit analyses of the effects of differing levels of                    
          exploitation on species with different biogeographical                         
          distributions and life-history characteristics.
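
If the goal is a single line with no line breaks at all, a minimal sketch along the same lines is to trim each line and join the pieces with a space instead of a newline. The two-element txt below is a made-up stand-in for the lines read from the file, not the real data.

```r
library(stringi)

# Hypothetical stand-in for the lines read from the file
txt <- c("          Commercial exploitation over the past two hundred years drove   ",
         "          the great Mysticete whales to near extinction.  ")

# Trim leading/trailing whitespace on each line, then join with single spaces
one_line <- stri_flatten(stri_trim_both(txt), collapse = " ")

length(one_line)                      # 1
grepl("\n", one_line, fixed = TRUE)   # FALSE
```

The result is a length-one character vector, which is the shape the question asks for.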

Upvotes: 4

Jim

Reputation: 4767

This will read the entire file into a length-one character vector (here file is the path to your text file).

x <- readChar(file, file.info(file)$size)
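
Note that the string read this way still contains the newline characters. As a rough sketch of how they could be removed afterwards, the example below writes a small sample file (the path and its contents are made up for illustration), reads it with readChar, and collapses every run of whitespace into a single space.

```r
# Write a small sample file so the example is self-contained (hypothetical content)
path <- tempfile(fileext = ".txt")
writeLines(c("  first line  ", "  second line  "), path)

# Read the whole file as one string; file.info()$size is the file size in bytes
x <- readChar(path, file.info(path)$size)
length(x)  # 1

# Collapse every run of whitespace (spaces, newlines) into one space,
# then trim the ends
one_line <- trimws(gsub("\\s+", " ", x))
one_line   # "first line second line"
```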

Upvotes: 6

renato vitolo

Reputation: 1754

I had the same problem a while ago and found a workaround: read the individual lines and then paste them together, which drops the "\n" newlines:

filename <- "tmp.txt"
paste0(readLines(filename), collapse = " ")

If you need to keep the newlines, you can read the file as a single character string with

readChar(filename, 1e5)

specifying a sufficiently large number of characters (100000 in this case).
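
For a file padded with spaces like the one in the question, pasting the raw lines together leaves the padding inside the result. One possible variation, sketched here with base R only, is to trim each line before joining. The temporary file and its two lines are invented for the example.

```r
# Create a small padded sample file (hypothetical content)
path <- tempfile(fileext = ".txt")
writeLines(c("   Commercial exploitation drove   ",
             "   the great whales to near extinction.   "), path)

# Read line by line, strip the padding, and join with single spaces
lines <- readLines(path)
one_line <- paste(trimws(lines), collapse = " ")
one_line
# "Commercial exploitation drove the great whales to near extinction."
```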

Upvotes: 3
