monarque13

Reputation: 578

Going from corpus to individual .txt files in R's tm

I have a .csv file with 6000 rows and 2 columns. I would like to write each row as a separate text file. Any ideas as to how this can be done in tm? I tried writeCorpus(), but that function writes only 150 .txt files instead of 6000. Is this a memory issue, or am I doing something wrong in the code?

 library(tm)
 revs<-read.csv("dprpfinals.csv",header=TRUE)
 corp<-Corpus(VectorSource(revs$Review))
 writeCorpus(corp,path=".",filenames=paste(seq_along(revs),".txt",sep=""))

Upvotes: 0

Views: 1844

Answers (2)

Ben

Reputation: 42303

There's no need to use tm for this. Here's a reproducible example that makes a CSV file with 6000 rows and two columns, reads it in, and then turns it into 6000 txt files.

First prepare some data for the example...

# from http://hipsum.co/?paras=4&type=hipster-centric
txt <- "Brunch single-origin coffee photo booth, meggings fixie stumptown pickled mumblecore slow-carb aesthetic ennui Odd Future blog plaid Bushwick. Seitan keffiyeh hashtag Portland, kitsch irony authentic vegan post-ironic. Actually pop-up flexitarian kale chips ethical authentic, stumptown meggings. Photo booth Helvetica farm-to-table Neutra. Selfies blog swag, lomo viral meh chillwave distillery deep v Truffaut. Squid Cosby sweater irony, art party mustache Vice Wes Anderson Bushwick McSweeney's locavore roof party paleo. 3 wolf moon salvia gentrify, taxidermy street art banh mi Portland deep v small batch Truffaut."

# get n random samples of this paragraph
n <- 6000
txt_split <- unlist(strsplit(txt, split = " "))
txts <- sapply(1:n, function(i) paste(sample(txt_split, 10, replace = TRUE), 
                                             collapse  = " "))

# make dataframe then CSV file, two cols, n rows.
my_csv <- data.frame( col_one = 1:n,
                      col_two = txts)
write.csv(my_csv, "my_csv.csv", row.names = FALSE, quote = TRUE)

Now that we have a CSV file that might be similar to yours, we can read it in:

# Read in the CSV file...

x <- read.csv("my_csv.csv", header = TRUE, stringsAsFactors = FALSE)

And now we can write each row of the CSV file to a separate text file (they will appear in your working directory):

# Write each row of the CSV to a txt file
sapply(1:nrow(x), function(i) write.table(paste(x[i,], collapse = " "), 
                                          paste0("my_txt_", i, ".txt"), 
                                          col.names = FALSE, row.names = FALSE))
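One thing to be aware of: write.table() quotes character fields by default, so each file written above contains its text wrapped in double quotes. If plain, unquoted text is preferred, a minimal sketch with writeLines() could look like the following. The small data frame, the `out_dir` folder, and the file name prefix are only assumptions for illustration.

```r
# Sketch: write each row of a data frame to its own text file with writeLines().
# `x` is a small stand-in for the CSV data; `out_dir` is an illustrative folder.
x <- data.frame(col_one = 1:3,
                col_two = c("first review", "second review", "third review"),
                stringsAsFactors = FALSE)

out_dir <- file.path(tempdir(), "txt_out")
dir.create(out_dir, showWarnings = FALSE)

for (i in seq_len(nrow(x))) {
  # unlist() flattens the one-row data frame before its cells are pasted together
  line <- paste(unlist(x[i, ]), collapse = " ")
  writeLines(line, file.path(out_dir, paste0("my_txt_", i, ".txt")))
}

# Count the files that were written: there should be one per row
written <- list.files(out_dir, pattern = "^my_txt_[0-9]+\\.txt$")
length(written)
```

Using seq_len(nrow(x)) rather than 1:nrow(x) also keeps the loop from running when the data frame has zero rows.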

If you really want to use tm, you were on the right track. This works fine for me:

# Read in the CSV file...
x <- read.csv("my_csv.csv", header = TRUE, stringsAsFactors = FALSE)
library(tm)
my_corpus <- Corpus(DataframeSource(x))
writeCorpus(my_corpus)

A version closer to your example also works fine for me:

# col_two holds the text (col_one is just the row index)
corp <- Corpus(VectorSource(x$col_two))
writeCorpus(corp)

If it's not working for you, it might be something unusual about your CSV file, such as unexpected characters. Without more detail about your specific problem, it's hard to say.
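One detail in the question's code that may be worth checking is how the file names are built: seq_along() applied to a data frame counts its columns, not its rows, so it would yield only two names for a two-column file. This may not fully explain the 150 files, so treat the following as a hedged sketch. The `revs` data frame here is a made-up stand-in for the real CSV.

```r
# Sketch: seq_along() on a data frame counts columns, not rows.
# `revs` is a small stand-in for the data read from the CSV in the question.
revs <- data.frame(Id = 1:5,
                   Review = paste("review number", 1:5),
                   stringsAsFactors = FALSE)

seq_along(revs)        # 1 2        (one index per column)
seq_len(nrow(revs))    # 1 2 3 4 5  (one index per row)

# Build one file name per document from the row count
fnames <- paste0(seq_len(nrow(revs)), ".txt")
length(fnames) == length(revs$Review)
```

With names built this way, the call from the question would become writeCorpus(corp, path = ".", filenames = fnames).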

Upvotes: 0

user3969377

Here is an example that splits text into paragraphs, removes the empty lines, and writes each paragraph to its own text file. You would then need to process the text files.

txt="Argument split will be coerced to character, so you will see uses with split = NULL to mean split = character(0), including in the examples below.

Note that splitting into single characters can be done via split = character(0) ; the two are equivalent. The definition of ‘character’ here depends on the locale: in a single-byte locale it is a byte, and in a multi-byte locale it is the unit represented by a ‘wide character’ (almost always a Unicode code point).

A missing value of split does not split the corresponding element(s) of x at all."

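# Split the text on newlines, one element per line
txt2 <- data.frame(para = strsplit(txt, "\n")[[1]], stringsAsFactors = FALSE)
# Keep only the non-empty lines, as a character vector
txt3 <- txt2$para[txt2$para != ""]

# Write each paragraph to its own file: paragraph_1.txt, paragraph_2.txt, ...
for (ip in seq_along(txt3)) {
  fname <- paste0("paragraph_", ip, ".txt")
  fileConn <- file(fname)
  writeLines(txt3[ip], fileConn)
  close(fileConn)
}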
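For the later processing step, the files have to be read back in. A possible sketch is below. It assumes the "paragraph_" naming used above, while the `out_dir` folder and the sample `paras` vector are made up for illustration. Note that list.files() sorts names alphabetically, so the files are reordered by their number.

```r
# Sketch: write a few paragraph files, then read them back in numeric order.
# `out_dir` and `paras` are illustrative, not part of the example above.
out_dir <- file.path(tempdir(), "paragraphs")
dir.create(out_dir, showWarnings = FALSE)

paras <- paste("paragraph text", 1:12)
for (ip in seq_along(paras)) {
  writeLines(paras[ip], file.path(out_dir, paste0("paragraph_", ip, ".txt")))
}

files <- list.files(out_dir, pattern = "^paragraph_[0-9]+\\.txt$",
                    full.names = TRUE)
# Alphabetical order puts paragraph_10 before paragraph_2,
# so reorder by the number embedded in each file name
idx <- as.integer(gsub("[^0-9]", "", basename(files)))
files <- files[order(idx)]

# Read every file back into one character vector, one element per file
texts <- vapply(files, function(f) paste(readLines(f), collapse = "\n"),
                character(1), USE.NAMES = FALSE)
```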

Upvotes: 0
