AI52487963

Reputation: 179

Removing punctuation before tabulating data

I'm having issues with pulling data from the clipboard when it contains lots of punctuation (quotes, commas, etc.). I'm attempting to pull the entirety of Jane Austen's Pride and Prejudice, as plain text, from the clipboard into a variable in R for analysis.

If I run

book <- read.table("clipboard", sep="\n")

I get an "EOF within quoted string" error. If I put the option to not have strings as factors:

book <- read.table("clipboard", sep="\n", stringsAsFactors=F)

I get the same error. This affects the table by putting multiple paragraphs together where quotations are present. If I open the book in a text editor and remove the double quotes and single quotes, then try either read.table option, the result is perfect.

Is there a way to remove punctuation prior to (or during?) the read.table phase? Would I dump the clipboard data into some kind of big vector then read.table off that vector?

Upvotes: 0

Views: 128

Answers (2)

Greg Snow

Reputation: 49640

The read.table function is intended to read in data in a rectangular structure and put it into a data frame. I don't expect that the text of a book would fit that pattern in general. I would suggest reading the data with the scan or readLines function in place of read.table. Read the documentation for those functions on how to deal with quotes and separators.

If you still want to remove punctuation, then look at ?gsub. If you also want to convert all the characters to upper or lower case, see ?chartr (the same help page documents toupper and tolower).
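As a rough sketch of that approach (not tested against the full book), the snippet below reads text line by line with readLines and then strips punctuation with gsub. The two-line sample_text vector is a made-up stand-in for the clipboard contents; on a system where read.table("clipboard") works, readLines("clipboard") would presumably take the place of the text connection.

```r
# Hypothetical sample standing in for the clipboard contents.
sample_text <- c(
  "\"My dear Mr. Bennet,\" said his lady to him one day,",
  "\"have you heard that Netherfield Park is let at last?\""
)

# readLines returns one character string per line and does no quote parsing,
# so unbalanced quotes cannot merge lines together.
con <- textConnection(sample_text)
book <- readLines(con)
close(con)

# Remove punctuation with the POSIX [[:punct:]] class, then lower-case.
clean <- tolower(gsub("[[:punct:]]", "", book))
print(clean)
```

The result is a plain character vector rather than a data frame, which is usually easier to split into words or tabulate afterwards.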

Upvotes: 0

bjoseph

Reputation: 2166

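You need to disable quoting. Passing quote="" stops read.table from treating the quotation marks and apostrophes in the text as string delimiters.

This works for me: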

book <- read.table("http://www.gutenberg.org/cache/epub/1342/pg1342.txt",
                   sep = "\n", quote = "", stringsAsFactors = FALSE)
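To see the effect without downloading anything, here is a small sketch that uses read.table's text argument on a made-up three-line sample containing both double quotes and an apostrophe. The sample_text vector is hypothetical, and comment.char = "" is an extra setting I'm assuming is useful so that any # in the text is not treated as the start of a comment; it is not part of the call above.

```r
# Hypothetical sample with double quotes and a lone apostrophe.
sample_text <- c(
  "It is a truth universally acknowledged,",
  "\"My dear Mr. Bennet,\" said his lady,",
  "that he hadn't heard the news."
)

# quote = "" disables quote handling, so each input line becomes one row.
# comment.char = "" is an added assumption: it keeps '#' characters as text.
book <- read.table(text = sample_text, sep = "\n", quote = "",
                   comment.char = "", stringsAsFactors = FALSE)

print(nrow(book))
print(book$V1)
```

Each line ends up as its own row in the single column V1, with the quotation marks preserved as ordinary characters.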

Upvotes: 1
