Reputation: 335
This is the input file: http://www.yourfilelink.com/get.php?fid=841283. I executed:
options(stringsAsFactors=FALSE)
x=read.csv("test1.csv", header = FALSE, sep="'")
The result is this: http://www.yourfilelink.com/get.php?fid=841284
Instead of the expected 135 rows, I get only 7. The number of columns is correct (13). x[6,10] also contains the content of the rows that follow it, separated only by \n within the string.
Please help; I am stuck on this problem.
Upvotes: 2
Views: 5586
Reputation: 10215
Look at your text and think about what you would make of it if you were a computer. It starts without a delimiter ('), sees the first (') in press releases, and starts to do stupid things after that. Don't just count the first entries that are read; check the output first.
INSERT INTO message VALUES (52,'[email protected]','2000-01-21 04:51:00','<12435833.1075863606729.JavaMail.evans@thyme>','ENRON HOSTS
Upvotes: 1
Reputation: 263481
The described symptom of an extremely long item containing multiple "\n"s suggests you probably need to deal with unmatched quotes. If there is a quote mark in a name or address entry, the parser will wait for the next one before considering the entry complete. Try:
x=read.csv("test1.csv", header = FALSE, sep="'", quote="")
That didn't actually work on the file I downloaded. (Note that read.csv is only a wrapper around read.table with different defaults, such as sep="," and quote="\"", so an explicit sep is passed through rather than ignored.) I needed to first use count.fields with that separator and then use read.table with fill=TRUE. The results were still a bit messed up, with several columns populated only with commas, but at least there is something to work with:
table( count.fields("~/Downloads/test1.txt", sep="'", quote=""))
#  10  13
#   5 130
x <- read.table("~/Downloads/test1.txt", header = FALSE, sep="'", quote="", stringsAsFactors=FALSE, skip=5)
#Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
# line 6 did not have 13 elements
x <- read.table("~/Downloads/test1.txt", header = FALSE, sep="'",
quote="", stringsAsFactors=FALSE, fill=TRUE)
str(x)
#########################################################
'data.frame': 135 obs. of 13 variables:
$ V1 : chr "INSERT INTO message VALUES (52," "INSERT INTO message VALUES (53," "INSERT INTO message VALUES (54," "INSERT INTO message VALUES (55," ...
$ V2 : chr "[email protected]" "[email protected]" "[email protected]" "[email protected]" ...
$ V3 : chr "," "," "," "," ...
$ V4 : chr "2000-01-21 04:51:00" "2000-01-24 01:37:00" "2000-01-24 02:06:00" "2000-02-02 10:21:00" ...
$ V5 : chr "," "," "," "," ...
$ V6 : chr "<12435833.1075863606729.JavaMail.evans@thyme>" "<29664079.1075863606676.JavaMail.evans@thyme>" "<15300605.1075863606629.JavaMail.evans@thyme>" "<10522232.1075863606538.JavaMail.evans@thyme>" ...
$ V7 : chr "," "," "," "," ...
$ V8 : chr "ENRON HOSTS ANNUAL ANALYST CONFERENCE PROVIDES BUSINESS OVERVIEW AND GOALS FOR 2000" "Over $50 -- You made it happen!" "Over $50 -- You made it happen!" "ROAD-SHOW.COM Q4i.COM CHOOSE ENRON TO DELIVER FINANCIAL WEB CONTENT" ...
$ V9 : chr "," "," "," "," ...
$ V10: chr "HOUSTON - Enron Corp. hosted its annual equity analyst conference today in==20Houston. Ken Lay, Enron chairman and chief execu"| __truncated__ "On Wall Street, people are talking about Enron. At Enron, we re talking=20about people...our people. You are the driving forc"| __truncated__ "On Wall Street, people are talking about Enron. At Enron, we re talking=20about people...our people. You are the driving forc"| __truncated__ "HOUSTON =01) Enron Broadband Services (EBS), a wholly owned subsidiary of E=nron=20Corp. and a leader in the delivery of high-b"| __truncated__ ...
$ V11: chr "" "," "," "," ...
$ V12: chr "" "Robert_Badeer_Aug2000Notes FoldersPress releases" "Robert_Badeer_Aug2000Notes FoldersPress releases" "Robert_Badeer_Aug2000Notes FoldersPress releases" ...
$ V13: chr "" ");" ");" ");" ...
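The count.fields step above can also be used to find which lines are short, not just how many there are. This is a minimal sketch on a made-up three-line file (the sample rows, the temporary file, and the names n_fields, expected and bad are all assumptions for illustration, not taken from the original data):

```r
# Hypothetical miniature of the file: the second line is cut short.
lines <- c(
  "INSERT INTO message VALUES (1,'sender1','2000-01-21 04:51:00','Subject one','Body one','Folder');",
  "INSERT INTO message VALUES (2,'sender2','2000-01-24 01:37:00','Subject two",
  "INSERT INTO message VALUES (3,'sender3','2000-01-24 02:06:00','Subject three','Body three','Folder');"
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Number of fields per line when splitting on the apostrophe, with quoting disabled
n_fields <- count.fields(tmp, sep = "'", quote = "", comment.char = "")

# Treat the most common field count as the expected one
expected <- as.integer(names(which.max(table(n_fields))))

# Line numbers that deviate, and their raw text for inspection
bad <- which(n_fields != expected)
readLines(tmp)[bad]
```

Inspecting the raw text of the deviating lines usually shows whether they are truncated records or continuation lines of a multi-line field.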
I got better results with a comma as the separator and only the single quote as the quote character, rather than the default set of both single and double quotes that read.table uses:
x2 <- read.table("~/Downloads/test1.txt", header = FALSE, sep=",",
quote="'", stringsAsFactors=FALSE, fill=TRUE)
str(x2)
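To see why this combination helps, here is a small self-contained sketch using the text= argument of read.table instead of a file. The two sample rows and the name x_demo are made up for illustration; the first row deliberately contains an unmatched double quote and a comma inside a single-quoted field:

```r
# Hypothetical sample rows in the same INSERT style as the real file
sql <- c(
  "INSERT INTO message VALUES (52,'sender1','2000-01-21 04:51:00','ENRON HOSTS','He said \"hello, world','Folder');",
  "INSERT INTO message VALUES (53,'sender2','2000-01-24 01:37:00','Over $50','Plain body','Folder');"
)

# Comma separates fields; only the single quote protects field content,
# so the stray double quote is treated as ordinary text
x_demo <- read.table(text = sql, header = FALSE, sep = ",", quote = "'",
                     stringsAsFactors = FALSE, fill = TRUE, comment.char = "")

dim(x_demo)
x_demo$V5
```

Because the double quote is no longer a quote character, it cannot swallow the following lines, and the comma inside the single-quoted body does not split the field.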
Upvotes: 5