read.csv vs read.table - difficulty in comparing results

Question

I have tab-separated data with a column of addresses, and the addresses themselves contain commas.

I am using read.table to import the data into R, whereas my colleague used read.csv with sep="\t" to do the same, and we end up with a different number of rows in the imported data frames.

Also, when I import the data into Excel, I get the same number of records as read.csv with sep="\t".

What is the most concrete way I can verify which import, and which number of records, is the correct one?

Please let me know what details I can add here to help answer the question.

rbatt · Accepted Answer

Read the help files for the two functions via ?read.table (that'll show both). You'll see that read.csv is just read.table with some of the arguments set to different defaults.
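As a quick alternative to reading the whole help page, the defaults can be put side by side with base R's formals(). This is only a sketch; the selection of arguments in args is my own choice, based on the ones discussed in this answer:

```r
# Arguments discussed in this answer (a hand-picked subset, not all of them)
args <- c("header", "sep", "quote", "dec", "fill", "comment.char")

# formals() returns a function's argument list with its default values;
# deparse() turns each default into readable text
defaults <- data.frame(
  read.table = sapply(formals(read.table)[args], deparse),
  read.csv   = sapply(formals(read.csv)[args], deparse),
  row.names  = args
)
print(defaults)
```

Any row where the two columns differ is a candidate explanation for a difference in the imported data.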

One of those arguments is header. In your read.table call with sep="\t", try also setting header=TRUE.

If that doesn't work, try the following: read.table('file.txt', header=TRUE, sep="\t", quote="\"", dec=".", fill=TRUE, comment.char=""). That call should give exactly the same result as read.csv with sep="\t", because it sets every argument to the value read.csv uses. You can then change those arguments back to the read.table defaults (by not specifying them) to figure out which argument is causing the difference between read.csv and read.table for your data.frame (remember, more than one argument could be responsible). I can easily see ways that the header, sep, quote, comment.char, and fill arguments could affect the number of rows in the output. I can't imagine how dec would have this effect, but I wouldn't be surprised if it matters.
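The one-argument-at-a-time comparison could look roughly like the sketch below. The sample file is made up for illustration (an apostrophe and a # inside the addresses are my assumption about what might be in the real data); point path at the actual file to run the same comparison on it.

```r
# Hypothetical sample: tab-separated, with apostrophes and a '#' in the addresses
path <- tempfile(fileext = ".txt")
writeLines(c(
  "id\taddress",
  "1\t12 King's Road, London",
  "2\tFlat #4, 5 High Street, Leeds",
  "3\t7 Queen's Walk, York"
), path)

# read.csv-style settings, spelled out explicitly in read.table
csv_like <- read.table(path, header = TRUE, sep = "\t", quote = "\"",
                       dec = ".", fill = TRUE, comment.char = "")

# Same call, but quote left at the read.table default ("\"'")
default_quote <- read.table(path, header = TRUE, sep = "\t",
                            dec = ".", fill = TRUE, comment.char = "")

# Same call, but comment.char left at the read.table default ("#")
default_comment <- read.table(path, header = TRUE, sep = "\t", quote = "\"",
                              dec = ".", fill = TRUE)

# Row counts next to the number of data lines in the raw file
print(c(raw_data_lines  = length(readLines(path)) - 1,
        csv_like        = nrow(csv_like),
        default_quote   = nrow(default_quote),
        default_comment = nrow(default_comment)))

# Row counts can agree while the contents differ, so compare values too
print(identical(csv_like, default_comment))
```

Comparing contents as well as row counts matters because an argument such as comment.char can truncate a field without removing the row.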

As a rule, I tend to expect that different input = different output, and when different input = same output, I consider that to be exceptional. The functions you're using are similar, but their differences are different ways of interpreting the text file, so I would expect them to yield different results. Which is "right" is not a matter of one of the functions performing correctly and the other incorrectly; it's a matter of the user understanding what they are doing in relation to the input.
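To relate an import to the input itself, one option is to inspect the raw file with quoting and comment handling switched off, so that every character is taken literally. The sketch below uses base R's readLines() and count.fields(); the sample file is invented for illustration and assumes one record per line with no blank lines.

```r
# Hypothetical sample file; the last line has a stray extra tab
path <- tempfile(fileext = ".txt")
writeLines(c(
  "id\taddress",
  "1\t12 King's Road, London",
  "2\tFlat #4, 5 High Street, Leeds",
  "3\t7 Queen's Walk, York\textra"
), path)

# Number of data lines in the raw file (header excluded)
raw_lines <- readLines(path)
print(length(raw_lines) - 1)

# Fields per line, with no quote or comment characters recognised
n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "")
print(table(n_fields))

# Line numbers whose field count differs from the header's
irregular <- which(n_fields != n_fields[1])
print(raw_lines[irregular])
```

If every line has the same field count, the number of data lines is a reasonable target for nrow() of the imported data frame; irregular lines are the ones worth opening in a text editor.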
