Reputation: 4229
I have a comma-separated dataset of around 10,000 rows. When I read it with read.csv, R created a data frame with fewer rows than the original file: it excluded/rejected 200 rows. When I open the CSV file in Excel, the file looks okay. The file is well formatted for line delimiters and field delimiters (as parsed by Excel).
I have identified the row numbers in my file which are getting rejected but I can't identify the cause by glancing over them.
Is there any way to look at logs, or something similar, that give the reason why R rejected these records?
Upvotes: 7
Views: 29004
Reputation: 71
I had the same issue: the difference between the number of rows in the CSV file and the number of rows read by read.csv() was significant. I used the fread() command from the data.table package in place of read.csv, and it solved the problem.
Upvotes: 7
Reputation: 8105
The OP indicates that the problem is caused by quotes in the CSV-file.
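One way to check this on a given file is to count the quote characters on each line; a line with an odd count is a likely culprit. A minimal sketch in base R, where the sample lines are made up for illustration and stand in for the result of readLines(filename):

```r
# Hypothetical sample; in practice use: lines <- readLines(filename)
lines <- c(
  'id,name',
  '1,alpha',
  '2,5" pipe',   # stray, unbalanced quote
  '3,"beta"',    # properly quoted field
  '4,delta'
)

# Count the double-quote characters on every line
quote_counts <- lengths(regmatches(lines, gregexpr('"', lines, fixed = TRUE)))

# Lines with an odd number of quotes cannot be balanced on their own
suspect <- which(quote_counts %% 2 == 1)
print(lines[suspect])
```

This only flags unbalanced quotes; a legitimately quoted field that spans several lines would also show up here, so the flagged lines still need a manual look.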
When the records in the CSV file are generally not quoted and only a few records contain quotes, the file can be read using the quote="" option of read.csv. This disables quote handling.
data <- read.csv(filename, quote="")
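To illustrate the effect, here is a small sketch with made-up data passed through the text argument instead of a file. With the default quote setting, the stray quote is expected to open a quoted string that swallows the following lines; with quote="" every line is kept as its own row and the quote stays in the data.

```r
# Hypothetical sample with one stray quote in the second record
csv_text <- paste(
  c('id,name', '1,alpha', '2,5" pipe', '3,gamma', '4,delta'),
  collapse = "\n"
)

# Default quoting: R typically warns about an unterminated quoted string
# and returns fewer rows than the input has
broken <- suppressWarnings(read.csv(text = csv_text, stringsAsFactors = FALSE))

# Quoting disabled: the quote character is treated as ordinary data
fixed <- read.csv(text = csv_text, quote = "", stringsAsFactors = FALSE)

print(nrow(broken))
print(nrow(fixed))
print(fixed$name[2])
```

The trade-off is that properly quoted fields are no longer unwrapped, so their surrounding quotes end up in the values, and a comma inside such a field is treated as a separator.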
Another solution is to remove all quotes from the file, but this also modifies the data (your strings no longer contain any quotes) and will cause problems if your fields contain commas.
lines <- readLines(filename)
lines <- gsub('"', '', lines, fixed=TRUE)
data <- read.csv(textConnection(lines))
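The comma problem can be made visible with count.fields, which reports how many fields each line parses into. The sketch below uses made-up lines and a small helper, n_fields, that is only an illustration and not part of any package:

```r
# Hypothetical sample where the quotes protect commas inside a field
with_quotes <- c('id,name', '1,"Smith, John"', '2,"Doe, Jane"')

# Strip every quote, as in the snippet above
stripped <- gsub('"', '', with_quotes, fixed = TRUE)

# Helper (illustrative only): field count per line for a character vector
n_fields <- function(x) {
  con <- textConnection(x)
  on.exit(close(con))
  count.fields(con, sep = ",", quote = "\"")
}

fields_before <- n_fields(with_quotes)  # quotes intact
fields_after  <- n_fields(stripped)     # quotes removed

print(fields_before)
print(fields_after)
```

Once the quotes are gone, the embedded commas split each name into two fields, so the data rows no longer match the header.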
A slightly safer solution, which only touches quotes that are not directly before or after a comma, by doubling them:
lines <- readLines(filename)
lines <- gsub('([^,])"([^,])', '\\1""\\2', lines)
data <- read.csv(textConnection(lines))
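A quick check of what the pattern does, again on made-up lines: the quotes that wrap a whole field sit next to a comma or at the end of the line, so they are left alone, while the stray quote inside a field is doubled.

```r
# Hypothetical sample: one properly quoted field, one stray quote
lines <- c('id,name', '1,"alpha"', '2,5" pipe', '3,gamma', '4,delta')

# Double any quote that has a non-comma character on both sides
escaped <- gsub('([^,])"([^,])', '\\1""\\2', lines)
print(escaped)

# The doubled quote no longer opens a string that runs across lines
data <- read.csv(text = paste(escaped, collapse = "\n"), stringsAsFactors = FALSE)
print(nrow(data))
```

Because the pattern needs a character on each side of the quote, a quote at the very start or end of a line is never matched.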
Upvotes: 19
Reputation: 4921
In your follow-up you ask how to remove double quotes (that is, "") before reading the CSV file in R. This is probably best done as a file preprocessing step, using a one-line sed command in the shell (covered in the Unix & Linux forum). Note that the -i flag edits the file in place.
sed -i 's/""/"/g' test.csv
Upvotes: 0
Reputation: 4229
The records were rejected because of double quotes in the CSV file. I removed the double quotes in Notepad++ before reading the file in R. If you can suggest a better way to remove the double quotes in R (before reading the file), please leave a comment below.
This was pointed out by Jan van der Laan. He deserves the credit.
Upvotes: 1