hbabbar

Reputation: 967

Output for read.csv()

I've been trying to load a csv into R for some processing but I'm facing a strange issue while trying to read the data itself.

The csv doesn't have any headers, and I'm using the following simple code to read the data:

newClick <- read.csv("test.csv", header = F)

And following is the sample dataset :

10000011791441224671,V_Display,exit
10000011951441812316,V_Display,exit
10000013211441319797,V_Display,exit
1000001331441725509,V_Display,exit
10000013681418242863,C_GoogleNonBrand,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,exit
10000014031441295393,V_Display,exit

The output for this data is the expected data frame of 6 obs. of 18 variables.

Here is the tricky part, however. If I add another row to the dataset, like this:

10000011791441224671,V_Display,exit
10000011951441812316,V_Display,exit
1000000191441228436,V_Display,exit
10000013211441319797,V_Display,exit
1000001331441725509,V_Display,exit
10000013681418242863,C_GoogleNonBrand,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,exit
10000014031441295393,V_Display,exit

The output for this is a strange 12 obs. of 3 variables. On closer inspection I realised that the entire second-to-last row got split into 6 rows of three columns each, which is weird.

Any thoughts on this?

Upvotes: 5

Views: 1004

Answers (2)

Rich Scriven

Reputation: 99351

As mentioned in the comments, this occurs because read.csv() determines the number of columns from the first five lines of input. If you're in a jam, here's a workaround that I have tested and that seems to run well. The trick is to pass col.names a vector whose length equals the number of columns in the data. We can get that number with count.fields(). Substitute your file name for file.

## get the number of columns
ncols <- max(count.fields(file, sep = ","))
## read the data, naming every column so the widest row is not wrapped
df <- read.csv(file, header = FALSE, col.names = paste0("V", seq_len(ncols)))

Here is the tested code with your data:

txt <- "10000011791441224671,V_Display,exit\n10000011951441812316,V_Display,exit\n1000000191441228436,V_Display,exit\n10000013211441319797,V_Display,exit\n1000001331441725509,V_Display,exit\n10000013681418242863,C_GoogleNonBrand,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,exit\n10000014031441295393,V_Display,exit"

ncols <- max(count.fields(textConnection(txt), sep = ","))
df <- read.csv(text = txt, header = FALSE, col.names = paste0("V", seq_len(ncols)))
dim(df)
# [1]  7 18
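
A related caveat, offered as an aside rather than as part of the tested code above: the IDs in the first column have up to 20 digits, which is more than a double can hold exactly, so with the default settings they are converted to numeric and may lose precision. A minimal sketch, assuming you want every column kept as text, adds colClasses = "character". The two abbreviated rows and the names rows and df_chr are illustrative only.

```r
# Two abbreviated rows standing in for the real file (illustrative data)
rows <- c(
  "10000011791441224671,V_Display,exit",
  "10000013681418242863,C_GoogleNonBrand,V_Display,exit"
)

# Widest row decides how many column names to supply
con <- textConnection(rows)
ncols <- max(count.fields(con, sep = ","))
close(con)

# colClasses = "character" keeps the long IDs as exact strings
df_chr <- read.csv(text = rows, header = FALSE,
                   col.names = paste0("V", seq_len(ncols)),
                   colClasses = "character")

class(df_chr$V1)
# [1] "character"
df_chr$V1[1]
# [1] "10000011791441224671"
```

If you need to do arithmetic on some columns, you can convert just those afterwards instead of reading everything as numeric.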

Upvotes: 3

flaco777

Reputation: 39

Per the R documentation,

"The number of data columns is determined by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary"

Because the first five rows contain the widest row in the first example but not in the second, the dataset is read correctly in the first case, and the wide row is wrapped onto separate rows in the second.
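
To see this for yourself, you can count the fields on each line. The sketch below rebuilds the seven rows from the second example as a character vector (the names rows and n_fields are just for illustration) and shows that the wide row sits on line 6, outside the first five lines that are inspected.

```r
# The seven rows from the second example, one string per line
rows <- c(
  "10000011791441224671,V_Display,exit",
  "10000011951441812316,V_Display,exit",
  "1000000191441228436,V_Display,exit",
  "10000013211441319797,V_Display,exit",
  "1000001331441725509,V_Display,exit",
  paste(c("10000013681418242863", "C_GoogleNonBrand",
          rep("V_Display", 15), "exit"), collapse = ","),
  "10000014031441295393,V_Display,exit"
)

# Number of comma-separated fields on each line
con <- textConnection(rows)
n_fields <- count.fields(con, sep = ",")
close(con)
n_fields
# [1]  3  3  3  3  3 18  3

# Only the first five lines are used to decide the column count
max(head(n_fields, 5))
# [1] 3
```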

The way to ensure this doesn't happen is to add column headers to your CSV, or to specify the correct number of columns using the col.names argument of the read.csv function.

Upvotes: -1
