Output for read.csv()

Question

I've been trying to load a CSV into R for some processing, but I'm facing a strange issue while reading the data itself.

The CSV doesn't have any headers, and I'm using the following simple code to read the data:

newClick <- read.csv("test.csv", header = FALSE)

The following is the sample dataset:

10000011791441224671,V_Display,exit
10000011951441812316,V_Display,exit
10000013211441319797,V_Display,exit
1000001331441725509,V_Display,exit
10000013681418242863,C_GoogleNonBrand,V_Display,V_Display,V_Display,V_Display,_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,exit
10000014031441295393,V_Display,exit

The output for this data is the expected data frame of 6 obs. of 18 variables.

Here is the tricky part, however. If I add another row to the dataset, like this:

10000011791441224671,V_Display,exit
10000011951441812316,V_Display,exit
1000000191441228436,V_Display,exit
10000013211441319797,V_Display,exit
1000001331441725509,V_Display,exit
10000013681418242863,C_GoogleNonBrand,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,exit
10000014031441295393,V_Display,exit

The output for this is, strangely, 12 obs. of 3 variables. On closer inspection I realised that the entire second-to-last row was split into 6 rows of three columns each.

Any thoughts on this?

Rich Scriven · Accepted Answer

As mentioned in the comments, this occurs because read.csv() determines the number of columns from the first five lines of input. In your first file the 18-field row is line 5, so it is seen; once you add a row above it, it becomes line 6 and only the 3-field rows are examined. If you're in a jam, here's a possible workaround that I have tested and that seems to run well. The trick is to pass col.names a vector with one name per column in the data. We can get the number of columns by using count.fields(). Insert your file name for file.

## get the number of columns
ncols <- max(count.fields(file, sep = ","))
## read the data, supplying one column name per column
df <- read.csv(file, header = FALSE, col.names = paste0("V", seq_len(ncols)))
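To see the five-line rule in action, here is a minimal sketch on a stand-in for the question's file. The short IDs and the names short_row, long_row, fields and wrapped are made up for illustration; only the row layout (five 3-field rows, then an 18-field row on line 6) mirrors the question.

```r
# Stand-in data: five 3-field rows, an 18-field row on line 6, one more 3-field row.
short_row <- "1000001,V_Display,exit"
long_row  <- paste(c("1000002", "C_GoogleNonBrand", rep("V_Display", 15), "exit"),
                   collapse = ",")
txt <- paste(c(rep(short_row, 5), long_row, short_row), collapse = "\n")

# Count the fields on every line; the wide row falls outside the first five lines.
con <- textConnection(txt)
fields <- count.fields(con, sep = ",")
close(con)
fields

# Without col.names, read.csv() sizes the data frame from the first five
# lines (3 columns) and wraps the wide row onto additional rows.
wrapped <- read.csv(text = txt, header = FALSE)
dim(wrapped)
```

The dimensions of wrapped should match the 12 obs. of 3 variables reported in the question, since the 18-field row is spread over six 3-column rows.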

Here is the tested code with your data:

txt <- "10000011791441224671,V_Display,exit
10000011951441812316,V_Display,exit
1000000191441228436,V_Display,exit
10000013211441319797,V_Display,exit
1000001331441725509,V_Display,exit
10000013681418242863,C_GoogleNonBrand,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,exit
10000014031441295393,V_Display,exit"

ncols <- max(count.fields(textConnection(txt), sep = ","))
df <- read.csv(text = txt, header = FALSE, col.names = paste0("V", seq_len(ncols)))
dim(df)
# [1]  7 18
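One caveat, offered as a suggestion rather than part of the tested code: the IDs in the first column have up to 20 digits, so by default they are converted to doubles and may lose precision. A minimal sketch, assuming you want every column kept as text, adds colClasses = "character" to the same call. The two rows below are modelled on the question's data, with the second one shortened.

```r
# Two rows in the same shape as the question's data (second row shortened).
txt <- "10000011791441224671,V_Display,exit
10000013681418242863,C_GoogleNonBrand,V_Display,exit"

# Widest row determines how many column names to supply.
con <- textConnection(txt)
ncols <- max(count.fields(con, sep = ","))
close(con)

# An unnamed colClasses value is recycled across all columns, so the IDs
# stay as character strings instead of being converted to numeric.
df <- read.csv(text = txt, header = FALSE,
               col.names = paste0("V", seq_len(ncols)),
               colClasses = "character")
df$V1
```

If the IDs need to be treated as numbers later, a dedicated big-integer type would be needed; that is outside what this sketch covers.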
