Krantz

Reputation: 1493

How to convert the output from readLines into a dataframe

I am trying to use readLines to import a 17.6 GB CSV file into R. I have tried several approaches discussed here, here, here, and elsewhere, and readLines seems to be the only one that can actually get the data into R.

The problem is that I cannot convert the output from readLines into a data frame that I can use in my analysis. The answers to a related question here do not solve my problem.

Here is my sample data:

write.csv(data.frame(myid=1:10,var=runif(10)),"temp.csv")

dt<-data.frame(myid=1:10,var=runif(10))
dt

   myid       var
1     1 0.5949020
2     2 0.8515591
3     3 0.8139010
4     4 0.3804234
5     5 0.4923082
6     6 0.9933775
7     7 0.1740895
8     8 0.8342808
9     9 0.3958154
10   10 0.9690561

Creating chunks:

file_in <- file("temp.csv","r")
chunk_size <- 100000 # choose the best size for you
x<- readLines(file_in, n=chunk_size)

Opening the output from readLines in R:

View(x)
x
 [1] "\"\",\"myid\",\"var\""      
 [2] "\"1\",1,0.594902001088485"  
 [3] "\"2\",2,0.851559089729562"  
 [4] "\"3\",3,0.81390100880526"   
 [5] "\"4\",4,0.380423351423815"  
 [6] "\"5\",5,0.492308202432469"  
 [7] "\"6\",6,0.993377464590594"  
 [8] "\"7\",7,0.174089450156316"  
 [9] "\"8\",8,0.834280799608678"  
[10] "\"9\",9,0.395815373631194"  
[11] "\"10\",10,0.969056134112179"

Thanks in advance for any help

Upvotes: 1

Views: 2307

Answers (2)

user10915156

Given the output you get after readLines, this must be the content of your CSV file:

"","myid","var"
"1","1","0.5949020"
"2","2","0.8515591"
"3","3","0.8139010"
"4","4","0.3804234"
"5","5","0.4923082"
"6","6","0.9933775"
"7","7","0.1740895"
"8","8","0.8342808"
"9","9","0.3958154"
"10","10","0.9690561"

That is, your values are comma separated and enclosed in double quotes. When I read in this file, I get your output:

dat
 [1] "\"\",\"myid\",\"var\""       "\"1\",\"1\",\"0.5949020\""  
 [3] "\"2\",\"2\",\"0.8515591\""   "\"3\",\"3\",\"0.8139010\""  
 [5] "\"4\",\"4\",\"0.3804234\""   "\"5\",\"5\",\"0.4923082\""  
 [7] "\"6\",\"6\",\"0.9933775\""   "\"7\",\"7\",\"0.1740895\""  
 [9] "\"8\",\"8\",\"0.8342808\""   "\"9\",\"9\",\"0.3958154\""  
[11] "\"10\",\"10\",\"0.9690561\""

So what you need to do is

  • split at the commas
    with unlist(strsplit(..., split = ","))

and

  • remove the double quotes (printed as \" in the output above)
    with gsub("\"", "", ...)

which gives us:

unlist(strsplit(gsub("\"", "", dat), split = ","))

 [1] ""          "myid"      "var"       "1"         "1"         "0.5949020" "2"        
 [8] "2"         "0.8515591" "3"         "3"         "0.8139010" "4"         "4"        
[15] "0.3804234" "5"         "5"         "0.4923082" "6"         "6"         "0.9933775"
[22] "7"         "7"         "0.1740895" "8"         "8"         "0.8342808" "9"        
[29] "9"         "0.3958154" "10"        "10"        "0.9690561"
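
The result above is still a flat character vector, not a data frame. One possible way to finish the conversion is to keep the split lines separate, stack them into a matrix, and convert the columns. This is only a sketch: it assumes every line has the same number of fields and that no field contains an embedded comma, and the names fields, m and df are illustrative.

```r
# Lines as readLines() would return them for the first rows of the file above
dat <- c('"","myid","var"',
         '"1","1","0.5949020"',
         '"2","2","0.8515591"',
         '"3","3","0.8139010"')

# Strip the quotes, then split every line at the commas (one list element per line)
fields <- strsplit(gsub("\"", "", dat), split = ",")

# Stack the data lines (everything but the header) into a character matrix
m <- do.call(rbind, fields[-1])

# Drop the row-name column and convert the remaining columns to numeric
df <- as.data.frame(m[, -1, drop = FALSE], stringsAsFactors = FALSE)
df[] <- lapply(df, as.numeric)

# Column names come from the header line, minus the empty row-name field
names(df) <- fields[[1]][-1]
df
```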

Upvotes: 0

Rui Barradas
Rui Barradas

Reputation: 76575

Here is a complete sequence of instructions to transform the data as you posted into a dataframe.

set.seed(1234)    # Make the results reproducible

write.csv(data.frame(myid=1:10,var=runif(10)),"temp.csv")

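dat <- readLines("temp.csv")
# Split every data line (all but the header) at the commas
df1 <- strsplit(dat[-1], ",")
# Stack the pieces into a character matrix, one row per line
df1 <- do.call(rbind, df1)
# Drop the first column, which holds the row names written by write.csv
df1 <- df1[,-1]
df1 <- as.data.frame(df1)
# Convert every column to numeric
df1[] <- lapply(df1, function(x) as.numeric(as.character(x)))

# Take the column names from the header line, without the quotes
names(df1) <- gsub('"', '', strsplit(dat[1], ',')[[1]][-1], fixed = TRUE)
df1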
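
As an alternative to splitting the strings by hand, the lines returned by readLines can be handed to read.csv through its text argument, which also takes care of the quoting. The sketch below applies that to the chunked reading from the question. It is not part of the steps above, and the names con, header, chunks and df_all are illustrative. It assumes the first line of the file is the header and that the first column holds the row names written by write.csv.

```r
# Small stand-in for the large file
write.csv(data.frame(myid = 1:10, var = runif(10)), "temp.csv")

con <- file("temp.csv", "r")
chunk_size <- 4                      # tiny here; far larger for a real file
header <- readLines(con, n = 1)      # keep the header line for every chunk

chunks <- list()
repeat {
  x <- readLines(con, n = chunk_size)
  if (length(x) == 0) break          # end of file reached
  # Prepend the header so read.csv() names the columns consistently
  chunks[[length(chunks) + 1]] <- read.csv(text = c(header, x), row.names = 1)
}
close(con)

# Combine the per-chunk data frames into one
df_all <- do.call(rbind, chunks)
str(df_all)
```

For a file that does not fit in memory, collecting every chunk as done here would defeat the purpose. In that case each chunk would more likely be filtered or summarised inside the loop, with only the reduced result kept.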

Upvotes: 3
