Anthony Damico

Reputation: 6134

How to edit a single line in a large text file with R

I'm reading some large text files into databases with R, but they contain field names that are illegal for the database software. The column names of these large text files are only in the first row -- is it possible to edit just that first row without cycling through every single row in the file (which seems like a waste of resources)?

Here are two examples of what I'm trying to do, using some example data. The first reads everything into RAM, so it won't work for my large data tables. The second would work, but it's slow because it processes every line in the file.

It's important that the solution work across platforms and not require installing external software (aside from R packages), because I'll be sharing this script with others and would rather not ask them to perform more steps than necessary. I'm looking for the fastest way to do this within R only :)

# create two temporary files
tf <- tempfile() ; tf2 <- tempfile()

# write the mtcars data table to a file on the disk
write.csv( mtcars , tf )

# look at the first three lines
readLines( tf , n = 3 )

# read in the entire table
z <- readLines( tf )

# make the only substitution i care about
z[1] <- gsub( 'disp' , 'newvar' , z[1] )

# write the entire table back out to the second temporary file
writeLines( z , tf2 )

# confirm the replacement
readLines( tf2 , 2 )
# done!

# # # # # # # OR

# remove the output file created by the first example
file.remove( tf2 )

# create a file connection to the text file
incon <- file( tf , "r" )

# create a second file connection to the secondary temporary file
outcon <- file( tf2 , "w" )

# read in one line at a time
while( length( one.line <- readLines( incon , 1 ) ) > 0 ){

    # make the substitution on every line
    one.line <- gsub( 'disp' , 'newvar' , one.line )

    # write each line to the second temporary file
    writeLines( one.line , outcon )
}

# close the connections
close( incon ) ; close( outcon )

# confirm the replacement
readLines( tf2 , 2 )
# done!

Upvotes: 11

Views: 11639

Answers (3)

Lauri

Reputation: 1

Have you tried:

iocon <- file("originalFile","r+")
header <- readLines(iocon,n=1)
header <- gsub('disp', 'newvar', header)
writeLines(header, con=iocon)

This would just overwrite the first line and, depending on how R manages system resources, could be very efficient. Be sure to have a backup.
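
For what it's worth, overwriting in place like this is only safe when the replacement header has exactly the same number of bytes as the original -- 'newvar' is two characters longer than 'disp', so writing it back over the first line would clobber the start of the second line. Below is a minimal sketch of the in-place idea with an explicit length check; 'dspx' is a hypothetical same-length name, the file is assumed to be a seekable local file, and note that ?seek cautions against relying on seek() on Windows.

iocon <- file( "originalFile" , "r+b" )              # open for reading and writing, binary mode
header <- readLines( iocon , n = 1 )                 # read the existing header
new.header <- gsub( 'disp' , 'dspx' , header )       # hypothetical same-length replacement
stopifnot( nchar( new.header ) == nchar( header ) )  # otherwise the first data line gets clobbered
seek( iocon , where = 0 , rw = "write" )             # move the write position back to the start
writeChar( new.header , iocon , eos = NULL )         # overwrite only the header bytes
close( iocon )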

Upvotes: -1

Blue Magister

Reputation: 13363

Why don't you edit just the header and then read the rest in chunks? I don't know how big this file is, but perhaps read it in blocks of lines -- I've guessed 10,000 below. Depending on how much memory you have, you can adjust this to be bigger or smaller.

##setup
tf <- tempfile(); tf2 <- tempfile()
write.csv(mtcars,tf)

fr <- file(tf, open="rt") #open file connection to read
fw <- file(tf2, open="wt") #open file connection to write 
header <- readLines(fr,n=1) #read in header
header <- gsub( 'disp' , 'newvar' , header) #modify header    
writeLines(header,con=fw) #write header to file
while(length(body <- readLines(fr,n=10000)) > 0) {
  writeLines(body,fw) #pass rest of file in chunks of 10000
}
close(fr);close(fw) #close connections
#unlink(tf);unlink(tf2) #delete temporary files

It should be faster because R will run through the while loop once every 10,000 lines instead of once per line. Additionally, gsub is called only on the line you want instead of on every line, saving time. R can't edit a file "in place", so to speak, so there is no way around reading and copying the file. If you have to do it in R, make your chunks as big as memory allows and then pass the file through.
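
Once the header has been handled, the body doesn't even need to be read as lines -- it can be copied as raw bytes, with no per-line work at all. A hedged sketch of that variant, assuming tf and tf2 are the temporary files from the setup above and that mixing readLines() and readBin() on the same binary file connection behaves as expected:

frb <- file(tf, open="rb")  #binary read connection
fwb <- file(tf2, open="wb") #binary write connection
header <- readLines(frb,n=1) #consume the header line
writeLines(gsub('disp','newvar',header), fwb) #write the modified header
repeat {
  bytes <- readBin(frb, what="raw", n=1e6) #read up to ~1MB at a time
  if(length(bytes) == 0) break
  writeBin(bytes, fwb) #copy the body untouched
}
close(frb);close(fwb) #close connections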

I saw a 3x performance difference between the two ways:

#test file creation: mtcars written 1e6 times (roughly 33M lines)
tf <- tempfile(); tf2 <- tempfile()
fw <- file(tf,open="wt")
sapply(1:1e6,function(x) write.csv(mtcars,fw))
close(fw)

#my way
system.time({
fr <- file(tf, open="rt") #open file connection to read
fw <- file(tf2, open="wt") #open file connection to write 
header <- readLines(fr,n=1) #read in header
header <- gsub( 'disp' , 'newvar' , header) #modify header    
writeLines(header,con=fw) #write header to file
while(length(body <- readLines(fr,n=10000)) > 0) {
  writeLines(body,fw) #pass rest of file in chunks of 10000
}
close(fr);close(fw) #close connections
})    
#   user  system elapsed 
#  32.96    1.69   34.85 

#OP's way
system.time({
incon <- file( tf , "r" )
outcon <- file( tf2 , "w" )
while( length( one.line <- readLines( incon , 1 ) ) > 0 ){
    one.line <- gsub( 'disp' , 'newvar' , one.line )
    writeLines( one.line , outcon )
}
close( incon ) ; close( outcon )
})
#   user  system elapsed 
# 104.36    1.92  107.03 

Upvotes: 5

eddi

Reputation: 49448

You're using the wrong tool for this. Use a command-line tool instead. E.g. with sed, something like sed -i '1 s/disp/newvar/' file should do it. And if you have to do this from within R, use

filename = 'myfile'
scan(pipe(paste("sed -i '1 s/disp/newvar/' ", filename, sep = "")))
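
The same command can also be run through system() instead of scan(pipe(...)) -- just an alternative sketch, assuming a GNU sed is on the PATH (BSD/macOS sed wants -i '' with an explicit backup suffix instead):

filename = 'myfile'
# edit the first line in place, assuming GNU sed is available
system(sprintf("sed -i '1 s/disp/newvar/' %s", filename))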

Here's a Windows-specific version:

filename = 'myfile'
tf1 = tempfile()
tf2 = tempfile()

# read header, modify and write to file
header = readLines(filename, n = 1)
header = gsub('disp', 'newvar', header)
writeLines(header, tf1)

# cut the rest of the file to a separate file
scan(pipe(paste("more ", filename, " +1 > ", tf2)))

# append the two bits together
file.append(tf1, tf2)

# tf1 now has what you want
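
If the goal is to end up with the modified file under the original name, a hedged follow-up -- file.copy() with overwrite = TRUE is used here rather than file.rename(), in case the temporary directory sits on a different drive:

# copy the combined file back over the original
file.copy(tf1, filename, overwrite = TRUE)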

Upvotes: 8
