manuthefil

Reputation: 33

Scala : bug with getLines?

I'm facing a problem with very simple file handling in Scala, and I can't tell whether it is a bug or a misunderstanding on my part. It is reproducible even from a worksheet in the Scala/Eclipse IDE. I'm using IDE 4.6.1 and Scala 2.12.2. The code is very simple:

//********************************
import scala.io.Source
import java.io.File
import java.io.PrintWriter

object Embed {

  val filename = "proteins.csv"
  val handler = Source.fromFile(filename)

  val header:String = handler.getLines().next()
  println (">"+header)
  val header2:String = handler.getLines().next()
  println (">"+header2)

  val header3:String = handler.getLines().next()
  println (">"+header3)
}
//**********************

The first 3 lines of the file are a bit long and will mean little to non-bio specialists:

Protein Group,Protein ID,Accession,Significance,Coverage (%),#Peptides,#Unique,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity,Cond_D Intensity,Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity,Group 3 Intensity,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Description
261,247,P0AFG4|ODO1_ECOL6,200.00,39,30,30,Carbamidomethylation; Deamidation (NQ); Oxidation (M),1.7E5,9.87E4,5.51E4,3.09E4,3.09:1.79:1.00:0.56,1.7E5,9.87E4,5.51E4,3.09E4,3.09:1.79:1.00:0.56,105062,2-oxoglutarate dehydrogenase E1 component OS=Escherichia coli O6:H1 (strain CFT073 / ATCC 700928 / UPEC) GN=sucA PE=3 SV=1
287,657,B7NDL4|MDH_ECOLU,200.00,54,14,1,Carbamidomethylation; Deamidation (NQ); Oxidation (M),6.27E4,4.14E4,1.81E4,1.28E4,3.47:2.29:1.00:0.71,6.27E4,4.14E4,1.81E4,1.28E4,3.47:2.29:1.00:0.71,32336,Malate dehydrogenase OS=Escherichia coli O17:K52:H18 (strain UMN026 / ExPEC) GN=mdh PE=3 SV=1

I won't go into the details of this file, but it has 3600 lines, each containing 20 fields separated by commas and a '' end of line. The first line is the header. I tried also with only and only, with the same result: the first line is read correctly, but the second line read is only the final part of the 8th line in the file, and so on, so I cannot read/parse my file:

Following is the result I get

   val filename = "proteins.csv"
                                                  //> filename  : String = proteins.csv
  val handler = Source.fromFile(filename)         //> handler  : scala.io.BufferedSource = non-empty iterator

  val header:String = handler.getLines().next()   //> header  : String = Protein Group,Protein ID,Accession,Significance,Coverage 
                                                  //| (%),#Peptides,#Unique,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity
                                                  //| ,Cond_D Intensity,Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity
                                                  //| ,Group 3 Intensity,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Descrip
                                                  //| tion
  println (">"+header)                            //> >Protein Group,Protein ID,Accession,Significance,Coverage (%),#Peptides,#Uni
                                                  //| que,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity,Cond_D Intensity,
                                                  //| Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity,Group 3 Intensity
                                                  //| ,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Description
  val header2:String = handler.getLines().next()  //> header2  : String = TCC 700928 / UPEC) GN=fumA PE=3 SV=2
  println (">"+header2)                           //> >TCC 700928 / UPEC) GN=fumA PE=3 SV=2

  val header3:String = handler.getLines().next()  //> header3  : String = n SE11) GN=zapB PE=3 SV=1
  println (">"+header3)                           //> >n SE11) GN=zapB PE=3 SV=1

Any idea what I'm doing wrong? Many thanks for helping. No hurry: this is part of an attempt to use Scala, and I'll now go back to Python to get the job done!

Upvotes: 2

Views: 111

Answers (2)

Ramesh Maharjan

Reputation: 41957

Your mistake is that you call handler.getLines() three times, i.e. a BufferedLineIterator is instantiated three times, and each instance calls next() while reading from the same underlying source. Each new iterator picks up from wherever the shared source currently stands, and an earlier iterator may have buffered well past the line it returned. That's why the output looks random.

The correct way is to create only one iterator with handler.getLines() and call next() on it repeatedly:

val linesIterator = handler.getLines()

val header:String = linesIterator.next()
println (">"+header)
val header2:String = linesIterator.next()
println (">"+header2)

val header3:String = linesIterator.next()
println (">"+header3)

Better still, you don't need to call next() at all if you iterate with a for loop:

for(lines <- handler.getLines()){
  println(">"+lines)
}
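
A minimal self-contained sketch of the same idea, assuming a small temporary file stands in for proteins.csv (the sample contents and the helper name `readHeaderAndRows` are made up for illustration). It takes a single iterator, pulls the header with next(), consumes the rest, and closes the source afterwards:

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object SingleIteratorDemo {
  // Hypothetical helper: returns the header line and the remaining lines.
  def readHeaderAndRows(path: String): (String, List[String]) = {
    val source = Source.fromFile(path)
    try {
      val lines = source.getLines()   // one iterator, reused for every read
      val header = lines.next()       // assumes the file has at least one line
      (header, lines.toList)          // materialize before the source is closed
    } finally source.close()
  }

  def main(args: Array[String]): Unit = {
    // Stand-in data, not the real proteins.csv
    val tmp = File.createTempFile("proteins-demo", ".csv")
    tmp.deleteOnExit()
    val out = new PrintWriter(tmp)
    try out.write("id,name\n1,sucA\n2,mdh\n") finally out.close()

    val (header, rows) = readHeaderAndRows(tmp.getPath)
    println(">" + header)
    rows.foreach(r => println(">" + r))
  }
}
```

The try/finally is there because Source.fromFile keeps the file open until close() is called; it also works on Scala 2.12, which the question uses.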

Upvotes: 1

Alexey Sirenko

Reputation: 462

If I understand you correctly, the problem is that every time you call handler.getLines() you receive a new Iterator[String] object instead of continuing with the one you already had, so the reads don't follow on from each other. You should try something like this:

val lineIterator = Source.fromFile("proteins.csv").getLines() // Get the iterator object
val firstLine = lineIterator.next()
val secondLine = lineIterator.next()
val thirdLine = lineIterator.next()

Or this:

val lines = Source.fromFile("proteins.csv").getLines().toIndexedSeq // Read all lines into an indexed sequence
val n = 2 // Zero-based index, so this is the third line
val nLine = lines(n)
println(nLine)
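
If the whole file fits in memory (3600 lines should), something along these lines loads everything once, closes the source, and then splits each row on commas. This is only a sketch: `loadLines`, `fields` and the sample data are assumptions for illustration, and the plain split assumes no field contains a quoted comma.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object LoadAllLinesDemo {
  // Hypothetical helper: reads every line into memory, then closes the file.
  def loadLines(path: String): IndexedSeq[String] = {
    val source = Source.fromFile(path)
    try source.getLines().toIndexedSeq finally source.close()
  }

  // Naive CSV split; the -1 limit keeps trailing empty fields.
  def fields(line: String): Array[String] = line.split(",", -1)

  def main(args: Array[String]): Unit = {
    // Stand-in data, not the real proteins.csv
    val tmp = File.createTempFile("proteins-demo", ".csv")
    tmp.deleteOnExit()
    val out = new PrintWriter(tmp)
    try out.write("Protein Group,Accession\n261,P0AFG4|ODO1_ECOL6\n287,B7NDL4|MDH_ECOLU\n")
    finally out.close()

    val lines = loadLines(tmp.getPath)
    val header = fields(lines.head)    // first line is the header
    val rows = lines.tail.map(fields)  // every other line is a data row
    println(header.mkString(" | "))
    println(rows(1)(1))                // zero-based: second data row, second field
  }
}
```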

Upvotes: 1
