Reputation: 33
I'm facing a problem on a very simple file usage in scala I don't understand if this is from a bug or a misunderstanding what I'm doing... Even reproducible from a worksheet in scala/eclipse IDE. I'm using IDE4.6.1 and scala 2.12.2 Code is very simple :
//********************************
import scala.io.Source
import java.io.File
import java.io.PrintWriter
object Embed {
val filename = "proteins.csv"
val handler = Source.fromFile(filename)
val header:String = handler.getLines().next()
println (">"+header)
val header2:String = handler.getLines().next()
println (">"+header2)
val header3:String = handler.getLines().next()
println (">"+header3)
}
//**********************
first 3 lines of the file are a bit long and of non sense for non bio specialists :
Protein Group,Protein ID,Accession,Significance,Coverage (%),#Peptides,#Unique,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity,Cond_D Intensity,Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity,Group 3 Intensity,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Description
261,247,P0AFG4|ODO1_ECOL6,200.00,39,30,30,Carbamidomethylation; Deamidation (NQ); Oxidation (M),1.7E5,9.87E4,5.51E4,3.09E4,3.09:1.79:1.00:0.56,1.7E5,9.87E4,5.51E4,3.09E4,3.09:1.79:1.00:0.56,105062,2-oxoglutarate dehydrogenase E1 component OS=Escherichia coli O6:H1 (strain CFT073 / ATCC 700928 / UPEC) GN=sucA PE=3 SV=1
287,657,B7NDL4|MDH_ECOLU,200.00,54,14,1,Carbamidomethylation; Deamidation (NQ); Oxidation (M),6.27E4,4.14E4,1.81E4,1.28E4,3.47:2.29:1.00:0.71,6.27E4,4.14E4,1.81E4,1.28E4,3.47:2.29:1.00:0.71,32336,Malate dehydrogenase OS=Escherichia coli O17:K52:H18 (strain UMN026 / ExPEC) GN=mdh PE=3 SV=1
I won't go into this file details but it is a 3600 lines file, each containing 20 fields separated by commas and a '' end of line. First line is teh header. I tried also with only and only with same result : First line is read correctly but second line read is only the final part of the 8th line in the file, and so on then I cannot read/parse my file :
Following is the result I get
val filename = "proteins.csv"
//> filename : String = proteins.csv
val handler = Source.fromFile(filename) //> handler : scala.io.BufferedSource = non-empty iterator
val header:String = handler.getLines().next() //> header : String = Protein Group,Protein ID,Accession,Significance,Coverage
//| (%),#Peptides,#Unique,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity
//| ,Cond_D Intensity,Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity
//| ,Group 3 Intensity,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Descrip
//| tion
println (">"+header) //> >Protein Group,Protein ID,Accession,Significance,Coverage (%),#Peptides,#Uni
//| que,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity,Cond_D Intensity,
//| Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity,Group 3 Intensity
//| ,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Description
val header2:String = handler.getLines().next() //> header2 : String = TCC 700928 / UPEC) GN=fumA PE=3 SV=2
println (">"+header2) //> >TCC 700928 / UPEC) GN=fumA PE=3 SV=2
val header3:String = handler.getLines().next() //> header3 : String = n SE11) GN=zapB PE=3 SV=1
println (">"+header3) //> >n SE11) GN=zapB PE=3 SV=1
An idea what I do wrong ? Many thanks for helping No hurry : this is part of an attempt to use scala and I'll now go back to Python for doing the job !
Upvotes: 2
Views: 111
Reputation: 41957
Your mistake is that you have called three times handler.getLines()
i.e. BufferedLineIterator
is instantiated three times and each one is calling next
meaning that each instances are trying to read from the same source. And thats the reason you are getting random outputs
The correct way is to create only one instance of handler.getLines()
and call next
on it
val linesIterator = handler.getLines()
val header:String = linesIterator.next()
println (">"+header)
val header2:String = linesIterator.next()
println (">"+header2)
val header3:String = linesIterator.next()
println (">"+header3)
More precisely, you don't even need to call next()
by doing
for(lines <- handler.getLines()){
println(">"+lines)
}
Upvotes: 1
Reputation: 462
If I understand you correctly the problem is that every time you call handler.getLines()
you receive a new Iterator[String]
object that by default points to the first line of the CSV file. You should try something like this:
val lineIterator = Source.fromFile("proteins.csv").getLines() // Get the iterator object
val firstLine = lineIterator.next()
val secondLine = lineIterator.next()
val thirdLine = lineIterator.next()
Or this:
val lines = Source.fromFile("proteins.csv").getLines().toIndexedSeq // Convert iterator to the list of lines
val n = 2
val nLine = lines(n)
println(nLine)
Upvotes: 1