Reputation: 33
I am writing three lists of tripleInt with approximately 277270 rows. My class tripleInt is the following:
class tripleInt (var sub:Int, var pre:Int, var obj:Int)
Additionally, I build each list with Apache Jena components from an RDF file: I transform the RDF elements into ids and store these ids in the different lists. Once I have the lists, I write the files with the following code:
class Indexes (val listSPO:List[tripleInt], val listPSO:List[tripleInt], val listOSP:List[tripleInt] ){
  val sl = listSPO.sortBy(l => (l.sub, l.pre))
  val pl = listPSO.sortBy(l => (l.sub, l.pre))
  //val ol = listOSP.sortBy(l => (l.sub, l.pre))
  var y1:Int=0
  var y2:Int=0
  var y3:Int=0
  val fstream:FileWriter = new FileWriter("patSPO.dat")
  var out:BufferedWriter = new BufferedWriter(fstream)
  //val fstream:FileOutputStream = new FileOutputStream("patSPO.dat")
  //var out:ObjectOutputStream = new ObjectOutputStream(fstream)
  //out.writeObject(listSPO)
  val fstream2:FileWriter = new FileWriter("patPSO.dat")
  var out2:BufferedWriter = new BufferedWriter(fstream2)
  /*val fstream3:FileOutputStream = new FileOutputStream("patOSP.dat")
  var out3:BufferedOutputStream = new BufferedOutputStream(fstream3)*/
  for ( a <- 0 to sl.size-1){
    y1 = sl(a).sub
    y2 = sl(a).pre
    y3 = sl(a).obj
    out.write((y1.toString+","+y2.toString+","+y3.toString+"\n"))
  }
  for ( a <- 0 to pl.size-1){
    y1 = pl(a).sub
    y2 = pl(a).pre
    y3 = pl(a).obj
    out2.write((y1.toString+","+y2.toString+","+y3.toString+"\n"))
  }
  out.close()
  out2.close()
}
This process takes approximately 30 minutes. My PC has 16 GB of RAM and a Core i7, so I don't understand why it takes so long. Is there a way to improve the performance?
Thank you
Upvotes: 0
Views: 53
Reputation: 9100
Yes, you need to choose your data structures wisely. List is for sequential access (Seq), not random access (IndexedSeq). What you are doing is O(n^2) because of indexing large Lists: List is a linked list, so every sl(a) walks a nodes from the head, and doing that for each a up to n adds up to roughly n^2/2 steps. The following should be much faster (O(n), and hopefully easier to read too):
import java.io.{BufferedWriter, FileWriter}

class Indexes (val listSPO: List[tripleInt], val listPSO: List[tripleInt], val listOSP: List[tripleInt] ){
  val sl = listSPO.sortBy(l => (l.sub, l.pre))
  val pl = listPSO.sortBy(l => (l.sub, l.pre))
  var y1:Int=0
  var y2:Int=0
  var y3:Int=0
  val fstream:FileWriter = new FileWriter("patSPO.dat")
  val out:BufferedWriter = new BufferedWriter(fstream)
  try {
    // Iterate over the list directly instead of indexing into it: one pass, O(n)
    for (s <- sl){
      y1 = s.sub
      y2 = s.pre
      y3 = s.obj
      out.write(s"$y1,$y2,$y3\n")
    }
  } finally {
    // Close the writer even if a write fails
    out.close()
  }
  val fstream2:FileWriter = new FileWriter("patPSO.dat")
  val out2:BufferedWriter = new BufferedWriter(fstream2)
  try {
    for ( p <- pl){
      y1 = p.sub
      y2 = p.pre
      y3 = p.obj
      out2.write(s"$y1,$y2,$y3\n")
    }
  } finally {
    out2.close()
  }
}
(It would not hurt to use IndexedSeq/Vector as inputs, but there might be constraints that make List the preferred choice in your case.)
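Since the two write loops differ only in the file name and the list, they could be folded into one helper. The following is only a minimal sketch: the object TripleWriter and the method writeTriples are hypothetical names that do not appear in the question, and the class tripleInt is repeated only so that the snippet compiles on its own.

```scala
import java.io.{BufferedWriter, FileWriter}

// Same shape as the class in the question, repeated so the sketch is self-contained.
class tripleInt(var sub: Int, var pre: Int, var obj: Int)

// Hypothetical helper object, not part of the original code.
object TripleWriter {
  // Writes one "sub,pre,obj" line per triple to the file at `path`.
  def writeTriples(path: String, triples: Iterable[tripleInt]): Unit = {
    val out = new BufferedWriter(new FileWriter(path))
    try {
      // foreach visits every element exactly once, so the loop is O(n)
      // whether `triples` is a List or a Vector.
      triples.foreach(t => out.write(s"${t.sub},${t.pre},${t.obj}\n"))
    } finally {
      out.close() // runs even if a write throws
    }
  }
}
```

With this in place the class body would reduce to calls such as TripleWriter.writeTriples("patSPO.dat", sl). If index-based access is really needed elsewhere, converting once with sl.toVector makes each lookup effectively constant time.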
Upvotes: 1