Reputation: 1
I'm using Scala on top of a Hadoop/Spark framework.
Currently my data has this type:
RDD[(List[(String, Int)], Long)]
And this is an example of the first two rows in this dataset:
(List(("COD_LOCALE_PROGETTO",0), ("CUP",1), ("OC_TITOLO_PROGETTO",2), ("OC_SINTESI_PROGETTO",3), ("OC_LINK",4), ("OC_COD_CICLO",5), ("OC_DESCR_CICLO",6), ("OC_COD_TEMA_SINTETICO",7), ("OC_TEMA_SINTETICO",8), ("COD_GRANDE_PROGETTO",9), ("DESCRIZIONE_GRANDE_PROGETTO",10)),0)
(List(("10CAPORTO-POZZUOLI 1",0), ("J86G08000450003",1), ("INTERVENTO C11 2° LOTTO ¿ 1° STRALCIO FUNZIONALE ¿COLLEGAMENTO TRA TANGENZIALE DI NAPOLI (VIA CAMPANA), RETE VIARIA COSTIERA E PORTO DI POZZUOLI""",2), ("INTERVENTO C11 2° LOTTO ¿ 1° STRALCIO FUNZIONALE ¿COLLEGAMENTO TRA TANGENZIALE DI NAPOLI (VIA CAMPANA), RETE VIARIA COSTIERA E PORTO DI POZZUOLI""",3), ("www.opencoesione.gov.it/progetti/10caporto-pozzuoli-1",4), (1,5), ("Ciclo di programmazione 2007-2013",6), ("07",7), ("Trasporti e infrastrutture a rete",8), (" ",9), (" ",10)),1)
In the real case, each row has 194 columns, and I have more than 1.6 million records in total.
From this dataset, I would like to populate a new List of this type:
List[(String, Int, Int, Int)]
where the String is each single field of a row (COD_LOCALE_PROGETTO, CUP, ...), the first Int is the length of that field (19, 3, ...), the second Int is the position of the field, already encoded in the data right after the string, and the last number is the position of the row in the whole dataset.
I tried this script:
| val Dimensione = item._1.size;
| for(i <- 0 until Dimensione){
| ComponentiOpenCoesione :+= (item._1(i)._1.replace("\"","").toString,
| item._1(i)._1.replace("\"","").toString.size,
| item._1(i)._2.toInt,
| item._2.toLong)}
| })
But it fails: the List of tuples, which I called "ComponentiOpenCoesione", is never filled.
The variable is defined this way:
var ComponentiOpenCoesione : List[(String, Int, Int, Long)] = List();
Can somebody help me? How can I extract the data from the RDD and load it into the List?
Thank you so much.
Upvotes: 0
Views: 340
Reputation: 5078
In Scala, a function returns the value of its last expression. Here your function returns nothing useful (just Unit), because its last statement is the for loop, and a for loop without yield does not produce a value.
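To make the point concrete, here is a minimal sketch in plain Scala, with no Spark involved. The object and method names (`LastExpressionDemo`, `loopLast`, `valueLast`) are made up for illustration; the only difference between the two methods is the final expression.

```scala
object LastExpressionDemo {
  // The body ends with a `for` loop without `yield`, so the result is Unit
  // and the accumulated list is discarded.
  def loopLast(xs: List[Int]): Unit = {
    var acc = List.empty[Int]
    for (x <- xs) acc :+= x * 2
  }

  // Same body, but the accumulator is the final expression, so it is returned.
  def valueLast(xs: List[Int]): List[Int] = {
    var acc = List.empty[Int]
    for (x <- xs) acc :+= x * 2
    acc
  }
}
```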
To correct it, you just have to put ComponentiOpenCoesione as the last statement. So, if you just plan to map your RDD[(List[(String, Int)], Long)] to get an RDD[List[(String, Int, Int, Long)]], your code should be:
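rdd.map(item => {
  // Accumulator local to this row; it is rebuilt for every element of the RDD.
  var ComponentiOpenCoesione: List[(String, Int, Int, Long)] = List()
  val Dimensione = item._1.size
  for (i <- 0 until Dimensione) {
    // Strip the double quotes once and reuse the cleaned value.
    val campo = item._1(i)._1.replace("\"", "")
    ComponentiOpenCoesione :+= ((campo, campo.length, item._1(i)._2, item._2))
  }
  // Last expression of the function: this is what map gets back for the row.
  ComponentiOpenCoesione
})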
You can look at the answers to the "Return in Scala" question to understand how values are returned in Scala.
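If the goal is the single flat List described in the question rather than an RDD of lists, one possible sketch is to express the per-row work as a pure function and let Spark flatten and collect the results. Everything named here (`RowExpansion`, `Row`, `Component`, `expandRow`) is hypothetical, and the Spark call appears only as a comment so the snippet runs without a cluster. It assumes `rdd` is the RDD from the question. Note also that the Spark programming guide warns that mutating a driver-side var from inside a closure is not reliable on a cluster, because each executor works on its own copy; returning values and collecting them avoids that.

```scala
object RowExpansion {
  // Type aliases matching the shapes given in the question.
  type Row = (List[(String, Int)], Long)
  type Component = (String, Int, Int, Long)

  // Turns one row into (cleaned field, field length, field position, row index) tuples.
  def expandRow(row: Row): List[Component] = {
    val (fields, rowIndex) = row
    fields.map { case (raw, position) =>
      val cleaned = raw.replace("\"", "") // drop stray double quotes
      (cleaned, cleaned.length, position, rowIndex)
    }
  }

  def main(args: Array[String]): Unit = {
    // First two fields of the header row shown in the question.
    val header: Row = (List(("COD_LOCALE_PROGETTO", 0), ("CUP", 1)), 0L)
    expandRow(header).foreach(println)

    // With Spark (assumption: `rdd` has type RDD[Row]):
    //   val componenti: List[Component] =
    //     rdd.flatMap(row => expandRow(row)).collect().toList
  }
}
```

For the header row, this yields `("COD_LOCALE_PROGETTO", 19, 0, 0)` and `("CUP", 3, 1, 0)`. Keep in mind that `collect()` brings everything to the driver: with about 1.6 million rows of 194 fields, that is roughly 300 million tuples, which may not fit in driver memory, so keeping the result as an RDD may be the safer choice.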
Upvotes: 0