Reputation: 944
I have a text file with two tab-separated "columns":
Japan<tab>Shinjuku
Australia<tab>Melbourne
United States of America<tab>New York
Australia<tab>Canberra
Australia<tab>Sydney
Japan<tab>Tokyo
I read this file into an RDD and perform the following operation
val myFile = sc.textFile("/user/abc/textfile.txt")
myFile.map(str => str.split("\t")).collect()
which results in
Array[Array[String]] = Array(Array(Japan,Tokyo), Array(United States of America,Washington DC), Array(Australia,Canberra))
But what I want is not Array[Array[String]] but Array[(String, String)], so I tried the following:
myFile.map(str => str.split("\t")).map(arr => (arr[0], arr[1])).collect
And got the following error
<console>:1: error: identifier expected but integer literal found.
myFile.map(str => str.split("\t")).map(arr => (arr[0], arr[1])).collect
^
Could anyone help me with this? What I want is a list of (country, city) so I can perform the following operation
ListThatIWant(Country, City)
.map(a => (a._1, 1))
.reduceByKey(_+_)
.reducebyKey((a, b) => if(a>b) a else b)
This would give me the country with the most cities in the text file, along with the number of cities (occurrences) in that file.
Upvotes: 1
Views: 1573
Reputation: 23099
Here is a simple example with your data, using ; as the separator instead of a tab:
val data = spark.sparkContext.parallelize(
  Seq(
    "Japan;Shinjuku",
    "Australia;Melbourne",
    "United States of America;New York",
    "Australia;Canberra",
    "Australia;Sydney",
    "Japan;Tokyo"
  ))
val exRDD = data.cache()
val result = exRDD.map { rec =>
  // Split each record once; Scala indexes arrays with (), not []
  val fields = rec.split(";")
  (fields(0), fields(1))
}
result.foreach(println)
Output:
(Japan,Shinjuku)
(Australia,Melbourne)
(United States of America,New York)
(Australia,Canberra)
(Australia,Sydney)
(Japan,Tokyo)
This should work the same way with a tab (\t) as the separator. You were trying to index the array with the wrong brackets.
Hope this helps.
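Once you have the (country, city) tuples, the counting step from the question could look roughly like the sketch below. It uses plain Scala collections in place of an RDD so it runs without a cluster; the names lines, pairs, counts and top are only illustrative, and groupBy plus maxBy are stand-ins for the RDD operations, not the same calls.

```scala
// Plain Scala collections stand in for the RDD here (an assumption for illustration).
val lines = Seq(
  "Japan\tShinjuku",
  "Australia\tMelbourne",
  "United States of America\tNew York",
  "Australia\tCanberra",
  "Australia\tSydney",
  "Japan\tTokyo"
)

// Build (country, city) tuples; note arr(0), not arr[0].
val pairs: Seq[(String, String)] = lines.map { line =>
  val arr = line.split("\t")
  (arr(0), arr(1))
}

// Count cities per country (local analogue of map(a => (a._1, 1)).reduceByKey(_ + _)).
val counts: Map[String, Int] =
  pairs.groupBy(_._1).map { case (country, cities) => (country, cities.size) }

// Keep the (country, count) pair with the largest count.
val top: (String, Int) = counts.maxBy(_._2)
println(top) // (Australia,3)
```

On an actual RDD of tuples, the equivalent would probably be pairs.map(a => (a._1, 1)).reduceByKey(_ + _).reduce((a, b) => if (a._2 > b._2) a else b). The last step is a plain reduce comparing the counts, because after the first reduceByKey each country appears only once.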
Upvotes: 1
Reputation: 41957
In Scala, unlike Java, array elements are accessed using () rather than [].
So the correct way is
myFile.map(str => str.split("\t")).map(arr => (arr(0), arr(1))).collect
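As an alternative to indexing, a pattern match on the array can build the tuple and also skip lines that do not have exactly two fields. This is only a sketch on a local Seq with made-up sample lines (the third one is deliberately malformed); the names lines and pairs are illustrative.

```scala
// Local sample data; the last line has no tab and is an assumed malformed record.
val lines = Seq("Japan\tTokyo", "Australia\tCanberra", "no-tab-here")

// collect with a partial function keeps only arrays of exactly two elements
// and turns each into a (country, city) tuple.
val pairs: Seq[(String, String)] = lines
  .map(_.split("\t"))
  .collect { case Array(country, city) => (country, city) }

println(pairs) // List((Japan,Tokyo), (Australia,Canberra))
```

With arr(0) and arr(1), a line without a tab would throw an ArrayIndexOutOfBoundsException; the pattern match drops such lines instead. Which behaviour is preferable depends on whether malformed input should be ignored or surfaced.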
Upvotes: 3