Amber

Reputation: 944

Spark: Text File to (String, String)

I have a text file with two tab-separated "columns":

Japan<tab>Shinjuku
Australia<tab>Melbourne
United States of America<tab>New York
Australia<tab>Canberra
Australia<tab>Sydney
Japan<tab>Tokyo

I read this file into an RDD and perform the following operation

val myFile = sc.textFile("/user/abc/textfile.txt")
myFile.map(str => str.split("\t")).collect()

which results in

Array[Array[String]] = Array(Array(Japan,Tokyo), Array(United States of America,Washington DC), Array(Australia,Canberra))

But what I want is not Array[Array[String]] but Array[(String, String)], so I tried the following

myFile.map(str => str.split("\t")).map(arr => (arr[0], arr[1])).collect

And got the following error

<console>:1: error: identifier expected but integer literal found.
   myFile.map(str => str.split("\t")).map(arr => (arr[0], arr[1])).collect
                                                     ^

Could anyone help me with this? What I want is a list of (country, city) so I can perform the following operation

ListThatIWant(Country, City)
    .map(a => (a._1, 1))
    .reduceByKey(_ + _)
    .reduce((a, b) => if (a._2 > b._2) a else b)

This would give me the country with the most cities in the text file, along with the number of cities (occurrences) in that file.

Upvotes: 1

Views: 1573

Answers (2)

koiralo

Reputation: 23099

Here is a simple example using your data, with the tab separator replaced by ;

val data = spark.sparkContext.parallelize(
  Seq(
    "Japan;Shinjuku",
    "Australia;Melbourne",
    "United States of America;New York",
    "Australia;Canberra",
    "Australia;Sydney",
    "Japan;Tokyo"
  ))

val exRDD = data.cache()

// Split each record once and build a (country, city) tuple.
val result = exRDD.map { rec =>
  val parts = rec.split(";")
  (parts(0), parts(1))
}

result.foreach(println)

Output:

(Japan,Shinjuku)
(Australia,Melbourne)
(United States of America,New York)
(Australia,Canberra)
(Australia,Sydney)
(Japan,Tokyo)

This should work the same way with a tab separator ("\t") as well. You were trying to access the array with the wrong brackets: Scala uses arr(0), not arr[0].
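
For the tab-separated case, a minimal sketch could look like the following. It uses a plain Scala Seq rather than an RDD so it can be pasted into the Scala REPL or spark-shell; `toPair` and the sample `lines` are hypothetical names for illustration, not part of your code.

```scala
// Hypothetical helper: parse one tab-separated line into a (country, city) tuple.
def toPair(line: String): (String, String) = {
  val parts = line.split("\t", 2) // split once, into at most two fields
  (parts(0), parts(1))            // Scala indexes arrays with (), not []
}

// Sample rows taken from the question, joined with a real tab character.
val lines = Seq("Japan\tShinjuku", "Australia\tMelbourne", "Japan\tTokyo")

val pairs: Seq[(String, String)] = lines.map(toPair)
pairs.foreach(println)
```

On an RDD[String] the same function can be passed to map, for example myFile.map(toPair). Note that a line without a tab would make parts(1) throw an ArrayIndexOutOfBoundsException, so this sketch assumes every line is well formed.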

Hope this helps

Upvotes: 1

Ramesh Maharjan

Reputation: 41957

In Scala, unlike Java, array elements are accessed with () rather than []. So the correct way is:

myFile.map(str => str.split("\t")).map(arr => (arr(0), arr(1))).collect
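
Building on that, here is a rough sketch of the full count you described, under a few assumptions: the object name `MostCities`, the helper `topCountry`, and the local SparkSession are all made up for illustration (in spark-shell you would use the existing sc and your own myFile instead). It uses a pattern match on the split array rather than indexing, which also skips lines that do not have exactly two fields.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object MostCities {
  // Hypothetical helper: returns the (country, cityCount) pair with the highest count.
  def topCountry(lines: RDD[String]): (String, Int) =
    lines
      .map(_.split("\t"))
      // Keep only rows with exactly two fields and turn them into tuples.
      .collect { case Array(country, city) => (country, city) }
      .map { case (country, _) => (country, 1) }
      .reduceByKey(_ + _)                           // number of cities per country
      .reduce((a, b) => if (a._2 >= b._2) a else b) // keep the pair with the larger count

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("MostCities").getOrCreate()
    val lines = spark.sparkContext.parallelize(Seq(
      "Japan\tShinjuku",
      "Australia\tMelbourne",
      "United States of America\tNew York",
      "Australia\tCanberra",
      "Australia\tSydney",
      "Japan\tTokyo"
    ))
    println(topCountry(lines)) // (Australia,3)
    spark.stop()
  }
}
```

The final step is reduce rather than a second reduceByKey, because after counting, every key is already unique and the comparison has to happen across keys. If two countries tie, which one is returned is arbitrary, and reduce throws on an empty RDD.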

Upvotes: 3
