user4479371
user4479371

Reputation:

How to add a column from different data frame : Scala Frame

How to add/append a column form different data frame ? I am trying to find the percentile of placeName which are rated 3 and above.

// sc : An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.jsonFile("temp.txt")
//df.show()


val res =  df.withColumn("visited", explode($"visited"))

val result1 =res.groupBy($"customerId", $"visited.placeName").agg(count("*").alias("total"))

val result2 = res
.filter($"visited.rating" < 4)
  .groupBy($"requestId", $"visited.placeName")  
  .agg(count("*").alias("top"))

result1.show()

result2.show()

val finalResult = result1.join(result2, result1("placeName") <=> result2("placeName") && result1("customerId") <=> result2("customerId"), "outer").show()

result1 has rows with total and result2 has total after filtering. Now I am trying to find :

 sqlContext.sql("select top/total as percentile from temp groupBy placeName") 

But finalResult has duplicate columns placeName and customerId. Can someone tell me what I am doing wrong here ? Also is there a way to do this without doing join ?

My Schema :

 {
        "country": "France",
        "customerId": "France001",
        "visited": [
            {
                "placeName": "US",
                "rating": "2",
                "famousRest": "N/A",
                "placeId": "AVBS34"

            },
              {
                "placeName": "US",
                "rating": "3",
                "famousRest": "SeriousPie",
                "placeId": "VBSs34"

            },
              {
                "placeName": "Canada",
                "rating": "3",
                "famousRest": "TimHortons",
                "placeId": "AVBv4d"

            }        
    ]
}

US top = 1 count = 3
Canada top = 1 count = 3


{
        "country": "Canada",
        "customerId": "Canada012",
        "visited": [
            {
                "placeName": "UK",
                "rating": "3",
                "famousRest": "N/A",
                "placeId": "XSdce2"

            },


    ]
}
UK top = 1 count = 1


{
        "country": "France",
        "customerId": "France001",
        "visited": [
            {
                "placeName": "US",
                "rating": "4.3",
                "famousRest": "N/A",
                "placeId": "AVBS34"

            },
              {
                "placeName": "US",
                "rating": "3.3",
                "famousRest": "SeriousPie",
                "placeId": "VBSs34"

            },
              {
                "placeName": "Canada",
                "rating": "4.3",
                "famousRest": "TimHortons",
                "placeId": "AVBv4d"

            }        
    ]
}

US top = 2 count = 3 Canada top = 1 count = 3

PlaceName percnetile US (1+1+2)/(3+1+3) *100 Canada (1+1)/(3+3) *100 UK 1 *100

Schema:

root
|-- country: string(nullable=true)
|-- customerId:string(nullable=true)
|-- visited: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |   |-- placeId: string (nullable = true)
|    |   |-- placeName: string (nullable = true) 
|    |   |-- famousRest: string (nullable = true)
|    |   |-- rating: string (nullable = true)

Upvotes: 0

Views: 726

Answers (1)

user6022341
user6022341

Reputation:

Also is there a way to do this without doing join ?

No

Can someone tell me what I am doing wrong here ?

Nothing. If you don't need both use this:

result1.join(result2, List("placeName","customerId"), "outer")

Upvotes: 1

Related Questions