Jatin
Jatin

Reputation: 31744

Spark generate a dataframe from two json columns

I have a dataframe with two columns. Each column contains json.

cola colb
{"name":"Adam", "age": 23} {"country" : "USA"}

I wish to convert it to:

cola_name cola_age colb_country
Adam 23 USA

How do I do this?


The approach I have in mind is: In the original dataframe, If I can merge both the json to a single json object. I can then obtain the intended result

spark.read.json(df.select("merged_column").as[String])

But cant find an easy way of merging two json object to single json object in spark

Update: The contents of the json is not known pre-hand. Looking for a way to auto-detect schema

Upvotes: 0

Views: 708

Answers (1)

ARCrow
ARCrow

Reputation: 1857

I'm more familiar with pyspark syntax. I think this works:

import pyspark.sql.functions as f
from pyspark.sql.types import *

schema_cola = StructType([
  StructField('name', StringType(), True),
  StructField('age', IntegerType(), True)
])
schema_colb = StructType([
  StructField('country', StringType(), True)
])

df = spark.createDataFrame([('{"name":"Adam", "age": 23}', '{"country" : "USA"}')], ['cola', 'colb'])
display(df
        .withColumn('cola_struct', f.from_json(f.col('cola'), schema_cola))
        .withColumn('colb_struct', f.from_json(f.col('colb'), schema_colb))
        .select(f.col('cola_struct.*'), f.col('colb_struct.*'))
       )

The output looks like this: enter image description here

Upvotes: 1

Related Questions