Taie
Taie

Reputation: 1189

Adding data to an existing apache spark dataframe from a csv file

i have a spark dataframe that has two columns: name, age as follows:

[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]

The dataframe was created using

sqlContext.createDataFrame()

What i need to do next is to add a third column 'UserId' from an external 'csv' file. The external file has several columns but i need to include the first column only, which is the 'UserId':

enter image description here

The number of records in both data sources is the same. I am using a standalone pyspark version on windows os. The final result should be a new dataframe with three columns: UserId, Name, Age.

Any suggestion?

Upvotes: 0

Views: 2672

Answers (3)

Viv
Viv

Reputation: 1584

You can create a new dataframe from the csv.

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Import the csv file to the SparkSQL table.

    df = sqlContext.read.csv("abc.csv")
    df.createOrReplaceTempView(table_a)

    # Create a new dataframe with only the columns required. In your case only user id
     df_1 = spark.sql("select userid from table_a")

    #Now do a join with the existing dataframe which has the original data. ( [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)] )
    # Lets call the original alice-bob dataframe as df_ori. So,

    df_result = df_ori.join(df_1, how=inner, on= (any column cols if there are any or index row)

Upvotes: 0

Taie
Taie

Reputation: 1189

I used pandas to make this work. It allows to join dataframes in many different ways.

1) We need first to import only that extra column (after we remove headers, although this can also be done after the import) and convert it into an RDD

from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)
userid_rdd = sc.textFile("C:……/userid.csv").map(lambda line: line.split(","))

2) Convert the 'userid' RDD into a spark dataframe

userid_df = userid_rdd.toDF(['userid'])
userid_df.show()

3) Convert the 'userid' dataframe into a pandas dataframe

userid_toPandas = userid_df.toPandas()
userid_toPandas

4) Convert the ‘predictions’ dataframe (an existing dataframe) into a pandas dataframe

predictions_toPandas = predictions.toPandas() 
predictions_toPandas

5) Combine the two pandas dataframes into one new dataframe using ‘concat’

import pandas as pd
result = pd.concat([userid_toPandas, predictions_toPandas], axis = 1, ignore_index = True)
result

Upvotes: 1

David
David

Reputation: 220

You can do this by join of two data frame but for that you need to have in booth tables either ids or other keys. I would suggest to just copy it to an excel file if the location of lines is the same other wise you do not have enough information to merge them.

Upvotes: 0

Related Questions