Reputation: 1189
i have a spark dataframe that has two columns: name, age as follows:
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
The dataframe was created using
sqlContext.createDataFrame()
What i need to do next is to add a third column 'UserId' from an external 'csv' file. The external file has several columns but i need to include the first column only, which is the 'UserId':
The number of records in both data sources is the same. I am using a standalone pyspark version on windows os. The final result should be a new dataframe with three columns: UserId, Name, Age.
Any suggestion?
Upvotes: 0
Views: 2672
Reputation: 1584
You can create a new dataframe from the csv.
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# Import the csv file to the SparkSQL table.
df = sqlContext.read.csv("abc.csv")
df.createOrReplaceTempView(table_a)
# Create a new dataframe with only the columns required. In your case only user id
df_1 = spark.sql("select userid from table_a")
#Now do a join with the existing dataframe which has the original data. ( [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)] )
# Lets call the original alice-bob dataframe as df_ori. So,
df_result = df_ori.join(df_1, how=inner, on= (any column cols if there are any or index row)
Upvotes: 0
Reputation: 1189
I used pandas to make this work. It allows to join dataframes in many different ways.
1) We need first to import only that extra column (after we remove headers, although this can also be done after the import) and convert it into an RDD
from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)
userid_rdd = sc.textFile("C:……/userid.csv").map(lambda line: line.split(","))
2) Convert the 'userid' RDD into a spark dataframe
userid_df = userid_rdd.toDF(['userid'])
userid_df.show()
3) Convert the 'userid' dataframe into a pandas dataframe
userid_toPandas = userid_df.toPandas()
userid_toPandas
4) Convert the ‘predictions’ dataframe (an existing dataframe) into a pandas dataframe
predictions_toPandas = predictions.toPandas()
predictions_toPandas
5) Combine the two pandas dataframes into one new dataframe using ‘concat’
import pandas as pd
result = pd.concat([userid_toPandas, predictions_toPandas], axis = 1, ignore_index = True)
result
Upvotes: 1
Reputation: 220
You can do this by join of two data frame but for that you need to have in booth tables either ids or other keys. I would suggest to just copy it to an excel file if the location of lines is the same other wise you do not have enough information to merge them.
Upvotes: 0