ZZzzZZzz

Reputation: 1844

How to append a header value from a file as an extra column in a CSV file using PySpark, for 1000 files

I have been trying to pick up the header line that starts with #Id and add that id value as an extra column, alongside the file name, for each file being processed. Below are sample files to be processed:

File 1:

#sample first line
#Id: abcdef
col1,col2,col3
1,2,3
2,3,3
4,5,6

File 2:

#sample first line
#Id: ghjklo
col1,col2,col3
5,1,3
2,5,8
8,0,4

When I construct the dataframe and print the results, I am able to add the filename as a column using the snippet below.

par_df = spark.read.schema(schema) \
                    .option("header", "true") \
                    .format("com.databricks.spark.csv") \
                    .option("mode", "DROPMALFORMED") \
                    .csv("s3a://" + bucket + "/" + prefix + "/") \
                    .withColumn("FileName", func.input_file_name())

This filters out the header info; below is the snippet to print the result.

parsed_diff_df = par_df.select(
    par_df['col1'],
    par_df['col2'],
    par_df['col3'],
    par_df['FileName'])
parsed_diff_df.registerTempTable("parsed_diff_df_table")
results = sqlContext.sql("select col1, col2, col3, FileName from "
                             "parsed_diff_df_table").collect()

This is the result I get, but I am unable to append the Id column because it has already been filtered out.

1,2,3,File1
2,3,3,File1
4,5,6,File1
5,1,3,File2
2,5,8,File2
8,0,4,File2

Intended result is below.

1,2,3,abcdef,File1
2,3,3,abcdef,File1
4,5,6,abcdef,File1
5,1,3,ghjklo,File2
2,5,8,ghjklo,File2
8,0,4,ghjklo,File2

I have also tried this but no luck.

rdd = sc.textFile("s3a://" + bucket + "/" + prefix + "/") \
        .flatMap(lambda line: line.split("\n")) \
        .filter(lambda line: '#' in line)

results = rdd.collect()
for row in results:
    print(row)

Upvotes: 0

Views: 1002

Answers (2)

Rahul Sharma

Reputation: 5834

Instead of using the CSV loader, implement the steps below to achieve this:

  • Load the data into a pair RDD using sparkContext.wholeTextFiles.
  • Apply a flatMapValues function:
    1. Split each record on the newline character '\n'.
    2. Get the id from the '#Id:' line (the second line in the sample files): split it on ':' and take the second part as the id.
    3. Skip the column-header line, since the schema is predefined.
    4. Append the id to each remaining data line.
  • Apply a map function: split each value into individual columns on ',' and keep the key (the file name) as an extra column.
  • Convert the RDD to a dataset with columns 'col1, col2, col3' (plus the appended Id and FileName).

I am a Java developer, not very hands-on with Python, but something similar might help you:

from pyspark.sql import Row

pairRdd = sc.wholeTextFiles('<path>')

# it won't work exactly as-is, make the required changes:
def appendId(record):
    splits = record.splitlines()
    # '#Id: abcdef' is the second line of each sample file
    file_id = splits[1].split(':')[1].strip()
    output = []
    # data rows start after the two comment lines and the header line
    for s in range(3, len(splits)):
        output.append(splits[s] + ',' + file_id)
    return output

objRdd = pairRdd.flatMapValues(appendId) \
    .map(lambda kv: kv[1].split(',') + [kv[0]]) \
    .map(lambda p: Row(col1=int(p[0]), col2=int(p[1]), col3=int(p[2]), Id=p[3], FileName=p[4]))
dataframe = spark.createDataFrame(objRdd)
.....

Equivalent Java:

JavaPairRDD<String, String> inputRdd = sparkContext.wholeTextFiles("<xyz path>");
JavaRDD<String[]> rows = inputRdd.flatMapValues(new Function<String, Iterable<String>>() {
            @Override
            public Iterable<String> call(String v1) throws Exception {
                String[] splits = v1.split(System.getProperty("line.separator"));
                // '#Id: abcdef' is the second line of each sample file
                String id = splits[1].split(":")[1].trim();
                List<String> values = new ArrayList<String>();
                // data rows start after the two comment lines and the header line
                for (int i = 3; i < splits.length; i++) {
                    values.add(String.format("%s,%s", splits[i], id));
                }
                return values;
            }
        })
        .map(s -> s._2().split(","));

Upvotes: 1

MaFF

Reputation: 10096

You can map the FileName of each file to its id:

Let's write a function to extract the id value:

import re

def extract_id(l):
    return re.search('#Id: ([a-z]+)\\n', l).group(1)
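
For example, applied to the contents of File 1 from the question it returns the id value:

extract_id("#sample first line\n#Id: abcdef\ncol1,col2,col3\n1,2,3\n")  # returns 'abcdef'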

Let's read the files as RDDs:

file_id = sc.wholeTextFiles("/user/at967214/test.csv") \
    .filter(lambda l: l[1][0] == '#') \
    .map(lambda l: [l[0], extract_id(l[1])])

And now the dataframe:

file_id_df = spark.createDataFrame(file_id, ["FileName", "id"])

Now you can join it with your first dataframe:

par_df.join(file_id_df, "FileName", "inner")
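
For completeness, a minimal sketch of the final step, assuming par_df was built as in the question (with the FileName column from input_file_name()) and that the paths returned by input_file_name() and wholeTextFiles match:

joined_df = par_df.join(file_id_df, "FileName", "inner")
joined_df.select("col1", "col2", "col3", "id", "FileName").show()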

Upvotes: 2
