Tilo Wiklund

Reputation: 821

Adding line numbers when parsing many CSV files with Spark

I am currently having Spark parse a large number of small CSV-files into one large dataframe. Something along the lines of

df = spark.read.format("csv").load("file*.csv")

Because of how the data set being parsed is structured, I need the line number of every row in df within its corresponding source CSV file. Is there some simple way of achieving this (preferably without resorting to reconstructing them afterwards by combining input_file_name() with zipWithIndex(); a sketch of that fallback follows the example below)?

For example if

# file1.csv
col1, col2
A, B
C, D

and

# file2.csv
col1, col2
E, F
G, H

I need a resulting data frame equivalent to

row, col1, col2
1, A, B
2, C, D
1, E, F
2, G, H
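For reference, a minimal sketch of the fallback mentioned above, in Scala to match the snippets below (the column names idx and row are illustrative, and it assumes zipWithIndex's read order matches line order within each file):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{input_file_name, row_number}
import org.apache.spark.sql.types.LongType

val base = spark.read.format("csv").load("file*.csv")
  .withColumn("file", input_file_name())

// zipWithIndex numbers rows in read order across the whole dataset
val indexed = spark.createDataFrame(
  base.rdd.zipWithIndex.map { case (r, i) => Row.fromSeq(r.toSeq :+ i) },
  base.schema.add("idx", LongType))

// turn the global index into a per-file line number
val w = Window.partitionBy("file").orderBy("idx")
val df = indexed.withColumn("row", row_number().over(w)).drop("file", "idx")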

Upvotes: 4

Views: 5494

Answers (1)

Avishek Bhattacharya

Reputation: 6994

If any arbitrary ordering of the row numbers in the dataframe is acceptable, you could use one of the following alternatives.

One alternative is to use the monotonically_increasing_id function if you are using Spark 2.x.

Something like this

import org.apache.spark.sql.functions.monotonically_increasing_id

// Each row gets a unique, monotonically increasing id; the ids are not
// consecutive and encode the partition number in their upper bits.
val df = spark.read.format("csv").load("file*.csv")
  .withColumn("rowId", monotonically_increasing_id())

The other alternative would be to use row_number(). But that only works if you have a column to partition the dataframe by.

Something like

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// row_number() requires an ordered window; orderBy("col2") is only a placeholder
val df = spark.read.format("csv").load("file*.csv")
  .withColumn("rowId", row_number().over(Window.partitionBy("col1").orderBy("col2")))

This will ensure the row number is populated per window partition (here, per distinct value of col1).
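For the per-file numbering the question asks about, one possible sketch (not from the snippets above, and relying on the practical but non-contractual fact that monotonically_increasing_id() follows read order within a single input split, which holds when each small file is read as one split) combines the two ideas:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{input_file_name, monotonically_increasing_id, row_number}

// Tag each row with its source file and a read-order id, then number rows
// within each file; the helper columns are dropped at the end.
val perFile = Window.partitionBy("file").orderBy("order")
val df = spark.read.format("csv").load("file*.csv")
  .withColumn("file", input_file_name())
  .withColumn("order", monotonically_increasing_id())
  .withColumn("row", row_number().over(perFile))
  .drop("file", "order")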

However, if you require exact ordering, I am afraid there is no "Sparky" way to do it. The reason is that once you read the data as a dataframe, it loses the ordering with which the data was persisted.

You could merge the CSV files using a Java program on a single machine and add the row numbers in that program.
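A minimal sketch of that idea (in Scala rather than Java, to match the snippets above; the file names and output path are illustrative):

import java.io.PrintWriter
import scala.io.Source

// Number lines per file (skipping each header) and write one merged CSV.
val files = Seq("file1.csv", "file2.csv")
val out = new PrintWriter("merged.csv")
out.println("row,col1,col2")
for (f <- files) {
  val src = Source.fromFile(f)
  for ((line, i) <- src.getLines().drop(1).zipWithIndex) // drop(1) skips the header
    out.println(s"${i + 1},$line")
  src.close()
}
out.close()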

Upvotes: 4
