Akjpunia

Reputation: 141

Parsing a text file to split at specific positions using pyspark

I have a text file that is not delimited by any character, and I want to split each line at specific positions so that I can convert it to a dataframe. Example data in file1.txt below:

1JITENDER33
2VIRENDER28
3BIJENDER37

I want to split each line so that positions 0 to 1 go into the first column, positions 2 to 9 go into the second column, and positions 10 to 11 go into the third column, so that I can finally convert it into a Spark dataframe.
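For reference, that fixed-width split can be expressed with plain Python string slicing. Note that Python slices are zero-based and end-exclusive, so the three fields of the sample data correspond to the slices [0:1], [1:9], and [9:11] (an interpretation of the positions described above, not something stated in the question itself):

```python
# Split one fixed-width record into its three fields
# using zero-based, end-exclusive slices.
line = "1JITENDER33"
fields = [line[0:1], line[1:9], line[9:11]]
print(fields)  # ['1', 'JITENDER', '33']
```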

Upvotes: 1

Views: 1527

Answers (1)

vikrant rana

Reputation: 4674

You can use the Python code below to read your input file and make it comma-delimited using the csv writer; you can then read the result into a dataframe or load it into a Hive external table.

vikrant> cat inputfile
1JITENDER33
2VIRENDER28
3BIJENDER37

import csv
fname_in = '/u/user/vikrant/inputfile'
fname_out = '/u/user/vikrant/outputfile.csv'
cols = [(0, 1), (1, 9), (9, 11)]  # (start, end) slice positions for each field
with open(fname_in) as fin, open(fname_out, 'wt') as fout:
    writer = csv.writer(fout, delimiter=",", lineterminator="\n")
    for line in fin:
        line = line.rstrip()  # strip the trailing '\n' and any other trailing whitespace
        data = [line[c[0]:c[1]] for c in cols]
        print("data:",data)
        writer.writerow(data)


vikrant> cat outputfile.csv
1,JITENDER,33
2,VIRENDER,28
3,BIJENDER,37

You can also wrap this code in a function on a Python class, import that class into your PySpark application code, and use it to transform your plain text file into CSV format. Let me know in case you need more help on this.
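A minimal sketch of that function-based approach, assuming a hypothetical name `fixed_width_to_csv` and using in-memory `io.StringIO` objects in place of real file paths for the demonstration:

```python
import csv
import io

def fixed_width_to_csv(fin, fout, cols):
    """Slice each line of fin at the (start, end) positions in cols
    and write the fields as one CSV row to fout."""
    writer = csv.writer(fout, delimiter=",", lineterminator="\n")
    for line in fin:
        line = line.rstrip()  # drop the trailing newline and whitespace
        writer.writerow([line[start:end] for start, end in cols])

# Demo with in-memory file objects:
src = io.StringIO("1JITENDER33\n2VIRENDER28\n3BIJENDER37\n")
dst = io.StringIO()
fixed_width_to_csv(src, dst, cols=[(0, 1), (1, 9), (9, 11)])
print(dst.getvalue())
# 1,JITENDER,33
# 2,VIRENDER,28
# 3,BIJENDER,37
```

Because the function takes file-like objects rather than paths, the same code works unchanged on real files opened with `open()`, and it can be imported into a PySpark driver script to pre-process the input before `spark.read.csv` is called.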

Upvotes: 2

Related Questions