bramb

Reputation: 253

Applying a custom file parser on a PySpark RDD

I have a set of custom log files that I need to parse. I am currently working in Azure Databricks, but I am quite new to PySpark. The log files are hosted in an Azure Blob Storage Account, which is mounted to our Azure Databricks instance.

An example of an input log file:

Value_x: 1
Value_y: "Station"

col1;col2;col3;col4;
A1;B1;C1;D1;
A2;B2;C2;D2;
A3;B3;C3;D3;

The desired output is a list of strings, but a list of lists would also work:

['A1;B1;C1;D1;1;Station',
'A2;B2;C2;D2;1;Station',
'A3;B3;C3;D3;1;Station']

The snippet of code I use to apply these transformations:

from pyspark import SparkConf
from pyspark.context import SparkContext

def custom_parser(file, content):
    # Strip quotes and carriage returns, then split into non-empty lines.
    content_ = content.replace('"', '').replace('\r', '').split('\n')
    content_ = [line for line in content_ if len(line) > 0]

    # The first two lines carry the header values.
    x = content_[0].split('Value_x:')[-1].strip()
    y = content_[1].split('Value_y:')[-1].strip()

    # Drop the two header lines and the column-name row, then append
    # the header values to every remaining data row.
    content_ = content_[3:]
    content_ = [line + ';'.join([x, y]) for line in content_]
    return content_

sc = SparkContext.getOrCreate(SparkConf())

files = sc.wholeTextFiles('spam/eggs/').collect()

parsed_content = []
for file, content in files:
    parsed_content += custom_parser(file, content)

I have developed a custom_parser function to handle the content of these log files. But I am left with some questions:

  1. Can I apply this custom_parser function directly to the Spark RDD returned by sc.wholeTextFiles, so that I can use the parallelization features of Spark?
  2. Is parsing the data in such an ad-hoc way the most performant approach?

Upvotes: 0

Views: 806

Answers (1)

ShemTov

Reputation: 707

You cannot apply your custom_parser function directly to sc.wholeTextFiles, but you can use custom_parser as a map function. After reading your files you get an RDD[String, String] of (path, content) pairs; apply the parser with rdd.map(custom_parser) and then write the result wherever you need it. That way the work runs in parallel on the executors, instead of entirely on the driver as it does now.
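A minimal sketch of that approach, assuming the custom_parser from the question. Since custom_parser takes the path and content as two separate arguments and returns a list of lines per file, the (path, content) pair is unpacked explicitly and flatMap is used instead of map to flatten the per-file lists into one RDD of lines:

from pyspark import SparkConf
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate(SparkConf())

# RDD of (path, content) pairs, one element per file.
rdd = sc.wholeTextFiles('spam/eggs/')

# The parsing now runs on the executors; flatMap flattens the
# per-file lists of lines into a single RDD of strings.
parsed = rdd.flatMap(lambda pair: custom_parser(pair[0], pair[1]))

# Collect (or better, write with saveAsTextFile) only at the end,
# after the parallel work is done.
parsed_content = parsed.collect()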

Upvotes: 1
