Reputation: 789
The following program throws an error:
from pyparsing import Regex, re
from pyspark import SparkContext
sc = SparkContext("local","hospital")
LOG_PATTERN ='(?P<Case_ID>[^ ;]+);(?P<Event_ID>[^ ;]+);(?P<Date_Time>[^ ;]+);(?P<Activity>[^;]+);(?P<Resource>[^ ;]+);(?P<Costs>[^ ;]+)'
logLine=sc.textFile("C:\TestLogs\Hospital.log").cache()
#logLine='1;35654423;30-12-2010:11.02;register request;Pete;50'
for line in logLine.readlines():
match = re.search(LOG_PATTERN,logLine)
Case_ID = match.group(1)
Event_ID = match.group(2)
Date_Time = match.group(3)
Activity = match.group(4)
Resource = match.group(5)
Costs = match.group(6)
print Case_ID
print Event_ID
print Date_Time
print Activity
print Resource
print Costs
Error:
Traceback (most recent call last): File "C:/Spark/spark-1.6.1-bin-hadoop2.4/bin/hospital2.py", line 7, in for line in logLine.readlines(): AttributeError: 'RDD' object has no attribute 'readlines'
If I add the open
function to read the file, I get the following error:
Traceback (most recent call last): File "C:/Spark/spark-1.6.1-bin-hadoop2.4/bin/hospital2.py", line 7, in f = open(logLine,"r") TypeError: coercing to Unicode: need string or buffer, RDD found
I can't seem to figure out how to read the file line by line and extract the fields that match the pattern.
If I pass only a single log line, logLine='1;35654423;30-12-2010:11.02;register request;Pete;50',
it works. I'm new to Spark and know only the basics of Python. Please help.
Upvotes: 0
Views: 2755
Reputation: 33
As Matei answered, readlines() is part of the Python file API, while sc.textFile() creates an RDD, hence the error that an RDD has no attribute readlines().
If you have to process the file using the Spark APIs, you can use the filter API on the RDD to keep the lines matching a pattern, and then split each remaining line on the delimiter.
For example:
logLine = sc.textFile("C:\TestLogs\Hospital.log")
logLine_filtered = logLine.filter(lambda x: "<pattern>" in x)              # keep lines containing the pattern
logLine_output = logLine_filtered.map(lambda a: a.split("<delimiter>"))    # split each line into fields
logLine_output.first()
A DataFrame would be even better.
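A rough sketch of that DataFrame route, for Spark 1.6. The column names are an assumption taken from the question's regex groups, and the Spark calls are left commented because they need an existing SparkContext sc; the splitting itself is plain Python:

```python
# Assumed column names, taken from the named groups in the question's pattern
columns = ["Case_ID", "Event_ID", "Date_Time", "Activity", "Resource", "Costs"]

def to_row(line):
    # Split one semicolon-delimited log line into its six fields
    return line.split(";")

# With Spark 1.6 (assumes `sc` is an existing SparkContext):
# from pyspark.sql import SQLContext
# sqlContext = SQLContext(sc)
# rows = sc.textFile(r"C:\TestLogs\Hospital.log").map(to_row)
# df = sqlContext.createDataFrame(rows, columns)
# df.filter(df.Resource == "Pete").show()

print(to_row('1;35654423;30-12-2010:11.02;register request;Pete;50'))
```

With named columns you can filter and select fields by name instead of by split index.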
Upvotes: 1
Reputation: 1195
You are mixing things up. The line
logLine=sc.textFile("C:\TestLogs\Hospital.log")
creates an RDD, and RDDs do not have a readlines() method. See the RDD API here:
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD
You can use collect() to retrieve the content of the RDD line by line. readlines() is part of the standard Python file API, but you do not usually need it when working with files in Spark: you simply load the file with textFile() and then process it with the RDD API, see the link above.
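For completeness, a small sketch of that approach: the per-line regex extraction moves into a plain Python function, which Spark then applies to every line with map() instead of a for-loop over readlines(). The pattern is the one from the question; the Spark lines are commented because they assume an existing SparkContext sc:

```python
import re

# The pattern from the question, unchanged
LOG_PATTERN = '(?P<Case_ID>[^ ;]+);(?P<Event_ID>[^ ;]+);(?P<Date_Time>[^ ;]+);(?P<Activity>[^;]+);(?P<Resource>[^ ;]+);(?P<Costs>[^ ;]+)'

def parse_line(line):
    # Return the six fields as a tuple, or None if the line does not match
    match = re.search(LOG_PATTERN, line)
    return match.groups() if match else None

# With Spark, the same function is applied per line (assumes `sc` exists):
# parsed = sc.textFile(r"C:\TestLogs\Hospital.log").map(parse_line).filter(lambda t: t is not None)
# for fields in parsed.collect():
#     print(fields)

print(parse_line('1;35654423;30-12-2010:11.02;register request;Pete;50'))
```

Note that map() passes each individual line to the function, which fixes the original bug of calling re.search on the whole RDD.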
Upvotes: 2