Reputation: 789
The following program throws an error:
from pyparsing import Regex, re
from pyspark import SparkContext
sc = SparkContext("local","hospital")
LOG_PATTERN ='(?P<Case_ID>[^ ;]+);(?P<Event_ID>[^ ;]+);(?P<Date_Time>[^ ;]+);(?P<Activity>[^;]+);(?P<Resource>[^ ;]+);(?P<Costs>[^ ;]+)'
logLine=sc.textFile("C:\TestLogs\Hospital.log").cache()
#logLine='1;35654423;30-12-2010:11.02;register request;Pete;50'
for line in logLine.readlines():
match = re.search(LOG_PATTERN,logLine)
Case_ID = match.group(1)
Event_ID = match.group(2)
Date_Time = match.group(3)
Activity = match.group(4)
Resource = match.group(5)
Costs = match.group(6)
print Case_ID
print Event_ID
print Date_Time
print Activity
print Resource
print Costs
Error:
Traceback (most recent call last): File "C:/Spark/spark-1.6.1-bin-hadoop2.4/bin/hospital2.py", line 7, in for line in logLine.readlines(): AttributeError: 'RDD' object has no attribute 'readlines'
If I add the open
function to read the file, I get the following error:
Traceback (most recent call last): File "C:/Spark/spark-1.6.1-bin-hadoop2.4/bin/hospital2.py", line 7, in f = open(logLine,"r") TypeError: coercing to Unicode: need string or buffer, RDD found
I can't seem to figure out how to read the file line by line and extract the fields that match the pattern.
If I pass only a single log line, logLine='1;35654423;30-12-2010:11.02;register request;Pete;50',
it works. I'm new to Spark and know only the basics of Python. Please help.
Upvotes: 0
Views: 2755
Reputation: 33
As Matei answered, readlines() is part of the Python file API, while sc.textFile() creates an RDD, hence the error that an RDD has no attribute readlines().
If you have to process the file using the Spark APIs, you can use the filter API on the RDD to keep the lines matching a pattern, and then split each remaining line on the delimiter.
For example:
logLine = sc.textFile("C:\TestLogs\Hospital.log")
logLine_filtered = logLine.filter(lambda x: "<pattern>" in x)              # keep lines containing the pattern
logLine_output = logLine_filtered.map(lambda a: a.split("<delimiter>"))    # split each line into fields
logLine_output.first()
A DataFrame would be even better.
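A rough sketch of that DataFrame route, for Spark 1.6. The column names are an assumption taken from the question's regex groups, and the Spark calls are left commented because they need an existing SparkContext sc; the splitting itself is plain Python:

```python
# Assumed column names, taken from the named groups in the question's pattern
columns = ["Case_ID", "Event_ID", "Date_Time", "Activity", "Resource", "Costs"]

def to_row(line):
    # Split one semicolon-delimited log line into its six fields
    return line.split(";")

# With Spark 1.6 (assumes `sc` is an existing SparkContext):
# from pyspark.sql import SQLContext
# sqlContext = SQLContext(sc)
# rows = sc.textFile(r"C:\TestLogs\Hospital.log").map(to_row)
# df = sqlContext.createDataFrame(rows, columns)
# df.filter(df.Resource == "Pete").show()

print(to_row('1;35654423;30-12-2010:11.02;register request;Pete;50'))
```

With named columns you can filter and select fields by name instead of by split index.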
Upvotes: 1
Reputation: 1195
You are mixing things up. The line
logLine=sc.textFile("C:\TestLogs\Hospital.log")
creates an RDD, and RDDs do not have a readlines() method. See the RDD API here:
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD
You can use collect() to retrieve the content of the RDD line by line. readlines() is part of the standard Python file API, but you do not usually need it when working with files in Spark: you simply load the file with textFile() and then process it with the RDD API, see the link above.
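For completeness, a small sketch of that approach: the per-line regex extraction moves into a plain Python function, which Spark then applies to every line with map() instead of a for-loop over readlines(). The pattern is the one from the question; the Spark lines are commented because they assume an existing SparkContext sc:

```python
import re

# The pattern from the question, unchanged
LOG_PATTERN = '(?P<Case_ID>[^ ;]+);(?P<Event_ID>[^ ;]+);(?P<Date_Time>[^ ;]+);(?P<Activity>[^;]+);(?P<Resource>[^ ;]+);(?P<Costs>[^ ;]+)'

def parse_line(line):
    # Return the six fields as a tuple, or None if the line does not match
    match = re.search(LOG_PATTERN, line)
    return match.groups() if match else None

# With Spark, the same function is applied per line (assumes `sc` exists):
# parsed = sc.textFile(r"C:\TestLogs\Hospital.log").map(parse_line).filter(lambda t: t is not None)
# for fields in parsed.collect():
#     print(fields)

print(parse_line('1;35654423;30-12-2010:11.02;register request;Pete;50'))
```

Note that map() passes each individual line to the function, which fixes the original bug of calling re.search on the whole RDD.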
Upvotes: 2