Reputation: 35928
I have a log file that I would like to analyze using Spark SQL. The format of the log file is like this:
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
I have a regular expression pattern that I can use to parse the data:
Pattern.compile("""^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] \"(\S+) (\S+) (\S+)\" (\d{3}) (\d+)""")
Additionally, I've also created the case class:
case class LogSchema(ip: String, client: String, userid: String, date: String, method: String, endpoint: String, protocol: String, response: String, contentsize: String)
However, I'm unable to convert this into a table on which I can run spark sql queries.
How can I use the regex pattern to parse the data and put it in a table?
Upvotes: 1
Views: 1865
Reputation: 41957
Say you have your log file in /home/user/logs/log.txt
, then you can use the following logic to get the table
/dataframe
from the log file.
val rdd = sc.textFile("/home/user/logs/log.txt")
val pattern = Pattern.compile("""^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] \"(\S+) (\S+) (\S+)\" (\d{3}) (\d+)""")
val df = rdd.map(line => pattern.matcher(line)).map(elem => {
elem.find
LogSchema(elem.group(1), elem.group(2), elem.group(3), elem.group(4), elem.group(5), elem.group(6), elem.group(7), elem.group(8), elem.group(9))
}).toDF()
df.show(false)
You should have following dataframe
+-------------+------+------+--------------------------+------+--------+--------+--------+-----------+
|ip |client|userid|date |method|endpoint|protocol|response|contentsize|
+-------------+------+------+--------------------------+------+--------+--------+--------+-----------+
|71.19.157.174|- |- |24/Sep/2014:22:26:12 +0000|GET |/error |HTTP/1.1|404 |505 |
+-------------+------+------+--------------------------+------+--------+--------+--------+-----------+
I have used the case class
you've provided.
case class LogSchema(ip: String, client: String, userid: String, date: String, method: String, endpoint: String, protocol: String, response: String, contentsize: String)
Upvotes: 4