Regex on Spark RDD[String] with Regex on multiline

Question

I'm trying to parse a log file in Spark 1.6 using Scala. Here is the sample data:

2017-02-04 04:48:11,123 DEBUG [org.quartz.core.QuartzSchedulerThread] - 
2017-02-04 04:48:20,892 INFO [org.jasig.inspektr.audit.support.Slf4jLoggingAuditTrailManager] - 
2017-02-04 04:48:32,165 INFO [org.jasig.cas.services.DefaultServicesManagerImpl] - 
2017-02-04 04:48:32,167 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - 
2017-02-04 04:48:38,889 DEBUG [org.quartz.core.QuartzSchedulerThread] - 
2017-02-04 04:48:52,790 DEBUG [org.quartz.core.QuartzSchedulerThread] - 
2017-02-04 04:48:52,790 DEBUG [org.quartz.core.JobRunShell] - 
2017-02-04 04:48:52,790 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - 
2017-02-04 04:48:52,792 DEBUG [org.jasig.casino.services.DefaultServicesManagerImpl] - 
2017-02-04 04:48:52,792 DEBUG [org.jasig.casino.services.DefaultServicesManagerImpl] - 
2017-02-04 04:49:14,365 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - 
2017-02-04 04:49:14,366 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - 
2017-02-04 04:49:19,699 DEBUG [org.quartz.core.QuartzSchedulerThread] - 
2017-02-04 04:49:43,465 DEBUG [org.quartz.core.QuartzSchedulerThread] - 
2017-02-04 04:50:00,978 INFO [org.jasig.casino.authentication.PolicyBasedAuthenticationManager] - 
2017-02-04 04:50:00,978 INFO [org.jasig.casino.authentication.PolicyBasedAuthenticationManager] - 
2017-02-04 04:50:00,978 INFO [org.jasig.inspektr.nhgij.support.Slf4jLogggbhAuditTrailManaver] -

The data goes on like this. Other log data can appear in between the patterns, but it is not relevant to my parsing. I have about 40 GB of files, each containing one day's data.

All these files are gzip-compressed. I tried using sc.wholeTextFiles to get a pair RDD, but I ran into Java heap space errors because each file is between 400 MB and 800 MB (uncompressed).

So I started using sc.textFile and experimenting with reading a single file. I can create an RDD[String], and sc.textFile does not give me any heap space errors when I run an action on this RDD.

Here is the code I tried:

val casinop2 = sc.textFile("/logdata/casino/catalina.out-20150228.gz")

val casop = casinop2.flatMap(x => x.split("\n")).filter(x => !(
  x.contains("Reloading registered services") ||
  x.contains("Loaded 2 services.") ||
  x.contains("DEBUG") ||
  x.contains("ERROR") ||
  x.contains("java.lang.RuntimeException") ||
  x.contains("Caused by:") ||
  x.contains("Granted ticket") ||
  x.contains("java.lang.IllegalStateException") ||
  x.startsWith("\t") ||
  x.contains("org.jasig.cas.authentication.PolicyBasedAuthenticationManager")
))

import scala.util.matching.Regex

val pattern = new Regex("""((\d{4})-(\d{2})-\d{2}\s\d{2}:\d{2}:\d{2}),\d{3}\s+(\w+)\s+\[(.*)\]\s+\-\s+\<.*\s\=*\s+([W][H][O]\:)\s+(.*)\s+([W][H][A][T]\:)\s+(.*)\s+([A][C][T][I][O][N]\:)\s+(.*)\s+([A][P][P][L][I][C][A][T][I][O][N]\:)\s+(.*)\s+([W][H][E][N]\:)\s+(.*)\s+([A-Z\s]{17}\:)\s+(.*)\s+([A-Z\s]{17}\:)\s+(.*)\s+\=*\s\s\>""")

case class MLog(datetime: String, message: String, process: String, who: String, what: String, action: String, application: String, when: String, clientipaddress: String, serveripaddress: String, year: String, month: String)

pattern.findAllMatchIn(casop.collect.toString).toList

The last statement throws a heap space error. The reason I want the RDD in a single string variable is that the regex needs multi-line input, not a single line. For single-line input, I would use map, flatMap, etc.
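As a side note on that last statement: calling toString on a collected Array[String] does not concatenate the lines, so the regex would not see the log text even if memory were sufficient. The small sketch below illustrates the difference with mkString; the object and value names are made up for illustration and are not part of the original code.

```scala
// Hypothetical demo object; names are illustrative only.
object CollectToStringDemo {
  val lines: Array[String] = Array("first line", "second line")

  // toString on a JVM array is the default Object.toString:
  // a type tag plus a hash code, not the array contents.
  val viaToString: String = lines.toString

  // mkString joins the elements, here with a newline between them.
  val viaMkString: String = lines.mkString("\n")
}
```

Even with mkString, collecting a whole file to the driver would still need enough driver memory to hold it, so this only addresses the string conversion, not the heap space error.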

The output I expect from the log file is:

|2017-02-04 04:54:41|   INFO|org.jasig.inspekt...|     s4542732|supplied credenti...|AUTHENTICATION_SU...|        CAS|Sat Feb 04 04:54:...|  175.163.28.77|login.vu.edu.au|2017|   02|
|2017-02-04 04:54:41|   INFO|org.jasig.inspekt...|     s4542732|TGT-78959-EX63Wf2...|TICKET_GRANTING_T...|        CAS|Sat Feb 04 04:54:...|  175.163.28.77|login.vu.edu.au|2017|   02|
|2017-02-04 04:54:41|   INFO|org.jasig.inspekt...|      4542732|ST-474481-jTxCJFB...|SERVICE_TICKET_CR...|        CAS|Sat Feb 04 04:54:...|  175.163.28.77|login.vu.edu.au|2017|   02|
|2017-02-04 04:54:44|   INFO|org.jasig.inspekt...|audit:unknown|ST-474481-jTxCJFB...|SERVICE_TICKET_VA...|        CAS|Sat Feb 04 04:54:...|  203.13.194.68|login.vu.edu.au|2017|   02|
|2017-02-04 04:55:02|   INFO|org.jasig.inspekt...|     s3785573|supplied credenti...|AUTHENTICATION_SU...|        CAS|Sat Feb 04 04:55:...| 101.181.28.125|login.vu.edu.au|2017|   02|
|2017-02-04 04:55:02|   INFO|org.jasig.inspekt...|     s3785573|TGT-78960-yWaWkcN...|TICKET_GRANTING_T...|        CAS|Sat Feb 04 04:55:...| 101.181.28.125|login.vu.edu.au|2017|   02|
|2017-02-04 04:55:02|   INFO|org.jasig.inspekt...|      3785573|ST-474482-rARxdUG...|SERVICE_TICKET_CR...|        CAS|Sat Feb 04 04:55:...| 101.181.28.125|login.vu.edu.au|2017|   02|
|2017-02-04 04:55:02|   INFO|org.jasig.inspekt...|audit:unknown|ST-474482-rARxdUG...|SERVICE_TICKET_VA...|        CAS|Sat Feb 04 04:55:...|  203.13.194.68|login.vu.edu.au|2017|   02|
+-------------------+-------+--------------------+-------------+--------------------+--------------------+-----------+--------------------+---------------+---------------+----+-----+

How can we read multi-line input and feed it to a regex?
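One possible direction, sketched here under assumptions rather than offered as a confirmed solution: instead of collecting everything into one string, merge each timestamped line with the continuation lines that follow it, so that every element is one complete multi-line record. The sketch assumes every record starts with a "yyyy-MM-dd HH:mm:ss,SSS" timestamp and that continuation lines never do. MultilineRecords, groupRecords and recordStart are hypothetical names, not part of Spark or of the code above.

```scala
// Hypothetical helper: merges continuation lines into the timestamped
// line that precedes them. Plain Scala, no Spark dependency.
object MultilineRecords {
  // Assumption: a new record begins with "yyyy-MM-dd HH:mm:ss,SSS" followed by whitespace.
  private val recordStart = """^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s""".r

  // Lazily turns an iterator of lines into an iterator of multi-line records.
  def groupRecords(lines: Iterator[String]): Iterator[String] = new Iterator[String] {
    private val in = lines.buffered

    def hasNext: Boolean = in.hasNext

    def next(): String = {
      // The first line of a record is taken as-is (even if it has no timestamp,
      // which can only happen for lines before the first timestamped line).
      val sb = new StringBuilder(in.next())
      // Append following lines until one starts a new record.
      while (in.hasNext && recordStart.findFirstIn(in.head).isEmpty) {
        sb.append('\n').append(in.next())
      }
      sb.toString
    }
  }
}
```

With Spark this could be applied per partition, for example sc.textFile(path).mapPartitions(MultilineRecords.groupRecords), and the regex could then be run on each record with pattern.findFirstMatchIn inside a flatMap. Gzip files are not splittable, so each .gz file should arrive as a single partition and a record should not be cut at a partition boundary; for splittable input that assumption would not hold. Only one record at a time is held in memory, which avoids building the whole file as one string.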
