Reputation: 701
I am trying to find a way to read and split large log files into events in a functional way. I have an imperative version (it uses mutable state and is not composable). I looked at Best way to read lines in groups in a flat file -- unfortunately, my file does not have a defined delimiter like END. Also, that solution consumes the END line.
My file looks something like this
Nov 28, 2015 2:30:47 PM CST Info Security BEA-090905 Disabling CryptoJ JC ...
Nov 28, 2015 2:30:47 PM CST Info Security BEA-090906 Changing the default .....
2015-11-28 14:33:08,320:ERROR:[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)': [1448742788318]<?xml version="1.0" encoding="UTF-8"?>
<Errors>
<Error ErrorCode="INVERR01"
ErrorDescription="SKU information missing" ErrorUniqueExceptionId="10.7.44.4914487427882870000000000001">
<Attrib
...
Some events are one line, some are stack traces, etc. In the example above, I would like to get 3 events. I have working imperative code
var wip = false
var uow = ""
var sb: StringBuilder = new StringBuilder

Source.
  fromFile(f).
  getLines.
  toStream.
  zipWithIndex.
  foreach {
    case (l, index) =>
      l match {
        case ln if ln.trim == "" =>
        case ln if ue.isBeginLine(ln) && wip =>
          processEvent(sb.toString, ue)
          sb.setLength(0)
          sb.append(ln)
        case ln if ue.isBeginLine(ln) && !wip =>
          wip = true
          sb.append("\n").append(ln)
        case ln if wip => sb.append("\n").append(ln)
        case ln => log.info(">> Worker: Rejecting: %s".format(ln))
      } // match
  } // foreach
I can identify the start of an event with the method ue.isBeginLine. The following is sample code (customized for each log format) -- I will make isBeginLine more generic later.
def isBeginLine(s: String): Boolean =
  s.startsWith("2015") ||
  s.startsWith("<Nov 28") ||
  s.startsWith("WebLogic") ||
  s.startsWith("INFO:") ||
  s.startsWith("WARNING:") ||
  s.startsWith("Parsing") ||
  s.startsWith("Nov 28")
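One way this could be made more generic is to drive the check from a collection of prefixes, so supporting a new log format only means extending a list (a sketch; the prefix list is just the formats above):

```scala
// Data-driven variant: event-start prefixes live in one collection,
// and a line begins an event if it starts with any of them.
val beginPrefixes = Seq(
  "2015", "<Nov 28", "WebLogic", "INFO:", "WARNING:", "Parsing", "Nov 28"
)

def isBeginLine(s: String): Boolean = beginPrefixes.exists(s.startsWith)
```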
As mentioned above, I tried the following (from Best way to read lines in groups in a flat file). Unfortunately, this approach requires a defined terminator or delimiter for each event.
val i = Source.
  fromFile(f).
  getLines

def groupIterator(xs: Iterator[String]) = new Iterator[String] {
  var tmp = new StringBuffer
  def hasNext = xs.hasNext
  def next = xs.takeWhile(!_.startsWith("2015")).mkString("\n")
}

for (g <- groupIterator(i)) println("=======\n" + g + "\n==========")
So out of curiosity, is there a better, functional way to parse log files into events? Ideally, I would like something like the following to aggregate the events.
Source.
  fromFile(f).
  getLines.
  """splitEvents""".
  foldLeft( HashMap[String, Event]() )( .... )
Upvotes: 2
Views: 224
Reputation: 167901
Use the same kind of approach, but use BufferedIterator so you have access to head as a lookahead. It's a little less functional, but you can wrap it yourself to make it act functional again on the outside. The core routine could look something like
def getNextChunk(i: BufferedIterator[String]): Option[Array[String]] =
  if (!i.hasNext) None
  else {
    val ab = Array.newBuilder[String]
    ab += i.next
    while (i.hasNext && !isRecordStart(i.head)) ab += i.next
    Some(ab.result)
  }
and then you just call that over and over until you hit a None. You could, for instance,

Iterator.continually(getNextChunk(i)).
  takeWhile(_.isDefined).
  map(_.get)

to get an iterator of chunks.
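Here's a self-contained sketch of the whole pipeline on the sample data from the question (isRecordStart here is just a stand-in matching the sample log's start lines, and the lines are abbreviated):

```scala
// Stand-in start-of-record test for the sample log formats above.
def isRecordStart(s: String): Boolean =
  s.startsWith("2015") || s.startsWith("Nov 28")

// Lookahead-based chunker: consume the start line, then every line
// up to (but not including) the next start line.
def getNextChunk(i: BufferedIterator[String]): Option[Array[String]] =
  if (!i.hasNext) None
  else {
    val ab = Array.newBuilder[String]
    ab += i.next()
    while (i.hasNext && !isRecordStart(i.head)) ab += i.next()
    Some(ab.result())
  }

val lines = List(
  "Nov 28, 2015 2:30:47 PM CST Info Security BEA-090905 ...",
  "Nov 28, 2015 2:30:47 PM CST Info Security BEA-090906 ...",
  "2015-11-28 14:33:08,320:ERROR:[ACTIVE] ExecuteThread ...",
  "<Errors>",
  "<Error ErrorCode=\"INVERR01\""
)

val it = lines.iterator.buffered
val events: List[Array[String]] =
  Iterator.continually(getNextChunk(it)).takeWhile(_.isDefined).map(_.get).toList
// events has 3 entries: two one-line BEA events and one multi-line ERROR event
```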
Or you could make your own GroupedIterator off of a BufferedIterator that implements the same thing; it'll probably be a bit more efficient that way.
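A sketch of that wrapper (the class name ChunkedIterator and its shape are my own; it yields each event directly as a List[String], skipping the Option per chunk):

```scala
// Wraps a BufferedIterator[String], grouping lines into chunks that
// each begin at a record-start line (per the supplied predicate).
class ChunkedIterator(underlying: BufferedIterator[String],
                      isRecordStart: String => Boolean)
    extends Iterator[List[String]] {
  def hasNext: Boolean = underlying.hasNext
  def next(): List[String] = {
    val b = List.newBuilder[String]
    b += underlying.next()
    while (underlying.hasNext && !isRecordStart(underlying.head))
      b += underlying.next()
    b.result()
  }
}
```

With this, Source.fromFile(f).getLines().buffered can be wrapped directly, and the resulting chunks folded into whatever aggregate you need.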
Upvotes: 1