Bernie Wong

Reputation: 701

How to split lines in a file into groups?

I am trying to find a way to read and split large log files into events in a functional way. I have an imperative version (it uses mutable state and is not composable). I looked at Best way to read lines in groups in a flat file -- unfortunately, my file does not have a defined delimiter like END, and that solution also consumes the END line.

My file looks something like this:

Nov 28, 2015 2:30:47 PM CST Info Security BEA-090905 Disabling CryptoJ JC ...
Nov 28, 2015 2:30:47 PM CST Info Security BEA-090906 Changing the default .....
2015-11-28 14:33:08,320:ERROR:[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)': [1448742788318]<?xml version="1.0" encoding="UTF-8"?>
    <Errors>
        <Error ErrorCode="INVERR01"
            ErrorDescription="SKU information missing" ErrorUniqueExceptionId="10.7.44.4914487427882870000000000001">
        <Attrib  
    ...

Some events are one line, some are stack traces, and so on. In the example above, I would like to get 3 events. Here is my working imperative code:

import scala.io.Source

var wip = false                            // are we inside an event?
val sb: StringBuilder = new StringBuilder
Source.
  fromFile(f).
  getLines.
  zipWithIndex.
  foreach {
    case (l, _) => {
      l match {
        case ln if ln.trim == "" =>       // skip blank lines

        case ln if ue.isBeginLine(ln) && wip =>   // a new event starts: flush the old one
          processEvent(sb.toString, ue)
          sb.setLength(0)
          sb.append(ln)

        case ln if ue.isBeginLine(ln) && !wip =>  // the very first event
          wip = true
          sb.append(ln)

        case ln if wip => sb.append("\n").append(ln)  // continuation line

        case ln        => log.info(">> Worker:  Rejecting: %s".format(ln))
      } // match
  }} // foreach
if (sb.nonEmpty) processEvent(sb.toString, ue)    // flush the final event

I can identify the start of an event with the method ue.isBeginLine. The following is sample code (customized for each log format) -- I will make isBeginLine more generic later:

def isBeginLine(s: String): Boolean =
  s.startsWith("2015")     ||
  s.startsWith("<Nov 28")  ||
  s.startsWith("WebLogic") ||
  s.startsWith("INFO:")    ||
  s.startsWith("WARNING:") ||
  s.startsWith("Parsing")  ||
  s.startsWith("Nov 28")

As mentioned above, I tried the following (from Best way to read lines in groups in a flat file). Unfortunately, this approach requires a defined terminator or delimiter for each event:

val i = Source.
    fromFile(f).
    getLines

def groupIterator(xs: Iterator[String]) = new Iterator[String] {
    def hasNext = xs.hasNext
    // takeWhile consumes (and discards) the line that starts the next
    // group, so each event loses its first line
    def next = xs.takeWhile(!_.startsWith("2015")).mkString("\n")
}

for (g <- groupIterator(i)) println("=======\n" + g + "\n==========")

So out of curiosity, is there a better, functional way to parse log files into events? Ideally, I would like something like the following to aggregate the events.

Source.
  fromFile(f).
  getLines.
  """splitEvents""".
  foldLeft( HashMap[String, Event]() )( .... )
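For concreteness, this is the hypothetical shape I am after (splitEvents, eventId, and Event do not exist yet; this just illustrates the pipeline):

import scala.io.Source
import scala.collection.immutable.HashMap

// All hypothetical: splitEvents would chunk lines into one string per
// event, eventId would extract a key, Event would wrap the raw text.
def splitEvents(lines: Iterator[String]): Iterator[String] = ???
def eventId(raw: String): String = ???
case class Event(raw: String)

val events: HashMap[String, Event] =
  splitEvents(Source.fromFile(f).getLines).
    foldLeft(HashMap[String, Event]()) { (acc, raw) =>
      acc + (eventId(raw) -> Event(raw))
    }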

Upvotes: 2

Views: 224

Answers (1)

Rex Kerr

Reputation: 167901

Use the same kind of approach, but use a BufferedIterator so you have access to head as a lookahead. It's a little less functional, but you can wrap it yourself to make it act functional again on the outside. The core routine could look something like:

def getNextChunk(i: BufferedIterator[String]): Option[Array[String]] =
  if (!i.hasNext) None
  else {
    val ab = Array.newBuilder[String]
    ab += i.next                        // the record-start line itself
    // peek at head; stop before consuming the next record's first line
    while (i.hasNext && !isRecordStart(i.head)) ab += i.next
    Some(ab.result)
  }

and then you just call that over and over until you hit a None. You could, for instance,

Iterator.continually(getNextChunk(i)).
  takeWhile(_.isDefined).
  map(_.get)

to get an iterator of chunks.
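For example, wired up end to end (a sketch only: isRecordStart here stands in for the question's isBeginLine, and f is the same file):

import scala.io.Source

val src = Source.fromFile(f)
try {
  val lines = src.getLines.buffered   // BufferedIterator gives us head
  val chunks = Iterator.continually(getNextChunk(lines)).
    takeWhile(_.isDefined).
    map(_.get)
  chunks.foreach(c => println("=======\n" + c.mkString("\n") + "\n=========="))
} finally src.close()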

Or you could make your own GroupedIterator off of a BufferedIterator that implements the same thing; it'll probably be a bit more efficient that way.
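A minimal sketch of that (same assumption: an isRecordStart predicate in scope; not tuned for efficiency):

class GroupedIterator(it: BufferedIterator[String]) extends Iterator[Array[String]] {
  def hasNext = it.hasNext
  def next() = {
    val ab = Array.newBuilder[String]
    ab += it.next                     // consume the record-start line
    // head lets us stop without consuming the next record's first line
    while (it.hasNext && !isRecordStart(it.head)) ab += it.next
    ab.result
  }
}

// usage: new GroupedIterator(Source.fromFile(f).getLines.buffered)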

Upvotes: 1
