Jdeboer

Reputation: 93

Format log with Spark/Scala

I have a log file containing information that I want to process with Spark. The problem is that the file isn't consistently formatted, so I'm trying to clean it up and extract only the data I need.

I noticed that most of the useful lines contain an "INFO" tag, so I decided to filter on that:

val testje = realdata.filter(line => line.contains("INFO"))

Now I want to load the resulting data into a SQLContext so I can visualize it in Zeppelin.

Here's a (very small) example of what the data looks like now:

2016-03-08 14:55:29,637 INFO [ajp-nio-8009-exec-1] n.t.f.s.FloorService [FloorService.java:281] Snoozing. Wait 569 more milliseconds. Time passed : 4431
2016-03-08 14:55:29,964 INFO [ajp-nio-8009-exec-3] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor test received update from tile: 1, data = [false, false, false, false, false, false, false, false]
2016-03-08 14:55:30,582 INFO [ajp-nio-8009-exec-2] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor test received update from tile: 1, data = [false, false, false, false, false, false, true, false]
2016-03-08 14:55:30,592 INFO [ajp-nio-8009-exec-2] n.t.f.s.FloorService [FloorService.java:284] delta time : 5387
2016-03-08 14:55:30,595 INFO [ajp-nio-8009-exec-2] n.t.f.s.ActivityService [ActivityService.java:31] Activity added for floor with id: test
2016-03-08 14:55:30,854 INFO [ajp-nio-8009-exec-4] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor test received update from tile: 1, data = [false, false, false, false, false, false, false, false]

All I really need are the date, time, tile ID and the boolean values.

Is there any way to properly format this without having to account for all the junk data?

Here's what I'm trying right now (disclaimer, I'm fairly new at this and I'm kind of winging it ^^'):

import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

val realdata = sc.textFile("/media/application.txt")

case class testClass(date: String, time: String, level: String, unknown1: String, unknownConsumer: String, unknownConsumer2: String, vloer: String, tegel: String, msg: String, bool1: String, bool2: String, bool3: String, bool4: String, bool5: String, bool6: String, bool7: String, bool8: String, batchsize: String, troepje1: String, troepje2: String)

//val testje = realdata.filter(line => line.contains("INFO"))
val mapData = realdata.map(s => s.split(" ")).filter(line => line.contains("INFO")).map(
    s => testClass(s(0),
        s(1),
        s(2),
        s(3),
        s(4), 
        s(5),
        s(6),
        s(7),
        s(8),
        s(9),
        s(10),
        s(11),
        s(12),
        s(13),
        s(14),
        s(15),
        s(16),
        s(17),
        s(18),
        s(19)
        )
    ).toDF()
    mapData.registerTempTable("test")

Upvotes: 0

Views: 1266

Answers (2)

Ramesh Maharjan

Reputation: 41987

I suggest filtering on "data" rather than "INFO", because the lines you want to split and convert into a DataFrame all contain "data".
I have modified your code slightly to fit your case class; you can adjust it further as needed:

val mapData = realdata
.filter(line => line.contains("data"))
.map(s => s.split(" ").toList)
.map(
  s => testClass(s(0),
    s(1).split(",")(0),
    s(1).split(",")(1),
    s(3),
    s(4),
    s(5),
    s(6),
    s(7),
    s(8),
    s(15),
    s(16),
    s(17),
    s(18),
    s(19),
    s(20),
    s(21),
    s(22),
    "",
    "",
    ""
  )
)
.toDF()
mapData.show(false)
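
To make the index positions easier to check, here is a minimal plain-Scala sketch (no Spark needed) of where the wanted values land after splitting a tile-update line on spaces. It assumes the floor name is a single word, as in the sample lines; `TokenLayout` and `TileUpdate` are hypothetical names used only for illustration. Note that the tile ID sits at index 12.

```scala
object TokenLayout {
  // Hypothetical holder for the four values the question asks for.
  final case class TileUpdate(date: String, time: String, tileId: String, data: Seq[Boolean])

  def parse(line: String): TileUpdate = {
    val tokens = line.split(" ")
    // tokens(1) looks like "14:55:29,964": keep the part before the comma.
    val time = tokens(1).split(",")(0)
    // tokens(12) looks like "1,": drop the trailing comma to get the tile ID.
    val tileId = tokens(12).stripSuffix(",")
    // tokens(15) to tokens(22) look like "[false," ... "false]": keep only the letters.
    val data = tokens.slice(15, 23).map(_.filter(_.isLetter).toBoolean).toSeq
    TileUpdate(tokens(0), time, tileId, data)
  }
}
```

Because the indices are fixed, this only works on lines that pass the "data" filter; any other line would need a different layout or would run out of tokens.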

Hope it helps

Upvotes: 1

Nils

Reputation: 446

I would try to do something like this:

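import java.time.Instant
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
// In newer jackson-module-scala versions this trait lives outside the `experimental` package.
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

// Groups: date, time, millis, tile ID, the whole boolean array; the remaining groups are ignored.
val regex = """^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}),(\d{0,3}) INFO .+Floor test received update from tile: (\d+), data = (\[((false|true)(, ){0,1})+\])$""".r
final case class LogLine(date: Instant, tileId: String, data: Seq[Boolean])
realdata.flatMap({
  case regex(date, time, millis, tileId, data, _*) =>
    val mapper = new ObjectMapper() with ScalaObjectMapper
    mapper.registerModule(DefaultScalaModule)

    Seq(LogLine(
      Instant.parse(s"${date}T$time.${millis}Z"),
      tileId,
      // The array text "[false, true, ...]" is valid JSON, so Jackson can parse it directly.
      mapper.readValue[Seq[Boolean]](data)
    ))
  // Lines that do not match the pattern are dropped.
  case _ => Nil
})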

The case class is nested (data is a Seq[Boolean] rather than eight separate columns), which is probably what you want in this case. You can always flatten it afterwards if you really need to.

If you want to improve performance, you can use mapPartitions instead of flatMap and reuse the ObjectMapper for every line in a partition.
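
A rough sketch of the mapPartitions idea, kept in plain Scala so it can be tried without a cluster. It is not the code above: to stay self-contained it swaps Jackson for a simple split on ", ", and a DateTimeFormatter stands in for the ObjectMapper as the object that is built once and reused. `PartitionParser`, `TileUpdate` and `parsePartition` are hypothetical names, and LocalDateTime is used on the assumption that the log carries no time zone.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object PartitionParser {
  // Hypothetical result type; the log has no zone, so LocalDateTime is assumed here.
  final case class TileUpdate(timestamp: LocalDateTime, tileId: String, data: Seq[Boolean])

  // Groups: full timestamp, tile ID, comma-separated booleans without the brackets.
  private val Line =
    """^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO .+ received update from tile: (\d+), data = \[((?:true|false)(?:, (?:true|false))*)\]$""".r

  // Iterator in, iterator out: the function shape that mapPartitions takes.
  def parsePartition(lines: Iterator[String]): Iterator[TileUpdate] = {
    // Built once per call (so once per partition), then reused for every line.
    val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS")
    lines.collect {
      case Line(ts, tileId, bools) =>
        TileUpdate(LocalDateTime.parse(ts, fmt), tileId, bools.split(", ").map(_.toBoolean).toSeq)
    }
  }
}
```

With Spark this should be usable as realdata.mapPartitions(PartitionParser.parsePartition); lines that do not match the pattern are skipped by collect, just like the `case _ => Nil` branch above.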

Upvotes: 1
