San

Reputation: 17

How to replace white space with comma in Spark (with Scala)?

I have a log file like this. I want to create a DataFrame in Scala.

2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2

I want to replace all the spaces with commas so that I can use spark.sql, but I am unable to do so.

Here is everything I tried:

  1. Tried importing it as text file first to see if there is a replaceAll method.
  2. Tried splitting on the basis of space.

Any suggestions? I went through the documentation and there is no mention of a replace function like in Pandas.

Upvotes: 1

Views: 2041

Answers (3)

stack0114106

Reputation: 8711

If you just want to split on spaces while keeping the strings within double quotes intact, you can use the Apache Commons CSV library (org.apache.commons.csv).

import org.apache.commons.csv.{CSVFormat, CSVParser}

val str = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""

// Space-delimited format that keeps double-quoted fields intact
val record = CSVParser.parse(str, CSVFormat.newFormat(' ').withQuote('"')).getRecords.get(0)
val http = record.get(11)  // quoted request line
val curl = record.get(12)  // quoted user agent
println(http)
println(curl)

Results:

GET https://www.example.com:443/ HTTP/1.1
curl/7.38.0
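
The snippet above parses a single line. To apply the same quote-aware splitting across the whole log in Spark, a minimal sketch (the file path is a placeholder) could map it over each line of an RDD:

import org.apache.commons.csv.{CSVFormat, CSVParser}
import scala.collection.JavaConverters._

// Parse every line into its space-separated, quote-aware fields (path is hypothetical)
val fields = sc.textFile("/path/to/elb.log").map { line =>
  val fmt = CSVFormat.newFormat(' ').withQuote('"')
  CSVParser.parse(line, fmt).getRecords.get(0).iterator().asScala.toSeq
}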

Upvotes: 0

Oli

Reputation: 10406

You can simply tell Spark that your delimiter is a whitespace character, like this:

val df = spark.read.option("delimiter", " ").csv("path/to/file")
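
From there, assuming the line splits into the fifteen classic ELB access-log fields, you could name the columns (the names below are illustrative, not anything Spark infers) and run the spark.sql query the question is after:

val named = df.toDF("timestamp", "elb", "client", "backend",
  "request_time", "backend_time", "response_time",
  "elb_status", "backend_status", "received_bytes", "sent_bytes",
  "request", "user_agent", "cipher", "protocol")
named.createOrReplaceTempView("elb_logs")
spark.sql("SELECT elb_status, count(*) FROM elb_logs GROUP BY elb_status").show()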

Upvotes: 1

benlaird

Reputation: 879

Since you don't have typed columns yet, I'd start with an RDD, split the text with a map, then convert it to a DataFrame with a schema. Roughly:

val rdd = sc.textFile({logline path}).map(line => line.split("\\s+"))

Then you need to turn your RDD (where each record is an array of tokens) into a DataFrame. The most robust way would be to map your arrays to Row objects, since an RDD[Row] plus a schema is what underlies a DataFrame.
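
A minimal sketch of that Row-based approach, reusing the rdd from above, with hypothetical column names, every field kept as a plain string, and only the first few columns shown:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema; extend the name list to cover all fields in the log line
val schema = StructType(Seq("datetime", "host", "ip")
  .map(name => StructField(name, StringType, nullable = true)))

// Each token array becomes a Row with as many fields as the schema declares
val rowRdd = rdd.map(tokens => Row.fromSeq(tokens.take(schema.length)))
val df = spark.createDataFrame(rowRdd, schema)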

A simpler way to get up and going, though, would be to map each array to a tuple first (createDataFrame can't consume an RDD of arrays directly) and then name the columns:

spark.createDataFrame(rdd.map(a => (a(0), a(1), a(2), ...))).toDF("datetime", "host", "ip", ...)

Upvotes: 0
