Reputation: 17
I have a log file like this. I want to create a DataFrame in Scala.
2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2
I want to replace all the spaces with commas so that I can use spark.sql, but I am unable to do so.
Here is everything I tried:
Any suggestions? I went through the documentation and there is no mention of a replace function like in Pandas.
Upvotes: 1
Views: 2041
Reputation: 8711
If you just want to split on spaces while retaining the strings within double quotes, you can use the Apache Commons CSV library.
import org.apache.commons.csv.{CSVFormat, CSVParser}

val str = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""

// Parse once: space is the delimiter, double quote is the quote character
val record = CSVParser.parse(str, CSVFormat.newFormat(' ').withQuote('"')).getRecords.get(0)
val http = record.get(11) // the quoted request field
val curl = record.get(12) // the quoted user-agent field
println(http)
println(curl)
Results:
GET https://www.example.com:443/ HTTP/1.1
curl/7.38.0
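To apply the same parse to the whole file and end up with a DataFrame, one rough sketch (the path and the column names here are assumptions) is to run the parser inside a map and name the fields you care about:
import spark.implicits._
val parsed = spark.sparkContext.textFile("path/to/file").map { line =>
  val r = CSVParser.parse(line, CSVFormat.newFormat(' ').withQuote('"')).getRecords.get(0)
  (r.get(0), r.get(1), r.get(11), r.get(12)) // timestamp, load balancer, request, user agent
}
val df = parsed.toDF("timestamp", "elb", "request", "user_agent")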
Upvotes: 0
Reputation: 10406
You can simply tell Spark that your delimiter is a whitespace, like this:
val df = spark.read.option("delimiter", " ").csv("path/to/file")
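Because the reader treats the double quote as the quote character by default, the quoted request and user-agent fields survive the space split. From there you can name the columns and query with spark.sql; here is a sketch, assuming the standard ELB access-log field names:
val logs = spark.read.option("delimiter", " ").csv("path/to/file")
  .toDF("timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time", "response_processing_time",
    "elb_status_code", "backend_status_code", "received_bytes", "sent_bytes",
    "request", "user_agent", "ssl_cipher", "ssl_protocol")
logs.createOrReplaceTempView("logs")
spark.sql("SELECT elb_status_code, count(*) FROM logs GROUP BY elb_status_code").show()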
Upvotes: 1
Reputation: 879
Since you don't have typed columns yet, I'd start with an RDD, split the text with a map, then convert to a DataFrame with a schema. Roughly:
val rdd = sc.textFile("path/to/file").map(line => line.split("\\s+"))
Then you need to turn your RDD (where each record is an array of tokens) into a DataFrame. The most robust way is to map the arrays to Row objects, since an RDD[Row] is what underlies a DataFrame.
A simpler way to get up and running, though, is to map each token array to a tuple and call toDF:
import spark.implicits._
val df = rdd.map(a => (a(0), a(1), a(2))).toDF("datetime", "host", "ip") // extend the tuple and the column names as needed
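For the Row-based route described above, here is a minimal sketch with an explicit schema; only the first three columns are shown, and the names are illustrative assumptions:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("datetime", StringType),
  StructField("host", StringType),
  StructField("ip", StringType)
))
val rowRdd = rdd.map(tokens => Row(tokens(0), tokens(1), tokens(2)))
val typedDf = spark.createDataFrame(rowRdd, schema)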
Upvotes: 0