Reputation: 417
I have data in the following format:
"header1","header2","header3",...
"value11","value12","value13",...
"value21","value22","value23",...
....
What is the best way to parse it in Scalding? I have over 50 columns altogether, but I am only interested in some of them. I tried importing it with Csv("file"), but that doesn't work.
The only solution that comes to mind is to parse it manually with TextLine and disregard the line with offset == 0. But I'm sure there must be a better solution.
Upvotes: 2
Views: 1610
Reputation: 197
It looks like you have 88 fields in your data set (well over 22), not just 1. The relevant documentation is quoted here:
What if I have more than 22 fields in my data-set?
Many of the examples (e.g. in the tutorial/ directory) show that the fields argument is specified as a Scala Tuple when reading a delimited file. However Scala Tuples are currently limited to a maximum of 22 elements. To read-in a data-set with more than 22 fields, you can use a List of Symbols as fields specifier. E.g.
val mySchema = List('first, 'last, 'phone, 'age, 'country)
val input = Csv("/path/to/file.txt", separator = ",", fields = mySchema)
// Tsv takes a path, so the sink is declared as a Tsv and passed to write directly
val output = Tsv("/path/to/out.txt")
input.read
  .project('age, 'country)
  .write(output)
Another way to specify fields is using Scala Enumerations, which is available in the develop branch (as of Apr 2, 2013), as demonstrated in Tutorial 6:
object Schema extends Enumeration {
  val first, last, phone, age, country = Value // arbitrary number of fields
}
import Schema._
Csv("tutorial/data/phones.txt", separator = " ", fields = Schema)
  .read
  .project(first, age)
  .write(Tsv("tutorial/data/output6.tsv"))
So when reading your file, supply a schema with all 88 fields, using either a List or an Enumeration as described in the quote above.
For skipping the header, you can additionally supply skipHeader = true in the Csv constructor.
Csv("tutorial/data/phones.txt", fields = Schema, skipHeader = true)
Upvotes: 1
Reputation: 417
In the end I solved it by parsing each line manually as follows:
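def tipPipe = TextLine("tip").read.mapTo('line -> ('field1, 'field5)) {
  line: String =>
    // Split on the "," sequence that separates quoted values
    val arr = line.split("\",\"")
    // The first value keeps its opening quote, so strip it; short rows fall back to "unknown"
    (arr(0).replace("\"", ""), if (arr.size >= 88) arr(4) else "unknown")
}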
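For anyone who wants to test the splitting logic outside of a job, here is a minimal plain-Scala sketch of the same idea. QuotedCsvLine, parse and select are hypothetical names of my own, not part of Scalding. It assumes every value is wrapped in double quotes and that no value contains the "," sequence itself.

```scala
// Hypothetical helper, not part of Scalding: splits one fully quoted CSV line
// and picks columns by index. Assumes values never contain the "," sequence.
object QuotedCsvLine {
  // Drop the outer quotes, then split on the "," separator between values.
  // The -1 limit keeps trailing empty values instead of discarding them.
  def parse(line: String): Array[String] = {
    val trimmed = line.trim
    if (trimmed.isEmpty) Array.empty[String]
    else trimmed.stripPrefix("\"").stripSuffix("\"").split("\",\"", -1)
  }

  // Return the requested columns, using "unknown" for any index the row lacks.
  def select(line: String, indices: Seq[Int]): Seq[String] = {
    val cells = parse(line)
    indices.map(i => if (i >= 0 && i < cells.length) cells(i) else "unknown")
  }
}
```

Inside the mapTo block above, something like QuotedCsvLine.select(line, Seq(0, 4)) could stand in for the inline split. The header line would still have to be dropped separately.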
Upvotes: 1