Reputation: 741
I am using Spark with Scala to load files from S3. My files are located under:
s3://bucket/yyyy/mm/dd/HH/parts...files
I need to generate the file paths from a startDate and an endDate (both strings):
import org.joda.time.{DateTime, DateTimeZone}
import org.joda.time.Days
import org.joda.time.DurationFieldType
import org.joda.time.LocalDate
import org.joda.time.format.DateTimeFormat
import org.joda.time.format.DateTimeFormatter
val startDate = "2016-09-25T04:00:00Z"
val endDate = "2016-10-23T04:00:00Z"
val s3Bucket = "s3://test_bucket/"
def getUtilDate(timestamp: String): java.sql.Date = new java.sql.Date(new DateTime(timestamp, DateTimeZone.UTC).toDate().getTime())
val start = new LocalDate(getUtilDate(startDate))
val end = new LocalDate(getUtilDate(endDate))
val days: Int = Days.daysBetween(start, end).getDays
val files: Seq[String] = (0 to days)
  .map(start.plusDays)
  .map(d => s"$s3Bucket${DateTimeFormat.forPattern("yyyy/MM/dd/HH").print(d)}/*")
val testFiles = sc.textFile(files.mkString(","), 20000)
val df = sqlContext.read.json(testFiles)
I go through sc.textFile first, since sqlContext.read.json() doesn't take multiple paths.
But this doesn't give the HH. It shows as s3://test_bucket/2016/09/26/��/*
Can someone tell me why the HH shows as ��? Is there any way I could get all the hours between two timestamps, i.e. between "2016-09-25T04:00:00Z" and "2016-10-23T04:00:00Z", like
s3://test_bucket/2016/09/25/04/*
...
s3://test_bucket/2016/10/23/04/*
Upvotes: 1
Views: 908
Reputation: 15464
You have used LocalDate, which is a date-only class: it explicitly does not contain time information (unlike java.sql.Date, which contains time and date info). Therefore Joda cannot render the "HH" as an hour, because it does not have that info.
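A quick way to see the difference is to ask each object whether it supports the hour-of-day field. This is only a sketch: the object name LocalDateCheck and its member names are made up for illustration, and it assumes Joda-Time is on the classpath.

```scala
import org.joda.time.{DateTime, DateTimeFieldType, DateTimeZone}
import org.joda.time.format.DateTimeFormat

// Hypothetical helper object, for illustration only.
object LocalDateCheck {
  private val fmt = DateTimeFormat.forPattern("yyyy/MM/dd/HH")

  val dateTime = new DateTime("2016-09-25T04:00:00Z", DateTimeZone.UTC)
  val dateOnly = dateTime.toLocalDate // drops the time-of-day part

  // A LocalDate has no hour field, so the "HH" in the pattern has nothing to print.
  val dateHasHour: Boolean = dateOnly.isSupported(DateTimeFieldType.hourOfDay()) // false
  // A DateTime carries the full instant, including the hour.
  val dateTimeHasHour: Boolean = dateTime.isSupported(DateTimeFieldType.hourOfDay()) // true

  val rendered: String = fmt.print(dateTime) // "2016/09/25/04"
}
```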
Try instead:
val startDate = "2016-09-25T04:00:00Z"
val endDate = "2016-10-23T04:00:00Z"
val s3Bucket = "s3://test_bucket/"
def getUtilDate(timestamp: String): org.joda.time.DateTime =
  new DateTime(timestamp, DateTimeZone.UTC)
val start = getUtilDate(startDate)
val end = getUtilDate(endDate)
val days: Int = Days.daysBetween(start, end).getDays
val files: Seq[String] = (0 to days)
  .map(start.plusDays)
  .map(d => s"$s3Bucket${DateTimeFormat.forPattern("yyyy/MM/dd/HH").print(d)}/*")
println(files)
To list each hour between the two DateTimes, you need to loop from start to end, calling plusHours each time. In most languages you'd use a "for" loop for that, but Scala doesn't have a C-style for loop. There are two main ways to do this in Scala; I've shown both below:
import scala.annotation.tailrec
import scala.collection.mutable
val startDate = "2016-09-25T04:00:00Z"
val endDate = "2016-10-23T04:00:00Z"
val s3Bucket = "s3://test_bucket/"
def getUtilDate(timestamp: String): org.joda.time.DateTime =
  new DateTime(timestamp, DateTimeZone.UTC)
val start = getUtilDate(startDate)
val end = getUtilDate(endDate)
val fmt = DateTimeFormat.forPattern("yyyy/MM/dd/HH")
def bucketName(date: DateTime): String = s"$s3Bucket${fmt.print(date)}"
// locally keeps each block from being parsed as an argument to the previous line.
locally {
  // Imperative style:
  var t = start
  val files = mutable.Buffer[String]()
  // Runs through the end hour inclusive; yields nothing if start is after end.
  while (!t.isAfter(end)) {
    files += bucketName(t)
    t = t.plusHours(1)
  }
  println(files)
}
locally {
  // Functional style:
  @tailrec
  def loop(t: DateTime, acc: Vector[String]): Vector[String] =
    if (t.isAfter(end)) acc // also stops when end is not a whole number of hours after start
    else loop(t.plusHours(1), acc :+ bucketName(t))
  val files = loop(start, Vector.empty)
  println(files)
}
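The same hour-by-hour walk can also be written with Iterator.iterate, which avoids both the mutable buffer and the explicit recursion. This is a sketch rather than part of the original answer: the object name HourlyJodaPaths and the method hourlyPaths are made up for illustration, the end hour is treated as inclusive, and Joda-Time is assumed to be on the classpath.

```scala
import org.joda.time.{DateTime, DateTimeZone}
import org.joda.time.format.DateTimeFormat

// Hypothetical helper object, for illustration only.
object HourlyJodaPaths {
  private val fmt = DateTimeFormat.forPattern("yyyy/MM/dd/HH")

  // One path per hour from start to end (inclusive); empty if start is after end.
  def hourlyPaths(bucket: String, start: DateTime, end: DateTime): Vector[String] =
    Iterator
      .iterate(start)(_.plusHours(1))     // start, start + 1h, start + 2h, ...
      .takeWhile(t => !t.isAfter(end))    // stop once we pass the end instant
      .map(t => s"$bucket${fmt.print(t)}")
      .toVector
}
```

With the timestamps from the question, HourlyJodaPaths.hourlyPaths(s3Bucket, start, end) should yield 673 entries, since the two instants are 28 days (672 hours) apart and both ends are included.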
Upvotes: 2
Reputation: 8967
You can use ChronoUnit (from java.time.temporal) to get the difference in hours between two date-times.
val hours = ChronoUnit.HOURS.between(dateTime, LocalDateTime.now())
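Building on that idea, the hour count can drive the path generation directly. The following is a minimal sketch using only java.time (no Joda); the object name HourlyPaths and the method hourlyPaths are hypothetical, and it assumes the timestamps are ISO-8601 instants such as the ones in the question and that the end hour should be included.

```scala
import java.time.{Instant, OffsetDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Hypothetical helper object, for illustration only.
object HourlyPaths {
  private val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")

  // Parses an ISO-8601 instant (e.g. "2016-09-25T04:00:00Z") as a UTC date-time.
  private def parseUtc(iso: String): OffsetDateTime =
    Instant.parse(iso).atOffset(ZoneOffset.UTC)

  // One glob path per hour from startIso to endIso (inclusive).
  def hourlyPaths(bucket: String, startIso: String, endIso: String): Seq[String] = {
    val start = parseUtc(startIso)
    val end = parseUtc(endIso)
    val hours = ChronoUnit.HOURS.between(start, end) // whole hours between the two
    // An empty range (and so an empty result) if end is before start.
    (0L to hours).map(h => s"$bucket${fmt.format(start.plusHours(h))}/*")
  }
}
```

For example, HourlyPaths.hourlyPaths("s3://test_bucket/", "2016-09-25T04:00:00Z", "2016-10-23T04:00:00Z") starts at s3://test_bucket/2016/09/25/04/* and ends at s3://test_bucket/2016/10/23/04/*, and the result can be joined with mkString(",") as in the question.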
Upvotes: 1