Reputation: 73
I'm working with Spark Streaming using Scala. I need to read a .csv file dinamically from HDFS directory with this line:
val lines = ssc.textFileStream("/user/root/")
I use the following command line to put the file into HDFS:
hdfs dfs -put ./head40k.csv
It works fine with a relatively small file. When I try with a larger one, I get this error:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/root/head800k.csv._COPYING
I can understand why, but I don't know how to fix it. I've tried this solution too:
hdfs dfs -put ./head800k.csv /user
hdfs dfs -mv /usr/head800k.csv /user/root
but my program doesn't read the file. Any ideas? Thanks in advance
PROGRAM:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.rdd.RDDFunctions._
import scala.sys.process._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import java.util.HashMap
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import StreamingContext._
object Traccia2014{
def main(args: Array[String]){
if (args.length < 2) {
System.err.println(s"""
|Usage: DirectKafkaWordCount <brokers> <test><topicRisultato>
| <brokers> is a list of one or more Kafka brokers
| <topics> is a list of one or more kafka topics to consume from
|
""".stripMargin)
System.exit(1)
}
val Array(brokers,risultato) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(5))
val lines = ssc.textFileStream("/user/root/")
//val lines= ssc.fileStream[LongWritable, Text, TextInputFormat](directory="/user/root/",
// filter = (path: org.apache.hadoop.fs.Path) => //(!path.getName.endsWith("._COPYING")),newFilesOnly = true)
//********** Definizioni Producer***********
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val slice=30
lines.foreachRDD( rdd => {
if(!rdd.isEmpty){
val min=rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
if(!min.isEmpty){
val ipDst= rdd.map(x => (((x.split(",")(0).toInt - min.toInt).toLong/slice).round*slice+" "+(x.split(",")(2)),1)).reduceByKey(_ + _)
if(!ipDst.isEmpty){
val ipSrc=rdd.map(x => (((x.split(",")(0).toInt - min.toInt).toLong/slice).round*slice+" "+(x.split(",")(1)),1)).reduceByKey(_ + _)
if(!ipSrc.isEmpty){
val Rapporto=ipSrc.leftOuterJoin(ipDst).mapValues{case (x,y) => x.asInstanceOf[Int] / y.getOrElse(1) }
val RapportoFiltrato=Rapporto.filter{case (key, value) => value > 100 }
println("###(ConsumerScala) CalcoloRapporti: ###")
Rapporto.collect().foreach(println)
val str = Rapporto.collect().mkString("\n")
println(s"###(ConsumerScala) Produco Risultato : ${str}")
val message = new ProducerRecord[String, String](risultato, null, str)
producer.send(message)
Thread.sleep(1000)
}else{
println("src vuoto")
}
}else{
println("dst vuoto")
}
}else{
println("min vuoto")
}
}else
{
println("rdd vuoto")
}
})//foreach
ssc.start()
ssc.awaitTermination()
} }
Upvotes: 0
Views: 9367
Reputation: 11577
/user/root/head800k.csv._COPYING
is a transient file that is created while the copy process is on going. Wait for the copy process to complete and you will have a fail without the _COPYING
suffix ie /user/root/head800k.csv
.
to filter these transient in your spark-streaming job you can use the fileStream
method documented here
as shown below for example
ssc.fileStream[LongWritable, Text, TextInputFormat](
directory="/user/root/",
filter = (path: org.apache.hadoop.fs.Path) => (!path.getName.endsWith("_COPYING")), // add other filters like files starting with dot etc
newFilesOnly = true)
EDIT
since you are moving your file from local filesystem to HDFS, the best solution is to move your file to a temporary staging location in the HDFS and then move them to your target directory. copying or moving within the HDFS filesystem should avoid the transient files
Upvotes: 1