Reputation: 564
Does anyone know how to read a text file in SparkR version 1.4.0? Are there any Spark packages available for that?
Upvotes: 4
Views: 5856
Reputation: 330173
Spark 1.6+
You can use the text input format to read a text file as a DataFrame:
read.df(sqlContext=sqlContext, source="text", path="README.md")
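For a quick sanity check, a minimal follow-up might look like this (the text source exposes the file as a single string column, typically named value in 1.6):
df <- read.df(sqlContext=sqlContext, source="text", path="README.md")
head(df)   # first few rows of the single string column
count(df)  # number of lines in the file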
Spark <= 1.5
Short answer is you don't. SparkR 1.4 has been almost completely stripped of the low-level API, leaving only a limited subset of DataFrame operations. As you can read on an old SparkR webpage:
As of April 2015, SparkR has been officially merged into Apache Spark and is shipping in an upcoming release (1.4). (...) Initial support for Spark in R will be focused on high level operations instead of low level ETL.
Probably the closest thing is to load text files using spark-csv:
> df <- read.df(sqlContext, "README.md", source = "com.databricks.spark.csv")
> showDF(limit(df, 5))
+--------------------+
| C0|
+--------------------+
| # Apache Spark|
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
+--------------------+
Since typical RDD operations like map, flatMap, reduce or filter are gone as well, it is probably what you want anyway.
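For instance, something like a per-line filter can be expressed through the DataFrame API instead; a rough sketch, assuming the spark-csv df from above with its default C0 column (the readme table name is just illustrative):
registerTempTable(df, "readme")
# keep only lines mentioning "Spark", roughly what filter() on an RDD would do
spark_lines <- sql(sqlContext, "SELECT C0 FROM readme WHERE C0 LIKE '%Spark%'")
showDF(limit(spark_lines, 5))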
Now, the low-level API is still there underneath, so you can always do something like the example below, but I doubt it is a good idea. SparkR developers most likely had a good reason to make it private. To quote the ::: man page:
It is typically a design mistake to use ‘:::’ in your code since the corresponding object has probably been kept internal for a good reason. Consider contacting the package maintainer if you feel the need to access the object for anything but mere inspection.
Even if you're willing to ignore good coding practices, it is most likely not worth the time. The pre-1.4 low-level API was embarrassingly slow and clumsy, and without all the goodness of the Catalyst optimizer the same most likely applies to the internal 1.4 API.
> rdd <- SparkR:::textFile(sc, 'README.md')
> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)
[[1]]
[1] 14
[[2]]
[1] 0
[[3]]
[1] 78
Note that spark-csv, unlike textFile, ignores empty lines.
Upvotes: 3
Reputation: 76
In fact, you can use the databricks/spark-csv package to handle TSV files too.
For example,
data <- read.df(sqlContext, "<path_to_tsv_file>", source = "com.databricks.spark.csv", delimiter = "\t")
A host of options is provided here - databricks-spark-csv#features
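For instance, an illustrative call combining a few of those documented options (header and inferSchema are option names from the spark-csv docs; the path placeholder is as above):
data <- read.df(sqlContext, "<path_to_tsv_file>", source = "com.databricks.spark.csv", delimiter = "\t", header = "true", inferSchema = "true")
printSchema(data)  # inspect the inferred column names and types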
Upvotes: 0
Reputation: 8395
Please follow the link http://ampcamp.berkeley.edu/5/exercises/sparkr.html
We can simply use:
textFile <- textFile(sc, "/home/cloudera/SparkR-pkg/README.md")
Looking at the SparkR code, Context.R has a textFile method, so ideally a SparkContext should expose a textFile API to create the RDD, but that is missing from the documentation.
# Create an RDD from a text file.
#
# This function reads a text file from HDFS, a local file system (available on all
# nodes), or any Hadoop-supported file system URI, and creates an
# RDD of strings from it.
#
# @param sc SparkContext to use
# @param path Path of file to read. A vector of multiple paths is allowed.
# @param minPartitions Minimum number of partitions to be created. If NULL, the default
# value is chosen based on available parallelism.
# @return RDD where each item is of type \code{character}
# @export
# @examples
#\dontrun{
# sc <- sparkR.init()
# lines <- textFile(sc, "myfile.txt")
#}
textFile <- function(sc, path, minPartitions = NULL) {
  # Allow the user to have a more flexible definition of the text file path
path <- suppressWarnings(normalizePath(path))
# Convert a string vector of paths to a string containing comma separated paths
path <- paste(path, collapse = ",")
jrdd <- callJMethod(sc, "textFile", path, getMinPartitions(sc, minPartitions))
# jrdd is of type JavaRDD[String]
RDD(jrdd, "string")
}
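Since this function is not exported from the merged SparkR package (1.4+), user code would have to reach it through :::, with the caveats from the first answer; a minimal sketch:
sc <- sparkR.init()
# textFile() is internal in SparkR 1.4+, hence the ::: access
lines <- SparkR:::textFile(sc, "/home/cloudera/SparkR-pkg/README.md")
SparkR:::take(lines, 3)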
The full source is at https://github.com/apache/spark/blob/master/R/pkg/R/context.R and the corresponding test cases are at https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_rdd.R
Upvotes: 0