Reputation: 1379
I have an avro file which I am reading as follows:
avroFile <- read.df(sqlContext, "avro", "com.databricks.spark.avro")
This file has lat/lon columns, but I am not able to plot them like a regular data frame, nor am I able to access a column using the '$' operator.
For example:
avroFile$latitude
Any help with avro files and operating on them in R is appreciated.
Upvotes: 2
Views: 3939
Reputation: 81
If you want to use ggplot2 for plotting, try ggplot2.SparkR. This package allows you to pass a SparkR DataFrame directly as input to the ggplot() function call.
https://github.com/SKKU-SKT/ggplot2.SparkR
Upvotes: 8
Reputation: 329
As zero323 mentioned, you cannot currently run R visualizations on distributed SparkR DataFrames, only on local data.frames. Here is one way to do it: make a new DataFrame with just the columns you want to plot, then collect a random sample of it to a local data.frame that you can plot from:
latlong <- select(avroFile, avroFile$latitude, avroFile$longitude)
latlongsample <- collect(sample(latlong, FALSE, .1))
plot(latlongsample)
The signature of the sample method is: sample(x, withReplacement, fraction, seed)
Upvotes: 3
Reputation: 330073
You won't be able to plot it directly. A SparkR DataFrame is not compatible with functions that expect a data.frame as input. It is not even a data structure in the strict sense, but simply a recipe for how to process the input data. It is materialized only when you execute an action.
If you want to plot it, you'll have to collect it first. Beware that this fetches all the data to the local machine, so it is typically something you want to avoid on a full data set.
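As a rough sketch of the collect-then-plot approach, assuming avroFile is the DataFrame from the question and that its columns really are named latitude and longitude, something like the following should work. The 1000-row cap is an arbitrary value chosen for illustration; it is only there to keep the collected result small.

```r
library(SparkR)

# Assumption: avroFile is the SparkR DataFrame from the question and has
# columns named "latitude" and "longitude".
coords <- select(avroFile, "latitude", "longitude")

# limit() caps the row count before anything is sent to the driver.
# 1000 is an arbitrary illustrative value, not a recommendation.
localCoords <- collect(limit(coords, 1000))

# localCoords is an ordinary local data.frame, so base graphics work on it.
plot(localCoords$longitude, localCoords$latitude,
     xlab = "longitude", ylab = "latitude")
```

Note that limit() just takes the first rows, so the plot may not be representative of the whole data set. If that matters, sample() is probably the better way to shrink the data before collecting.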
Upvotes: 4