Chris Snow

Reputation: 24626

Load data from Bluemix object store in Spark

In the notebook, there is an option for inserting code from an object store file. However, when I click on the link, it only populates a notebook cell with a set of variables, e.g.:

auth_url : https://identity.open.softlayer.com
project : object_storage_***
project_id : ****
region : dallas
user_id : *****
domain_id : *****
domain_name : *****
username : user_*****
password : *****
filename : block_1.csv
container : notebooks
tenantId : ****

How do I use this information in a spark command to load the data? Presumably something like this:

scala> val data = sc.textFile( ... )

Question: What is the exact command?

Upvotes: 0

Views: 698

Answers (3)

dman

Reputation: 321

I am using a Jupyter notebook with the Spark as a Service app in Bluemix with the Scala 2.10 kernel.

I was able to access a file stored in the Swift object store using the code below. I think this method is slightly easier, since I was able to select the file in the object store from the Jupyter notebook and use the Insert to Code function to add the code to my notebook without modification. Here is a snippet:

def setConfig(credentials : scala.collection.mutable.HashMap[String, String]) = {
  val prefix = "fs.swift.service." + credentials("name")
  // The Swift settings must go into the Hadoop configuration;
  // sc.getConf only returns a copy of the SparkConf, so changes to it are lost.
  val hconf = sc.hadoopConfiguration
  hconf.set(prefix + ".auth.url", credentials("auth_url") + "/v3/auth/tokens")
  hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
  hconf.set(prefix + ".tenant", credentials("project_id"))
  hconf.set(prefix + ".username", credentials("user_id"))
  hconf.set(prefix + ".password", credentials("password"))
  hconf.set(prefix + ".http.port", "8080")
  hconf.set(prefix + ".region", credentials("region"))
  hconf.set(prefix + ".public", "True")
}

var credentials_1 = scala.collection.mutable.HashMap[String, String](
  "auth_url"->"https://identity.open.softlayer.com",
  "project"->"objexxxxxxxxxxxxx858",
  "project_id"->"f4xxxxxxxxxxxxxxa7",
  "region"->"dallas",
  "user_id"->"e4fc7294xxxxxxx5",
  "domain_id"->"7527xxxxxxxxxxx44f",
  "domain_name"->"9xxxxx9",
  "username"->"Admin_",
  "password"->"""xxxxxxxxxxxxx""",
  "filename"->"scores.dat",
  "container"->"notebooks",
  "tenantId"->"s69xxxxxxxxxxxxx4f0"
)

credentials_1("name") = "spark"
setConfig(credentials_1)
val file = sc.textFile("swift://notebooks." + credentials_1("name") + "/" + credentials_1("filename"))
file.take(5)
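
If you want to check the property names before applying them, the same mapping can be written as a pure function that needs no SparkContext. This is only a sketch: swiftSettings and sampleCredentials are hypothetical names that I made up for illustration, and the credential values are placeholders.

```scala
// Hypothetical helper: derives the Swift settings as plain data,
// using the same keys as setConfig above, without touching the SparkContext.
def swiftSettings(name: String, credentials: Map[String, String]): Map[String, String] = {
  val prefix = "fs.swift.service." + name
  Map(
    prefix + ".auth.url" -> (credentials("auth_url") + "/v3/auth/tokens"),
    prefix + ".auth.endpoint.prefix" -> "endpoints",
    prefix + ".tenant" -> credentials("project_id"),
    prefix + ".username" -> credentials("user_id"),
    prefix + ".password" -> credentials("password"),
    prefix + ".http.port" -> "8080",
    prefix + ".region" -> credentials("region"),
    prefix + ".public" -> "True"
  )
}

// Placeholder credentials, not real values.
val sampleCredentials = Map(
  "auth_url" -> "https://identity.open.softlayer.com",
  "project_id" -> "my_project_id",
  "user_id" -> "my_user_id",
  "password" -> "my_password",
  "region" -> "dallas"
)

val settings = swiftSettings("spark", sampleCredentials)

// Print only the keys, so the password never shows up in the notebook output.
settings.keys.toSeq.sorted.foreach(println)

// In the notebook, where sc is available, the pairs could then be applied with:
// settings.foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) }
```

Keeping the derivation separate from the side effect makes it easy to spot a misspelled or malformed key before Spark tries to read the file.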

Upvotes: 2

chsh

Reputation: 2414

The object storage Insert to Code option appears to only dump a list of properties. I came up with this small Scala helper to extract the properties from the string it inserts:

import scala.collection.breakOut

val YOUR_DATASOURCE = """<<paste_your_datasource_attributes_here>>"""

def setConfig(name:String, dsConfiguration:String) : Unit = {
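    // Trailing dot so the keys come out as fs.swift.service.<name>.auth.url, etc.
    val pfx = "fs.swift.service." + name + "."
    val settings:Map[String,String] = dsConfiguration.split("\\n").
        map(l=>(l.split(":",2)(0).trim(), l.split(":",2)(1).trim()))(breakOut)

    // Use the Hadoop configuration; sc.getConf only returns a copy of the SparkConf.
    val conf = sc.hadoopConfiguration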
    conf.set(pfx + "auth.url", settings.getOrElse("auth_url",""))
    conf.set(pfx + "tenant", settings.getOrElse("tenantId", ""))
    conf.set(pfx + "username", settings.getOrElse("username", ""))
    conf.set(pfx + "password", settings.getOrElse("password", ""))
    conf.set(pfx + "apikey", settings.getOrElse("password", ""))
    conf.set(pfx + "auth.endpoint.prefix", "endpoints")
}

setConfig("spark", YOUR_DATASOURCE)

Copy this into the notebook, put your cursor between the triple quotes (""") in place of the placeholder, and then click the Insert to Code link for your file.

If it works correctly, you should then be able to build the swift URL to your file:

val file = sc.textFile("swift://notebooks.spark/TheFileYouClickedOn.txt")

In this case, notebooks is the container name, spark is the data source name (the first argument to the setConfig function), and TheFileYouClickedOn.txt is the filename I'm using.
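
To make the three parts of that URL explicit, here is a minimal sketch of a helper that assembles it. swiftUrl is a hypothetical name of my own, not something the notebook or Spark provides.

```scala
// Hypothetical helper: builds swift://<container>.<dataSourceName>/<fileName>.
def swiftUrl(container: String, dataSourceName: String, fileName: String): String =
  "swift://" + container + "." + dataSourceName + "/" + fileName

val url = swiftUrl("notebooks", "spark", "TheFileYouClickedOn.txt")
println(url) // prints swift://notebooks.spark/TheFileYouClickedOn.txt

// In the notebook, where sc is available:
// val file = sc.textFile(url)
```

The data source name has to match the name used in the fs.swift.service.<name> properties, otherwise the Swift file system will not find the credentials.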

All put together it would look something like this:

import scala.collection.breakOut

def setConfig(name:String, dsConfiguration:String) : Unit = {
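    // Trailing dot so the keys come out as fs.swift.service.<name>.auth.url, etc.
    val pfx = "fs.swift.service." + name + "."
    val settings:Map[String,String] = dsConfiguration.split("\\n").
        map(l=>(l.split(":",2)(0).trim(), l.split(":",2)(1).trim()))(breakOut)

    // Use the Hadoop configuration; sc.getConf only returns a copy of the SparkConf.
    val conf = sc.hadoopConfiguration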
    conf.set(pfx + "auth.url", settings.getOrElse("auth_url",""))
    conf.set(pfx + "tenant", settings.getOrElse("tenantId", ""))
    conf.set(pfx + "username", settings.getOrElse("username", ""))
    conf.set(pfx + "password", settings.getOrElse("password", ""))
    conf.set(pfx + "apikey", settings.getOrElse("password", ""))
    conf.set(pfx + "auth.endpoint.prefix", "endpoints")
}

val YOUR_DATASOURCE = """auth_url : https://identity.open.softlayer.com
project : object_storage_abc123
project_id : abc123abc123abc123abc123abc123
region : dallas
user_id : 123abc123abc123abc123abc123abc
domain_id : a1b2c3a1b2c3a1b2c3a1b2c3a1b2c3
domain_name : 123456
username : user_a1b2c3a1b2c3a1b2c3a1b2c3a1b2c3
password : WhateverPasswordValueGoesHere
filename : TheFileYouClickedOn.txt
container : notebooks
tenantId : a1b2c3-a1b2c3a1b2c3-a1b2c3a1b2c3
"""

setConfig("spark", YOUR_DATASOURCE)

val file = sc.textFile("swift://notebooks.spark/TheFileYouClickedOn.txt")

// Do stuff with your file.

You could also have this parse the filename and create the textFile reference for you, but I prefer to keep them separate, since you only need the one connection to the Object Store to use whichever files are located in it. It could probably also use some empty-line detection, etc., but for now I just deal with that myself.
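
For the empty-line detection, something along these lines might work. It is a sketch only: parseDataSource is a hypothetical name, and it assumes every useful line has the form "key : value", as in the dump above.

```scala
// Hypothetical helper: parses "key : value" lines into a Map,
// skipping blank lines and lines that contain no colon.
def parseDataSource(text: String): Map[String, String] =
  text.split("\\r?\\n").toSeq
    .map(_.trim)
    .filter(_.nonEmpty)
    .flatMap { line =>
      // Split on the first colon only, so URLs in the value stay intact.
      line.split(":", 2) match {
        case Array(key, value) => Some(key.trim -> value.trim)
        case _                 => None // no colon on this line, so ignore it
      }
    }
    .toMap

// Sample input with a blank line and a malformed line in it.
val sample = """auth_url : https://identity.open.softlayer.com

filename : TheFileYouClickedOn.txt
container : notebooks
this line has no colon
"""

val parsed = parseDataSource(sample)
println(parsed("auth_url")) // prints https://identity.open.softlayer.com
```

This avoids breakOut, so it should also compile on newer Scala versions where breakOut is no longer available.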

Upvotes: 2

Trent Gray-Donald

Reputation: 2346

Please see: https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html, specifically the section on "Reusing existing Object Storage...". Which version of object store are you interested in consuming from (v1, v2, v3, or SL OS)?

Upvotes: 1
