spark-singularity

by heroku


Spark Singularity

Use Spark on Heroku in a single dyno. Experiment inexpensively with Spark in the Common Runtime.

Production-quality Spark clusters may be deployed into Private Spaces using spark-in-space.

Deploy

This buildpack runs the following three processes, all children of the main web process:

  1. Nginx proxy
    • basic password authentication, set via the SPACE_PROXY_BASIC_AUTH environment variable
    • format: username:{PLAIN}password
  2. Spark master
    • web UI: https://your-spark-app.herokuapp.com/
    • REST API: https://your-spark-app.herokuapp.com/rest
  3. One Spark worker
    • web UI: https://your-spark-app.herokuapp.com/worker
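
The proxy credentials can be supplied as a config var on the app; the username and password below are illustrative only:

```shell
# Set the nginx proxy's basic-auth credentials (example values)
heroku config:set SPACE_PROXY_BASIC_AUTH='admin:{PLAIN}s3cret'
```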

🚨 This app must not be scaled beyond a single dyno: there is no coordination mechanism between multiple instances, and the worker and submitted jobs implicitly use 127.0.0.1:7077 as the Spark master.

Submitting & controlling jobs

Because Spark Singularity is contained in a single dyno with only port 80 exposed, there are two options for submitting jobs:

  1. Spark's REST API, proxied at https://your-spark-app.herokuapp.com/rest
  2. Declare the Spark jobs to submit on start-up by listing each main class name on its own line in Jobfile:
org.example.SparkWordCounter
org.example.SparkWordCluster
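
As a sketch of option 1, a job can be submitted by POSTing a CreateSubmissionRequest to the proxied REST endpoint. The credentials, jar URL, and Spark version below are placeholders, not values shipped with this buildpack:

```shell
# Submit a job through the proxied Spark standalone REST API.
# Credentials, jar location, and clientSparkVersion are illustrative.
curl -u 'admin:s3cret' \
  -H 'Content-Type: application/json' \
  -X POST https://your-spark-app.herokuapp.com/rest/v1/submissions/create \
  -d '{
    "action": "CreateSubmissionRequest",
    "appResource": "https://example.com/jars/word-counter.jar",
    "mainClass": "org.example.SparkWordCounter",
    "appArgs": [],
    "clientSparkVersion": "2.1.1",
    "environmentVariables": {},
    "sparkProperties": {
      "spark.app.name": "SparkWordCounter",
      "spark.master": "spark://127.0.0.1:7077",
      "spark.jars": "https://example.com/jars/word-counter.jar"
    }
  }'
```

The response includes a submission ID, which can then be polled at /rest/v1/submissions/status/<submission-id>.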

Source deploy

heroku create
heroku addons:create bucketeer --as SPARK_S3
heroku buildpacks:add -i 1 https://github.com/heroku/heroku-buildpack-space-proxy.git
heroku buildpacks:add -i 2 heroku/scala
heroku buildpacks:add -i 3 https://github.com/heroku/spark-in-space.git
heroku buildpacks:add -i 4 https://github.com/dpiddy/heroku-buildpack-runit.git
heroku buildpacks:add -i 5 https://github.com/kr/heroku-buildpack-inline.git
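
With the buildpacks configured, the app deploys like any other Heroku source deploy (your branch name may differ):

```shell
git push heroku master
heroku ps      # confirm a single web dyno is running
heroku open    # opens the proxied Spark master web UI
```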

Sample import & query

These processes will run out of memory without large dynos (Performance-L, 14 GB RAM).

heroku scale web=0
heroku run bin/spark-local-job spark.in.space.Import -s Performance-L
heroku scale web=1:Performance-L
heroku logs -t
# Once complete, scale back down to avoid ongoing Performance-L dyno charges
heroku scale web=0:Standard-1x