predictionio-buildpack

by takahito-miyamoto

Heroku buildpack for PredictionIO

Enables data scientists and developers to deploy custom machine learning services created with PredictionIO.

This buildpack is part of an exploration into utilizing the Heroku developer experience to simplify data science operations. When considering this proof-of-concept technology, please note its current limitations. We'd love to hear from you. Open issues on this repo with feedback and questions.

Engines

Supports engines created for PredictionIO 0.10.0-incubating.

🐸 How to deploy a template or custom engine

Architecture

This buildpack transforms the Scala source-code of a PredictionIO engine into a Heroku app.

Figure: Diagram of Deployment to Heroku Common Runtime

Event data can be stored in either:

  • PredictionIO Eventserver backed by Heroku PostgreSQL
    • directly compatible with most engine templates
  • custom data store such as Heroku Connect with PostgreSQL or RDD/DataFrames stored in HDFS
    • requires a custom implementation of DataSource.scala

Limitations

Memory

This buildpack automatically trains the predictive model during the release phase, which runs in a one-off dyno. That dyno's memory capacity is currently the limiting factor: only Performance dynos, with 2.5 GB or 14 GB of RAM, provide reasonable utility.

This limitation can be worked around by pointing the engine at an existing Spark cluster. See: customizing environment variables, PIO_SPARK_OPTS & PIO_TRAIN_SPARK_OPTS.

Private Network

This limitation comes from the underlying Spark service, not from PredictionIO itself. Spark clusters require a private network, so they cannot be deployed in the Common Runtime.

To operate in the Common Runtime, this buildpack executes Spark as a sub-process (i.e. --master local) within one-off and web dynos.

This buildpack also supports executing jobs on an existing Spark cluster. See: customizing environment variables, PIO_SPARK_OPTS & PIO_TRAIN_SPARK_OPTS.
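As a rough sketch of what pointing at an existing cluster might look like, the script below composes a `heroku config:set` command for both variables. It assumes the variables carry spark-submit style options (in line with the `--master local` default mentioned above). The master URL and app name are hypothetical placeholders, not values from this project. The command is printed for review rather than executed.

```shell
#!/bin/sh
# Hypothetical values: replace with your own Spark master address and Heroku app name.
SPARK_MASTER_URL="spark://spark-master.example.com:7077"
HEROKU_APP="my-pio-engine"

# Assumption: both variables hold spark-submit style options.
# PIO_SPARK_OPTS applies to the deployed engine, PIO_TRAIN_SPARK_OPTS to training.
PIO_SPARK_OPTS="--master ${SPARK_MASTER_URL}"
PIO_TRAIN_SPARK_OPTS="--master ${SPARK_MASTER_URL}"

# Compose the command as text so it can be inspected before running it by hand.
CONFIG_CMD="heroku config:set PIO_SPARK_OPTS='${PIO_SPARK_OPTS}' PIO_TRAIN_SPARK_OPTS='${PIO_TRAIN_SPARK_OPTS}' --app ${HEROKU_APP}"
echo "$CONFIG_CMD"
```

Training and serving may need different resources, so the two values can be set independently; they are identical here only to keep the sketch short.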

Additional Service Dependencies

Engines may require Elasticsearch [ES] which is not currently supported on Heroku (see this pull request).

Heroku Postgres is the default storage repository, so this does not affect many engines.

There is work underway in the PredictionIO project to support ES by upgrading to ES 5.x and migrating to a pure-REST interface.

Stateless Builds

PredictionIO 0.10.0-incubating requires a database connection during the build phase. While this works fine in the Common Runtime, it is not compatible with Private Databases.

There is work underway in the PredictionIO project to solve this problem by making pio build a stateless command. This upcoming feature is verified by the "compile with 0.11.0-SNAPSHOT" test.

Config Files

PredictionIO engine templates typically have some configuration values stored alongside the source code in engine.json. Some of these values may vary between deployments, such as in a pipeline, where the same slug will be used to connect to different databases for Review Apps, Staging, & Production.

Heroku config vars solve many of the problems associated with these committed configuration files. When using a template or implementing a custom engine, the developer may migrate the engine to read the environment variables instead of the default file-based config, e.g. sys.env("PIO_EVENTSERVER_APP_NAME").
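To illustrate how one slug can read a different value in each pipeline stage, the sketch below renders a `heroku config:set` command per stage. The Heroku app names and Eventserver app names are hypothetical; only `PIO_EVENTSERVER_APP_NAME` comes from the text above. The commands are printed rather than executed.

```shell
#!/bin/sh
# Render (but do not run) the command that sets the Eventserver app name
# for one Heroku app. $1 = Heroku app name, $2 = PIO_EVENTSERVER_APP_NAME value.
render_config_cmd() {
  printf 'heroku config:set PIO_EVENTSERVER_APP_NAME=%s --app %s\n' "$2" "$1"
}

# Hypothetical pipeline stages, each pointing at its own event data.
STAGING_CMD=$(render_config_cmd my-engine-staging classifier-staging)
PRODUCTION_CMD=$(render_config_cmd my-engine-production classifier-production)

echo "$STAGING_CMD"
echo "$PRODUCTION_CMD"
```

With values like these in place, an engine that reads the environment variable picks up the right setting in each stage without any change to the committed engine.json.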

Testing

Buildpack

Tests covering this buildpack's build and release functionality are implemented with heroku-buildpack-testrunner. Engine test cases are staged in test/fixtures/.

Set up testrunner with Docker, then run tests with:

docker-compose -p pio -f test/docker-compose.yml run testrunner

Individual Apps

Engines deployed as Heroku apps may automatically run their sbt test suite using Heroku CI (beta):

Heroku CI automatically runs tests for every subsequent push to your GitHub repository. Any push to any branch triggers a test run, including a push to master. This means that all GitHub pull requests are automatically tested, along with any merges to master.

Test runs are executed inside an ephemeral Heroku app that is provisioned for the test run. The app is destroyed when the run completes.