My Docker Image for Machine Learning with Spark (Python, Scala) & Jupyter Notebook

If you are a data scientist (or aspiring to become one) and want to do machine learning at scale on a big data platform like Apache Spark, you need a development or learning environment to work in. Specifically, if you want to use Python with Apache Spark (a.k.a. PySpark) together with Spark's machine learning library (called Spark ML in Apache Spark 2.x versions, previously Spark MLlib), one of the challenges aspiring data scientists and machine learning engineers face is provisioning a platform where they can practice and learn these tools of the trade. This is exactly the pain point I set out to address by launching my Docker image.

 

Image: Jupyter Notebook running in my Docker container. Want to use Apache Spark in Jupyter Notebook for machine learning without any hassle? Read on!

Why do you need such a Docker Image for Machine Learning with Apache Spark?

First, a bit of context. If you want to learn machine learning using Apache Spark, don't have an environment available, and don't want to bear the OPEX of setting up a platform on an elastic cloud such as Azure, AWS, or GCP, you generally have the following options:

  1. You set up and configure Hadoop, Apache Spark, and all of their dependencies on your own. If you can do this, it is quite commendable, but it is understandably time consuming.
  2. You resort to the sandbox environments made available by commercial Hadoop vendors like Cloudera and Hortonworks. Trust me, those sandboxes, available as virtual machines or even as Docker images, are great in that they come with many Hadoop ecosystem services pre-configured, but they have some challenges. Firstly, they require a lot of resources: both Cloudera and Hortonworks state on their websites that you need at least 8 GB of RAM to use their environments properly. The Cloudera QuickStart VMs don't come with Apache Spark 2, and installing it the recommended way requires Cloudera Manager, which needs more than 8 GB of RAM; the Cloudera Manager version bundled with QuickStart isn't a recent one either, so you will face a lot of challenges installing Apache Spark on it. The Hortonworks Sandbox Docker container, on the other hand, ended up consuming 21 GB of disk space on my laptop, not to mention that it also required 8+ GB of RAM.
  3. Plus, none of the sandbox environments from commercial Hadoop vendors come with the tool-set that we machine learning engineers actually use: we need Anaconda, the amazing Jupyter notebooks, and Python packages like NumPy and pandas to work on machine learning problems. And we can't live without Git, as we need to manage our source code properly.

With that being said, if you want to learn machine learning using Apache Spark 2 along with Jupyter notebooks, the essential machine-learning Python packages, and Git, look no further and use my recently launched Docker image for free. I use that Docker image myself, and I believe you will find it valuable in your learning journey as well. Furthermore, I intend to use this Docker image in my upcoming Udemy course about machine learning on Apache Spark 2.

How to use my Docker Image for Machine Learning on Apache Spark 2?

Though all the instructions are explained in detail on the Docker Hub page, I am reproducing them here for everyone's convenience.

  1. Firstly, install Docker on your operating system. For Windows users, instructions can be found here
  2. Once installed, pull the image from Docker Hub using:
    docker pull irfanelahids/spark2ml
  3. After pulling the image, run it using the following:
    docker run -it -p 8085:8085 irfanelahids/spark2ml
  4. Once the container is running, issue the following command inside it to start Jupyter Notebook:
    /root/anaconda2/bin/jupyter notebook --port=8085 --ip=0.0.0.0 --allow-root
  5. Once you do that, Jupyter will display a URL with a login token, which will look something like this:
    http://8883b18af5dc:8085/?token=b6791d47b3f47e8c0d41c8fe54e2c02d30d0ccee84c7867e
    Copy that URL and open it in a web browser on your host system (e.g. Google Chrome), replacing the section between http:// and the colon (the container hostname, 8883b18af5dc here) with the host where the container is running. If you are running it on your own machine, just use:
    http://localhost:8085/?token=b6791d47b3f47e8c0d41c8fe54e2c02d30d0ccee84c7867e
    and it will open Jupyter Notebook.
  6. To initialise a Spark context, type this in a Jupyter Notebook cell (see the sketch after this list for a fuller machine learning example):
    # Make the Spark installation visible to Python, then create a context
    import findspark
    findspark.init()

    import pyspark
    sc = pyspark.SparkContext(appName="Name of your application")
  7. Additionally, you can:
    1. Launch the PySpark shell via the pyspark command
    2. Launch the Scala Spark shell via the spark-shell command
    3. Launch the Python shell via the python command
    4. Use the Git client via the usual git commands
    5. Install additional Python modules using pip
      and much more.
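
To sanity-check the machine learning stack end to end, here is a minimal sketch you can paste into a notebook cell once the container is up. This is my own illustrative example rather than part of the image's documentation; it assumes the image's Spark 2.x install ships with the standard pyspark.ml package:

    import findspark
    findspark.init()  # make the Spark installation visible to Python

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    # SparkSession is the Spark 2.x entry point; it wraps a SparkContext
    spark = SparkSession.builder.appName("SparkMLQuickCheck").getOrCreate()

    # A tiny toy dataset: a label column plus a 2-dimensional feature vector
    train = spark.createDataFrame([
        (0.0, Vectors.dense([0.0, 1.1])),
        (1.0, Vectors.dense([2.0, 1.0])),
        (0.0, Vectors.dense([0.1, 1.2])),
        (1.0, Vectors.dense([1.9, 0.8])),
    ], ["label", "features"])

    # Fit a logistic regression model and print what it learned
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(train)
    print(model.coefficients, model.intercept)

    spark.stop()

If this prints a coefficient vector and an intercept, then Anaconda, Jupyter, findspark, and Spark ML are all wired up correctly.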

For more details about what's included in this Docker image, along with version information, go to the following Docker Hub link

Do give it a try and share your feedback with me. Stay connected.