How to run Apache Spark on a Windows Machine (using Scala/SBT)

If you have any affinity with the big data community, you have almost certainly heard of Apache Spark. Apache Spark is touted as one of the most advanced, performant, distributed, in-memory computation engines, and it is heavily used for a range of use-cases from batch ETL processing, machine learning and graph processing to streaming analytics. Typically it runs on top of the Hadoop ecosystem and is configured to leverage Hadoop components like YARN (for resource management), HDFS (to read/write from Hadoop's distributed and scalable file system) and Hadoop's Input Formats (TextInputFormat, KeyValueTextInputFormat etc.). And Hadoop itself is typically deployed in a multi-node cluster environment.

With that said, if you want to play with Spark, maybe for experimentation, learning or prototyping, this inherent distributed-infrastructure requirement poses a barrier. Yes, we can rent EC2 instances on AWS or VMs on Azure and spin up a Spark cluster, and if you are up for paying the recurring cost of those resources then you're all good. However, if you are someone using a laptop or desktop running Windows and don't want to opt for virtualized or containerized solutions, then this post may be helpful for you. In it, you will find steps to "install" Spark on your Windows machine so you can conveniently use it, mainly for experimentation or prototyping purposes.

Requirements:

  • A computer system running Windows (any recent version like 7, 10 or Server 2012 will work, provided you can install the JDK and SBT on it)
  • Oracle Java Development Kit (JDK) (I tested with Oracle JDK 1.8.0_101. Try to avoid OpenJDK)
  • SBT (the Scala build tool that you'll use to resolve dependencies from Maven repositories. I tested with 0.13.15)
  • Scala (I tested with 2.12.1)

Installing the above-mentioned requirements should be a no-brainer. Once installed, confirm each installation by issuing the following commands in Windows CMD:

java -version (for java)

sbt (if you get the console, you're good to go)

scala -version (for scala)

Now, with the prerequisites catered for, the next steps are:

Creating a Maven-compliant folder structure:

Go to any directory on your Windows machine and create a folder structure like this:

(e.g. in C:\Users\irfanelahi\)

create a folder with any name (let's call it ie_spark)

and within that, create sub-folders as:

src\main\scala

so the directory structure looks something like this:

C:\Users\irfanelahi\ie_spark\src\main\scala

There is a lot of rationale behind this folder structure, which would require a detour into how Maven works. Typically there are other folders as well, like lib and test, but we are ignoring them for our specific task.
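
If you prefer doing this from CMD, the whole structure can be created in one go (assuming command extensions are enabled, which is the CMD default, mkdir creates the intermediate folders for you):

cd C:\Users\irfanelahi
mkdir ie_spark\src\main\scala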

Now, in C:\Users\irfanelahi\ie_spark

create a file with the name build.sbt (using Notepad, Sublime or whatever you prefer)

and within that, enter the following:

name := "ie_spark_example"

version := "0.1.1"

scalaVersion := "2.10.6"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.2"

Save it and close it.

What you are basically doing is specifying instructions for SBT to resolve the library dependencies that you mention in build.sbt. When you "compile" Scala code using SBT, it looks into build.sbt to see what dependencies are needed (in simple terms, what libraries or packages your program may require), goes to the specified repositories (you can specify your own as well) and downloads them so that they are there when your program needs them. If you are familiar with other build tools like Maven, it's similar to what you do in Maven's pom.xml.

So basically you are saying that the Scala program you're going to write will require the Apache Spark core libraries, and SBT will go to the Maven repository to grab that library along with all of its dependencies. I've also pinned the Scala version to 2.10.6 because the spark-core_2.10 artifact is built against Scala 2.10, and my installed version (2.12.1) isn't binary compatible with it.
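
As a side note, SBT's %% operator appends the Scala binary version to the artifact name for you, so given scalaVersion := "2.10.6", the following line is equivalent to the one above:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" // resolves to spark-core_2.10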

Now, with the build.sbt file created, navigate to C:\Users\irfanelahi\ie_spark (the folder containing build.sbt) using CMD (with the cd command, of course) and then issue:

sbt console

Once issued, SBT will go through the whole process of resolving the dependencies and downloading them. This may take some time. Once done, it will launch the Scala REPL (a fancy name for its interactive shell).

If you have some familiarity with Spark programming, then you may already know the foundations: to use Spark from the Scala (or Python/Java) API, you first create a SparkConf object in which you also specify the Spark master. As you are executing it on your lonely laptop, you will specify "local" mode here. There are other modes, and you can also specify how many cores to use, but that deserves a separate discussion.
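
For reference, here are a few common master strings for running locally (these setMaster values are standard Spark options):

new org.apache.spark.SparkConf().setMaster("local")    // a single worker thread
new org.apache.spark.SparkConf().setMaster("local[4]") // 4 worker threads
new org.apache.spark.SparkConf().setMaster("local[*]") // one worker thread per logical core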

Thus, within the Scala interpreter aka REPL:

val conf = new org.apache.spark.SparkConf().setAppName("IE SPARK").setMaster("local")

The next step is to create a SparkContext (the entry point to a Spark cluster and its APIs) using the conf object:

val sc = new org.apache.spark.SparkContext(conf)

If you've followed the steps correctly, the SparkContext will be up and running and you can use it for your experimentation as if you were running Spark on a distributed cluster:

e.g.

sc.parallelize((1 to 100).toList).filter(x => x % 2 == 0).collect // distribute a Scala collection as an RDD and apply a filter transformation to keep only the even numbers
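
Another quick sanity check is to inspect the running context itself; version and master are standard SparkContext fields:

sc.version // the Spark version string, e.g. "1.6.2"
sc.master  // the master string you set, i.e. "local"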

You can also access files on your local machine with sc.textFile:

val rdd = sc.textFile("C:\\Users\\irfanelahi\\my_file.txt")
val rdd_kv = rdd.map(x => x.split(",")).map(x => (x(0), 1)).reduceByKey(_ + _) // count occurrences of the first comma-separated field
rdd_kv.collect.foreach(println)
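
When you are done experimenting, it's good practice to shut the context down cleanly (sc.stop is a standard SparkContext method), and then type :quit to leave the REPL and return to CMD:

sc.stop() // releases the resources held by the SparkContext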

Now, with Spark running on your Windows machine without any recurring cost (as with AWS/Azure), the only thing required is the commitment to learn and claim excellence.

Happy learning and till next time!