Set Up a Spark Development Environment with IntelliJ and Maven

Spark Development in IntelliJ using Maven

This tutorial walks you through setting up, compiling, and running a simple Spark application from scratch. It assumes you have IntelliJ, the IntelliJ Scala plugin, and Maven installed.


At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
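To make this concrete, here is a minimal sketch of the RDD operations described above. It is not part of this tutorial's project, and the HDFS path is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddSketch").setMaster("local[2]"))

    // An RDD built from an existing Scala collection in the driver program
    val numbers = sc.parallelize(1 to 100)

    // An RDD built from a file in a Hadoop-supported file system
    // (lazy: nothing is read until an action runs on it)
    val lines = sc.textFile("hdfs:///data/sample.txt")

    // Transform the RDD and persist the result in memory so it can be reused
    val squares = numbers.map(n => n * n).persist(StorageLevel.MEMORY_ONLY)

    println(squares.sum())   // first action computes and caches the partitions
    println(squares.count()) // second action reuses the cached partitions

    sc.stop()
  }
}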

A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums. (Spark Programming Guide)
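And a minimal sketch of the two kinds of shared variables; the lookup table and keys are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SharedVariablesSketch").setMaster("local[2]"))

    // Broadcast variable: a read-only value cached once on every node
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    // Accumulator: tasks only add to it; the driver reads the final total
    val misses = sc.longAccumulator("misses")

    val keys = sc.parallelize(Seq("a", "b", "c", "a"))
    val values = keys.map(k => lookup.value.getOrElse(k, { misses.add(1); 0 }))

    println(values.collect().mkString(","))  // 1,2,0,1
    println(misses.value)                    // 1 (one key was not in the lookup)
    sc.stop()
  }
}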

IntelliJ

Open IntelliJ

Choose: File > New > Project > Maven [Next]

GroupId: machinecreek

ArtifactId: sparkpoc [Next]

Project name: poc [Finish]

When the "Maven projects need to be imported" notification appears, select Enable Auto-Import

Right-click the src/main/java folder

Refactor > Rename: scala

Right-click the src/test/java folder

Refactor > Rename: scala

Open the pom.xml file and paste the following under the groupId, artifactId, and version:

   <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <scala.minor.version>2.11</scala.minor.version>
        <scala.complete.version>${scala.minor.version}.7</scala.complete.version>
        <spark.version>2.1.1</spark.version>
    </properties>

    <dependencies>

        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.complete.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.minor.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
  
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>

            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

        </plugins>
    </build>

Right-click the src/main/scala folder and choose New > Scala Class

Name: AppPoc

Kind: Object [OK]

Spark Context

Prior to Spark 2.0.0, the SparkConf and SparkContext objects were the entry point into a Spark application. The SparkContext provided core functionality such as creating RDDs, reading text files, and parallelizing collections. However, to use additional APIs like Spark Streaming or Hive you had to create an individual context for each one from the original SparkContext (e.g. val myHiveContext = new HiveContext(mySparkContext)).
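For reference, here is a minimal sketch of that pre-2.0 pattern. It assumes the matching spark-sql and spark-streaming artifacts are on the classpath, which this tutorial's pom does not add:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OldEntryPoints {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OldEntryPoints").setMaster("local[2]"))

    // Each additional API needed its own context built on top of the SparkContext
    val sqlContext = new SQLContext(sc)                          // Spark SQL (deprecated since 2.0)
    val streamingContext = new StreamingContext(sc, Seconds(1))  // Spark Streaming
    // val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)  // needs spark-hive

    // ... use sqlContext / streamingContext here ...

    sc.stop()
  }
}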

Open AppPoc.scala and paste the following:

package org.machinecreek

import org.apache.spark._

object AppPoc {

    def main(args : Array[String]) {
      println("Hello from App")

      // Run locally with two worker threads
      val conf = new SparkConf()
        .setAppName("Testing POC Intellij Maven")
        .setMaster("local[2]")

      val sc = new SparkContext(conf)

      // Build an RDD from a local collection, then take a random sample
      // of roughly half of its elements (without replacement)
      val col = sc.parallelize(5 to 200 by 5)
      val smp = col.sample(withReplacement = false, fraction = 0.5)
      val colCount = col.count
      val smpCount = smp.count

      println("original number = " + colCount)
      println("sampled number = " + smpCount)
      println("Bye from this App")

      sc.stop()
    }
}

Right-click AppPoc > Compile

Right-click AppPoc > Run 'AppPoc'

Review results

Run the Spark application from the command line:

cd ../poc
mvn clean package
cd target
spark-submit --master local --class org.machinecreek.AppPoc sparkpoc-1.0-SNAPSHOT.jar

Review results. Compare to IntelliJ results above.

Spark Session

As of Spark 2.0.0 a new entry point is available: the SparkSession. It contains all of the original SparkContext functionality and also exposes the newer APIs, such as Spark SQL and Hive support, from a single object. It requires an additional dependency for the necessary import (i.e. import org.apache.spark.sql._), so add spark-sql to the pom.xml:

  
    <dependencies>
       ...

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.minor.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
     
       ...
    </dependencies>
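With that dependency in place, a single SparkSession covers both the old SparkContext functionality and the SQL APIs. Here is a minimal sketch; the table and column names are made up:

import org.apache.spark.sql.SparkSession

object SessionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SessionSketch")
      .master("local[2]")
      .getOrCreate()

    // The original SparkContext is still reachable through the session
    val rdd = spark.sparkContext.parallelize(Seq(("Alice", 29), ("Bob", 31)))

    // ...and the SQL API is available from the same object
    import spark.implicits._
    val df = rdd.toDF("name", "age")
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}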

Open AppPoc.scala and replace the previous code with:

package org.machinecreek

import org.apache.spark.sql._

object AppPoc {

  def main(args: Array[String]) {
    println("Hello from App")

    // One entry point: the SparkSession wraps the SparkContext and the SQL APIs
    val spark = SparkSession
      .builder()
      .appName("AppPoc")
      .master("local")
      .getOrCreate()

    // Read a headered CSV file from the working directory into a DataFrame
    val df = spark.read
      .option("header", "true")
      .csv("people.csv")
    df.show()

    spark.stop()
  }
}
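Note that the code above reads people.csv relative to the working directory, which is the project root when running from IntelliJ. The file's contents are up to you; a minimal example with a header row might look like:

name,age,city
Alice,29,Denver
Bob,31,Austin

When running with spark-submit later, the relative path is resolved from whichever directory you launch in, so either copy people.csv there or pass an absolute path instead.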

Right-click AppPoc > Compile

Right-click AppPoc > Run 'AppPoc'

Review results

Run the Spark application from the command line:

cd ../poc
mvn clean package
cd target
spark-submit --master local --class org.machinecreek.AppPoc sparkpoc-1.0-SNAPSHOT.jar

Review results. Compare to IntelliJ results above.