Spark Development in IntelliJ using Maven
This tutorial will guide you through the setup, compilation, and running of a simple Spark application from scratch. It assumes you have IntelliJ, the IntelliJ Scala plugin, and Maven installed.
From the Spark Programming Guide:
At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only "added" to, such as counters and sums.
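The examples in this tutorial only exercise the RDD API, but as a quick orientation, a minimal sketch of all three ideas (an RDD, a broadcast variable, and an accumulator) in the same local-mode style used below might look like the following. The object and variable names here are illustrative only and are not part of the project you are about to build.

import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))

    // RDD: a local Scala collection partitioned across the (local) "cluster"
    val numbers = sc.parallelize(1 to 100)

    // Broadcast variable: a read-only value cached once per node instead of shipped with every task
    val factor = sc.broadcast(10)

    // Accumulator: tasks can only add to it; the driver reads the total
    val evens = sc.longAccumulator("even count")

    val scaled = numbers.map { n =>
      if (n % 2 == 0) evens.add(1)
      n * factor.value
    }

    // The action below triggers the computation and the accumulator updates
    println("sum of scaled values = " + scaled.sum())
    println("even values seen     = " + evens.value)
    sc.stop()
  }
}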
IntelliJ
Open IntelliJ
Choose: File > New > Project > Maven [Next]
GroupId: machinecreek
ArtifactId: sparkpoc [Next]
Project name: poc [Finish]
When the "Maven projects need to be imported" dialog box appears, select Enable Auto-Import
Right-click the src/main/java folder
Refactor > Rename: scala
Right-click the src/test/java folder
Refactor > Rename: scala
Open the pom.xml file and paste the following under the groupId, artifactId, and version:
<properties>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.minor.version>2.11</scala.minor.version>
    <scala.complete.version>${scala.minor.version}.7</scala.complete.version>
    <spark.version>2.1.1</spark.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.complete.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.minor.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>

<build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.1.1</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Right-click src/main/scala and choose New > Scala Class
Name: AppPoc.scala
Kind: Object [OK]
Spark Context
Prior to Spark 2.0.0, the SparkConf and SparkContext objects were the point of entry into a Spark application. The SparkContext provided core functionality such as creating RDDs, reading text files, and parallelizing collections. However, to use additional APIs such as Spark Streaming or Hive, you had to create a separate context for each on top of the original SparkContext (e.g. val myHiveContext = new HiveContext(mySparkContext)).
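For illustration only (this is not part of the project built in this tutorial), that pre-2.0 pattern looked roughly like the sketch below. It assumes the spark-sql and spark-streaming artifacts are also on the classpath, which the pom above does not include; a HiveContext from spark-hive would be created the same way.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Pre20Contexts {
  def main(args: Array[String]) {
    // One SparkContext per application...
    val conf = new SparkConf().setAppName("Pre-2.0 contexts").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // ...plus a separate context wrapping it for each additional API
    val sqlContext = new SQLContext(sc)             // Spark SQL
    val ssc = new StreamingContext(sc, Seconds(10)) // Spark Streaming

    sc.stop()
  }
}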
Open AppPoc.scala and paste the following:
package org.machinecreek

import org.apache.spark._

object AppPoc {
  def main(args: Array[String]) {
    println("Hello from App")

    // Configure and create the SparkContext
    val conf = new SparkConf()
      .setAppName("Testing POC Intellij Maven")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Build an RDD from a local Scala range and sample it with replacement
    val col = sc.parallelize(5 to 200 by 5)
    val smp = col.sample(true, 5)

    val colCount = col.count
    val smpCount = smp.count
    println("original number = " + colCount)
    println("sampled number = " + smpCount)

    println("Bye from this App")
  }
}
Right-click AppPoc > compile
Right-click AppPoc > run
Review results
Run the Spark application from the command line:
cd ../poc
mvn clean package
cd ../poc/target
spark-submit --master local --class org.machinecreek.AppPoc sparkpoc-1.0-SNAPSHOT.jar
Review results. Compare to IntelliJ results above.
Spark Session
As of Spark 2.0.0 a new entry point is available: SparkSession. It contains all of the original SparkContext functionality and also exposes the higher-level APIs, such as Spark SQL and Hive support, directly. Using it requires an additional dependency for the necessary import (i.e. import org.apache.spark.sql._), so add the spark-sql artifact to the pom.xml:
<dependencies>
    ...
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.minor.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    ...
</dependencies>
Open AppPoc.scala and replace the previous code with:
package org.machinecreek

import org.apache.spark.sql._

object AppPoc {
  def main(args: Array[String]) {
    println("Hello from App")

    // Build (or reuse) the SparkSession entry point
    val spark = SparkSession
      .builder()
      .appName("AppPoc")
      .master("local")
      .getOrCreate()

    // Read a CSV file; people.csv is assumed to exist in the working directory
    val df = spark.read
      .option("header", "true")
      .csv("people.csv")

    df.show()
  }
}
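Because the SparkSession wraps the original SparkContext, the RDD code from the first example is still reachable through it, and Spark SQL is available directly on the session. A minimal sketch, assuming it is placed inside main after df is created (the temp-view name people is arbitrary):

    // The RDD API from the first example is still available through the session
    val rdd = spark.sparkContext.parallelize(5 to 200 by 5)
    println("rdd count = " + rdd.count)

    // Spark SQL is available directly on the session
    df.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people").show()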
Right-click AppPoc > compile
Right-click AppPoc > run
Review results
Run the Spark application from the command line:
cd ../poc
mvn clean package
cd ../poc/target
spark-submit --master local --class org.machinecreek.AppPoc sparkpoc-1.0-SNAPSHOT.jar
Review results. Compare to IntelliJ results above.