Spark 1.5.1 + Scala 2.10.6 + IntelliJ 14.1 environment setup

The whole process:

1. Install JDK 8

Download JDK 8 from the official website and install it. Unlike on Windows, we don't need to do any extra configuration; it can be used immediately.

# test it
java -version

# it succeeds
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

2. Install Scala

Download Scala 2.10.6 from the official website. Because we will use Spark 1.5.1, the matching Scala version is 2.10.*, so we download the latest 2.10.* release. Then we unzip it into the installation folder; here I installed it in /usr/local/opt.

cd /usr/local/opt
sudo mkdir scala2.10.6  # create a new folder for Scala

sudo tar -zxvf scala-2.10.6.tgz -C /usr/local/opt/scala2.10.6  # unzip Scala into the installation folder

# test it
cd /usr/local/opt/scala2.10.6/scala-2.10.6
./bin/scala

# it succeeds
Welcome to Scala version 2.10.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66).
Type in expressions to have them evaluated.
Type :help for more information.

scala>
# press Ctrl+D to exit the Scala console
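
For a quick sanity check you can evaluate a couple of expressions at the prompt (purely illustrative; any expression will do):

scala> 1 + 1
res0: Int = 2

scala> List(1, 2, 3).map(_ * 2)
res1: List[Int] = List(2, 4, 6)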

3. Install Spark

Download Spark 1.5.1 from the official website. There are several packages; we choose the one pre-built for Hadoop 2.6 and later. As with Scala, we unzip it to /usr/local/opt.

cd /usr/local/opt
sudo mkdir spark1.5.1

sudo tar -zxvf spark-1.5.1-bin-hadoop2.6.tgz -C /usr/local/opt/spark1.5.1

# test it
cd /usr/local/opt/spark1.5.1/spark-1.5.1-bin-hadoop2.6
sudo ./bin/spark-shell

# it succeeds
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
15/11/02 15:55:50 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/11/02 15:55:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/02 15:55:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/02 15:55:57 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/11/02 15:55:57 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/11/02 15:55:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/02 15:55:59 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/02 15:55:59 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

scala>

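Before moving on, you can run a small computation against the sc (SparkContext) that the shell provides, just as a sanity check. This is an illustrative snippet; the INFO log lines around the result are omitted here:

scala> sc.parallelize(1 to 100).reduce(_ + _)
res0: Int = 5050
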
To make it easier to invoke the Spark and Scala commands, we add the installation paths to the environment variables.

sudo nano /etc/profile

# add the following at the end of the file
export SCALA_HOME=/usr/local/opt/scala2.10.6/scala-2.10.6

export SPARK_HOME=/usr/local/opt/spark1.5.1/spark-1.5.1-bin-hadoop2.6
export CLASSPATH=.:${SCALA_HOME}/lib:${SPARK_HOME}/lib
export PATH=${SCALA_HOME}/bin:${SPARK_HOME}/bin:$PATH

source /etc/profile

Then we can run the spark-shell or scala commands from any directory.

# the results will be the same as before
sudo spark-shell

scala

4. Install IntelliJ

Download IntelliJ IDEA 14.1 from the official website and install it. When we launch it for the first time, it asks us to make some initial configurations, such as choosing the UI theme. It is best to install the Scala plugin at this point; otherwise we will have to do it later.

After installation, we can create a new project to test it.

[Screenshot 1: New Project dialog]

Click Java and set up the Project SDK; the path is the JDK installation path. Then click Next, and don't select Groovy or Scala. You can write a HelloWorld program to test whether it works.

In the picture above, we can see the Scala option on the left. Click it, then click Scala to create a new Scala project. Here we still need to set up the Scala SDK. After that, you can write some Scala code to test it.
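
A minimal object like the one below (the name HelloWorld is arbitrary) is enough to confirm that the Scala SDK is wired up correctly; running it should print a greeting together with the Scala version.

object HelloWorld {
  def main(args: Array[String]): Unit = {
    // prints a greeting plus the Scala version the project compiles against
    println("Hello, world! Scala " + scala.util.Properties.versionString)
  }
}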

5. Test Spark in IntelliJ

Create a Scala project following the instructions in step 4. To use Spark, we need to add the Spark jar to the project: click File -> Project Structure -> Libraries -> +, and add spark-assembly-1.5.1-hadoop2.6.0.jar from the lib directory of the Spark installation path.

[Screenshot 2: Project Structure - Libraries]

Then we can run one of the Spark examples. Create a new Scala file and paste the code below into it.

// scalastyle:off println
package org.apache.spark.examples

import scala.math.random

import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
// scalastyle:on println

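A quick note on why this works: the points (x, y) are sampled uniformly in the square [-1, 1] x [-1, 1], and the fraction of them that lands inside the unit circle approaches the area ratio pi/4, so 4.0 * count / n is an estimate of pi. With the default n = 200000 samples, expect only a couple of correct decimal places.
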
Before running the code, we need to edit the run configuration; otherwise it will throw errors.

[Screenshot 3: Run/Debug Configurations]

In the VM options field, add the following:

-Dspark.master=local

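Alternatively, if you prefer not to touch the run configuration, you can set the master in the code itself. This is just an equivalent sketch of the SparkConf setup used in the example above (local[2] is an arbitrary choice of two local threads):

val conf = new SparkConf()
  .setAppName("Spark Pi")
  .setMaster("local[2]") // equivalent to passing -Dspark.master=local[2] as a VM option
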
Then run it.

...
...
...
15/11/02 16:37:06 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/11/02 16:37:06 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:40) finished in 0.330 s
15/11/02 16:37:06 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:40, took 0.878039 s
Pi is roughly 3.1377
15/11/02 16:37:06 INFO SparkUI: Stopped Spark web UI at http://192.168.1.103:4041
15/11/02 16:37:06 INFO DAGScheduler: Stopping DAGScheduler
15/11/02 16:37:06 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/11/02 16:37:06 INFO MemoryStore: MemoryStore cleared
15/11/02 16:37:06 INFO BlockManager: BlockManager stopped
15/11/02 16:37:06 INFO BlockManagerMaster: BlockManagerMaster stopped
15/11/02 16:37:06 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/11/02 16:37:06 INFO SparkContext: Successfully stopped SparkContext
15/11/02 16:37:06 INFO ShutdownHookManager: Shutdown hook called
15/11/02 16:37:06 INFO ShutdownHookManager: Deleting directory /private/var/folders/kj/trxlbgrx6rjcv1rccwgxd5xm0000gn/T/spark-6a6b84a5-e3b3-4765-8c0d-6eb9275fc377

Process finished with exit code 0

You will see the results.

Now we have configured everything we need to develop Spark applications. Let's start programming!