title: "Running Scala scripts in #Spark"
date: 2015-02-21
categories:
- "coding"
- "data-and-statistics"
- "spark"
The Spark shell serves us all well: you can quickly prototype a few simple lines of Scala (or Python with PySpark) and quit the program with a little more insight than you started with.
There are times when those scraps of code are handy enough to warrant keeping hold of them. Scala is nice in that respect: you can either run a script without compiling it, or compile your code into a full application.
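For contrast, the compiled route means wrapping the same sort of logic in an object with a main method, packaging it up with sbt and running it via spark-submit. A minimal sketch of what that might look like (the WordCount object and the build setup are assumptions for illustration, not something from the shell session below):

import org.apache.spark.{SparkConf, SparkContext}

// A hypothetical standalone word count application.
// Package it into a jar with sbt, then run it with spark-submit instead of the shell.
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val text = sc.textFile("/Users/Jason/coffee.csv")
    val counts = text.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}

This post sticks with the script route, which skips the build step entirely.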
WordCount From The Shell
Take the (classic) word count functionality. With Spark it’s a doddle…
scala> val text = sc.textFile("/Users/Jason/coffee.csv")
scala> val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
scala> counts.collect
15/02/21 14:52:55 INFO DAGScheduler: Job 0 finished: collect at <console>:17, took 0.898995 s
res0: Array[(String, Int)] = Array((Tea,66461), (Latte,8324), (Capuccino,8391), (Flat_White,8499), (Americano,8325))
It’s not fun retyping all of that every time you want to do a quick word count, though.
WordCount From The Command Line with a Script
Saving the lines you ran in the shell as a script is easy enough to do. Create a text file; let’s call this one wc.scala.
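The contents are just the lines you typed at the prompt, without the scala> prefix; something like this (using the same coffee.csv path as above):

// wc.scala - the shell session saved as a script; sc is provided by spark-shell
val text = sc.textFile("/Users/Jason/coffee.csv")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect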
Running it from the command line is just a case of firing up the shell again, this time using the -i flag to specify an input file.
$SPARK_HOME/bin/spark-shell -i wc.scala
Note that the shell doesn’t exit when the script finishes. So edit your wc.scala file and add an exit call as the last line:
System.exit(0)
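The finished wc.scala is then just the word count lines with the exit call appended, so running the spark-shell command above does the count and drops you back to your prompt:

// wc.scala - word count followed by an explicit exit so the shell quits
val text = sc.textFile("/Users/Jason/coffee.csv")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect
System.exit(0)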

