Apache Spark Hands-On Training

Get yourself a flying start in Spark, regarded by many as the Swiss army knife of super-fast big data analysis

21-22 October 2015 (14.00h - 21.00h)
Location: Golden Tulip Brussels Airport (Diegem)
Presented in English by Geert Van Landeghem
Price: 1250 EUR (excl. 21% VAT)

This event has already taken place; please check out the List of Upcoming Seminars, or send us an email

Full Programme:
13.30h - 14.00h
Registration, Coffee/Tea and Croissants
What is Apache Spark?
  • Where does Spark come from?
  • Why has it so quickly grown into the most popular cluster-computing framework?
  • What are the advantages compared to Hadoop?
Just Enough Scala

Spark was developed in Scala, a high-level programming language that combines object-oriented and functional programming. We look at defining variables and functions, and at using collections in Scala.
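To give a flavour of what this part covers, here is a minimal Scala sketch of the three topics: immutable and mutable variables, a function definition, and the collection operations that Spark's API later builds on. Names and values are illustrative only.

```scala
// Variables: val is immutable, var is mutable
val greeting: String = "Hello, Spark"
var counter: Int = 0
counter += 1

// A function with explicit parameter and return types
def square(x: Int): Int = x * x

// Collections: map, filter and reduce are the same operations
// that Spark's RDD API exposes on distributed data
val numbers = List(1, 2, 3, 4, 5)
val evens   = numbers.filter(_ % 2 == 0)   // List(2, 4)
val squares = numbers.map(square)          // List(1, 4, 9, 16, 25)
val total   = numbers.reduce(_ + _)        // 15

println(s"evens=$evens squares=$squares total=$total")
```

Everything here runs in the plain Scala REPL, without Spark; the workshop starts from exactly this kind of snippet.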

Coffee/Tea and Refreshments
Spark API

We look at the Spark API from the perspective of the "Data Developer": from prototyping in the Spark Shell to compiling and packaging Spark applications for a cluster, and how such an application is executed efficiently on a cluster.

The following topics will be covered:

  • Spark Shell: the interactive shell for exploratory data analysis in Spark
  • Spark Context: to communicate with a cluster, and to create RDDs, broadcast variables and accumulators
  • Spark Master
  • RDD (Resilient Distributed Datasets): a distributed collection of objects, the most important concept in Spark
  • Transformations & Actions: operations on RDDs
  • Caching
  • Spark Applications
  • Spark Execution Model
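The key idea behind transformations and actions is that transformations are lazy: nothing is computed until an action asks for a result. A rough local sketch of this behaviour, using Scala's lazy views in place of an RDD (real Spark code would go through a SparkContext, e.g. sc.textFile, instead):

```scala
// Transformations vs. actions, illustrated with a lazy Scala view.
// In Spark, map and filter on an RDD build a lineage but run nothing;
// only an action such as count or collect triggers execution.
var evaluations = 0
val data = (1 to 10).view                  // lazy, like a chain of RDD transformations
  .map { x => evaluations += 1; x * 2 }
  .filter(_ % 3 == 0)

println(evaluations)                       // 0: nothing has been computed yet
val result = data.toList                   // the "action": forces evaluation
println(result)                            // List(6, 12, 18)
println(evaluations)                       // 10: each element was mapped exactly once
```

This laziness is what lets Spark plan and optimise a whole pipeline of operations before touching the data, and it is the reason caching intermediate RDDs matters: without it, re-running an action re-runs the lineage.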
Dinner buffet with an extensive selection of cold and warm dishes
Shared Variables

Having read-write shared variables across Spark tasks running on clusters would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

  • Broadcast variables: allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
  • Accumulators: Spark supports numerical accumulators that can be used as counters (like in MapReduce)
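The two patterns can be simulated locally. In this sketch, a plain immutable Map stands in for a broadcast variable and a LongAdder stands in for an accumulator; in real Spark code these would be sc.broadcast(...) and sc.longAccumulator(...) on a SparkContext. All names and data are illustrative.

```scala
import java.util.concurrent.atomic.LongAdder

// Read-only lookup data: in Spark this would be wrapped with
// sc.broadcast so each worker gets one cached copy
val lookup = Map("a" -> 1, "b" -> 2)

// Write-only counter: in Spark this would be sc.longAccumulator,
// which tasks can only add to and the driver can read
val errors = new LongAdder

val records  = Seq("a", "b", "x", "a", "y")
val resolved = records.flatMap { r =>
  lookup.get(r) match {
    case Some(v) => Some(v)                 // known record: keep its value
    case None    => errors.increment(); None // unknown record: count and drop
  }
}

println(resolved)    // List(1, 2, 1)
println(errors.sum)  // 2 unknown records counted, MapReduce-counter style
```

The division of roles is the point: tasks read the broadcast data but never modify it, and they write to the accumulator but never read it, which is what makes both safe to share across a cluster.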
Tuning and Performance

Which techniques are available to speed up Spark programs? We look at and compare several concepts and approaches.

End of Day 1 of this Workshop
13.30h - 14.00h
Welcome to Day 2 with Coffee/Tea and Croissants
Advanced Spark

Besides the Spark core module, we look at a number of modules that were added to the Spark stack:

  • Spark SQL: for the analysis of structured data with SQL commands
  • Spark Streaming: this module allows you to analyse a continuous stream of data while it is being received. We look at a simple Twitter analysis application
  • MLlib: a machine learning library for Spark. This allows clustering, classification and making recommendations. We look at a movie recommendation application
  • GraphX: a new module for making graphs available as RDDs, to run PageRank and other graph algorithms
  • Notebooks: notebooks are interactive programs that allow you to do data analysis and visualise the results. We use Apache Zeppelin to do this
Dinner buffet with an extensive selection of cold and warm dishes
Big Data Architecture with Spark
  • How does Spark fit in your architecture?
  • How does Spark complement e.g. Hadoop, Elasticsearch and MongoDB?
  • What are the Kappa and Lambda architectures?
Putting it all together

An extended, guided exercise in which most of the Spark modules are combined, showing the true power of Spark.

End of this two-day Workshop

Questions about this? Interested but unable to attend? Send us an email!