Apache Spark Hands-On Training

Get off to a flying start with Spark, which some call the Swiss Army knife of superfast big data analysis

3-4 February 2016 (14.00h - 21.00h)
Location: Golden Tulip Brussels Airport (Diegem)
Presented in English by Geert Van Landeghem
Price: 1250 EUR (excl. 21% VAT)

This event has already taken place. Please check out the List of Upcoming Seminars, or send us an email.

Check out our related in-house workshops:

Google BigQuery in Practice (INHOUSE WORKSHOP - On Request)
Apache Spark Hands-On Training (In-Company) (INHOUSE WORKSHOP - On Request)
The Logical Data Warehouse - Architecture, Design and Technology (INHOUSE WORKSHOP - On Request)
Business Intelligence and Data Warehousing Fundamentals (INHOUSE WORKSHOP - On Request)
The Hadoop Ecosystem (INHOUSE WORKSHOP - On Request)
Big Data Solutions for BI (INHOUSE WORKSHOP - On Request)
Lean Business Analysis (INHOUSE WORKSHOP - On Request)
Business Analysis Agility (INHOUSE WORKSHOP - On Request)
Minimum Viable Products (MVPs) Demystified (INHOUSE WORKSHOP - On Request)
Getting Started with RPA, UiPath and Blue Prism (INHOUSE WORKSHOP - On Request)
Data Vault in a Day (INHOUSE WORKSHOP - On Request)

Learning Objectives

Why do we organise this workshop on Apache Spark?

Big Data is the hype of the moment in ICT and marketing. Since its inception in 2007, Apache Hadoop has been regarded as the de facto standard for storing and processing big data volumes in batch.

But every technology has its limitations, and this is no different for Hadoop: it is batch-oriented and the MapReduce framework is too limited for handling all types of data analysis within the same technology stack.

As the volume and speed of data generation keep increasing, so does the need for faster data processing and analysis to meet the needs and expectations of end users.

IBM calls Apache Spark the "most important new open source project in a decade"

Apache Spark addresses the problem of speed and versatility by offering an "open source data analytics cluster computing framework". Spark was developed in 2009 at the AMPLab (Algorithms, Machines, and People Lab) of the University of California, Berkeley, and donated to the open source community in 2010. It is faster than Hadoop MapReduce, in some cases up to 100 times faster, and it offers a framework that supports different types of data analysis within the same technology stack: fast interactive queries, streaming analysis, graph analysis and machine learning. During this two-day hands-on workshop, we discuss the theory and practice of several data analysis applications.

Who should attend this workshop?

This workshop is mainly aimed at developers, data analysts and data scientists who want to know more about Apache Spark. This course uses a hands-on approach to teach you the basics of Spark and give you a flying start.

You get an introduction to all Spark components from the perspective of the "data developer". Some experience with programming is necessary to get the most out of this course.

You work through the exercises on your own laptop using Scala (unfortunately, the Spark Python API, PySpark, still causes problems). The exercises range from easy to complex, gradually adding functionality.

We also offer this training as an in-house course for a minimum of 5 people from your company.

Full Programme

13.30h - 14.00h

Registration, Coffee/Tea and Croissants

14.00h

What is Apache Spark?

Where does Spark come from?
Why has it grown so quickly into the most popular cluster computing framework?
What are the advantages compared to Hadoop?

Just Enough Scala

Spark was developed in Scala, a high-level programming language that combines object-oriented and functional programming. We look at the definition of variables, functions and the use of collections in Scala.
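
As a rough taste of what this part covers, here is a minimal sketch in plain Scala (no Spark required). The names used (JustEnoughScala, greeting, square, isEven, numbers) are made up for illustration and are not taken from the course material.

```scala
object JustEnoughScala {
  // Immutable value (val) versus mutable variable (var)
  val greeting: String = "Hello, Spark"
  var runs: Int = 0

  // A named function and an anonymous function stored in a value
  def square(x: Int): Int = x * x
  val isEven: Int => Boolean = n => n % 2 == 0

  // Collections with higher-order methods, the same style Spark's API uses
  val numbers: List[Int] = List(1, 2, 3, 4, 5)
  val evenSquares: List[Int] = numbers.filter(isEven).map(square)
  val total: Int = numbers.reduce(_ + _)

  def main(args: Array[String]): Unit = {
    runs += 1
    println(s"$greeting (run $runs)")
    println(evenSquares)
    println(total)
  }
}
```

The filter, map and reduce calls on a local List look very much like the operations Spark offers on distributed data, which is why a little Scala goes a long way here.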

15.30h

Coffee/Tea and Refreshments

Spark Core API

We look at the Spark Core API from the perspective of the "Data Developer": from prototyping in the Spark Shell to the compilation and packaging of Spark applications for a cluster, and how this application is efficiently executed on a cluster.

The following topics will be covered:

Spark Shell: the interactive shell for exploratory data analysis in Spark
Spark Context: the entry point for communicating with a cluster and for creating RDDs, broadcast variables and accumulators
Spark Master
RDD (Resilient Distributed Dataset): a distributed collection of objects, the most important concept in Spark
Transformations & Actions: operations on RDDs
Caching
Spark Applications
Spark Execution Model
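
To make the topics above a little more concrete, here is a minimal sketch of a Spark application that creates an RDD, applies transformations, caches the result and runs two actions. It assumes the spark-core library is on the classpath and runs in local mode; the object name WordLengths and the sample words are hypothetical and not part of the course material.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordLengths {
  // Builds a small RDD and returns (number of long words, sum of their lengths)
  def run(sc: SparkContext): (Long, Int) = {
    // Create an RDD from a local collection
    val words = sc.parallelize(Seq("spark", "rdd", "cluster", "scala"))

    // Transformations are lazy: nothing is computed yet
    val lengths = words.map(_.length)
    val longOnes = lengths.filter(_ > 3).cache() // keep in memory for reuse

    // Actions trigger the actual computation
    val count = longOnes.count()
    val sum = longOnes.reduce(_ + _)
    (count, sum)
  }

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM; on a cluster you would pass a master URL instead
    val conf = new SparkConf().setAppName("WordLengths").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc))
    finally sc.stop()
  }
}
```

Because longOnes is cached, the second action can reuse the filtered data instead of recomputing it from the original collection.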

18.00h

Dinner buffet with an extensive choice of cold and warm dishes

18.45h

Shared Variables

Having read-write shared variables across Spark tasks running on clusters would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

Broadcast variables: allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
Accumulators: Spark supports numerical accumulators that can be used as counters (as in MapReduce)
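
The two kinds of shared variables can be sketched as follows. This is an illustration under assumptions, not the course's own exercise: it assumes Spark 2.0 or later, where longAccumulator is available (Spark 1.x used sc.accumulator instead), and the object name, lookup table and country codes are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  // Returns the resolved country names and the number of unknown codes seen
  def run(sc: SparkContext): (Seq[String], Long) = {
    // Broadcast variable: a read-only lookup table, shipped once per machine
    val countryNames = sc.broadcast(Map("BE" -> "Belgium", "NL" -> "Netherlands"))

    // Accumulator: tasks can only add to it, the driver reads the result
    val unknownCodes = sc.longAccumulator("unknownCodes")

    val codes = sc.parallelize(Seq("BE", "NL", "XX", "BE"))
    val names = codes.flatMap { code =>
      val name = countryNames.value.get(code)
      // Note: updates made inside a transformation may be applied again if a task is re-run
      if (name.isEmpty) unknownCodes.add(1)
      name.toList
    }.collect().toSeq

    val unknown: Long = unknownCodes.value
    (names, unknown)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SharedVariablesSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc))
    finally sc.stop()
  }
}
```

The accumulator is read only after collect() has run, because its value is only meaningful on the driver once an action has completed.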

21.00h

End of Day 1 of this Workshop

13.30h - 14.00h

Welcome to Day 2 with Coffee/Tea and Croissants

14.00h

Beyond Spark's Core API

Besides Spark's core module, we look at a number of modules that were added to the Spark stack:

Spark SQL: for the analysis of structured data with SQL commands
Spark Streaming: this module allows you to analyse a continuous stream of data while it is being received. We look at a simple Twitter analysis application
MLlib: a machine learning library for Spark. It supports clustering, classification and making recommendations. We look at a movie recommendation application
GraphX: a module that represents graphs as RDDs so that you can run PageRank and other graph algorithms
Notebooks: interactive programs that allow you to do data analysis and visualise the results. We use Apache Zeppelin for this

18.00h

Dinner buffet with an extensive choice of cold and warm dishes

18.45h

Putting it all together

A more extended, guided exercise in which most of the Spark modules are combined, showing the true power of Spark.

21.00h

End of this two-day Workshop

Speakers

Geert Van Landeghem (DataCrunchers)

Geert Van Landeghem is a Big Data consultant with over 20 years of experience. He got interested in Big Data in 2010 and implemented his first Big Data project in 2011. Many big data projects later, he currently works as the Head of the BI team and Big Data architect for an online gambling company that uses Spark. He is always eager to learn new big data technologies and to translate them into new business solutions. He is also the co-organiser of the bigdata.be meetup group.

Geert was an instructor for IBM and has developed many courses for datacrunchers.eu.

In November 2014, he received the "Developer Certification for Apache Spark" from Databricks and O'Reilly.

Questions about this? Interested but unable to attend? Send us an email!