Apache Spark Jumpstart

Want to learn Spark fast, master it, and get a flying start on Spark projects? Join us for this two-day hands-on workshop.

10-11 August 2017 (10-18h)
Location: Parker Hotel (Diegem)
Presented in English by Geert Van Landeghem
Price: 1250 EUR (excl. 21% VAT)

This event has already taken place. Please check out the List of Upcoming Seminars, or send us an email.

 Learning Objectives

Why do we organise this workshop about Apache Spark?

Big Data gets a lot of attention these days in ICT and marketing, because big data lays the foundation for data-driven business. Since its inception in 2007, Apache Hadoop has been regarded as the de facto standard for storing and processing big data volumes in batch. But every technology has its limitations, and Hadoop is no exception: it is batch-oriented, and the MapReduce framework is too limited to handle all types of data analysis within the same technology stack.

Because the volume, speed and complexity of data generated by mobile, social and sensor (IoT) sources steadily increase, the need for faster data processing and analysis to meet the needs and expectations of end users increases as well.

IBM calls Apache Spark "the most important new open source project in a decade"

Apache Spark addresses the problems of speed and versatility by offering an "open source data analytics cluster computing framework". Spark was developed in 2009 at the AMPLab (Algorithms, Machines, and People Lab) of the University of California, Berkeley, and was donated to the open source community in 2010. It is faster than Hadoop, in some cases up to 100 times faster, and it offers a framework that supports different types of data analysis within the same technology stack: fast interactive queries, streaming analysis, graph analysis and machine learning. During this two-day hands-on workshop, we discuss the theory and practice of several of these data analysis applications.

Who should attend this workshop?

This workshop is mainly aimed at developers, data analysts and data scientists who want to know more about Apache Spark. This course uses a hands-on approach to teach you the basics of Spark and give you a flying start.

You get an introduction to all Spark components from the perspective of the "data developer". Some experience with programming is necessary to get the most out of this course.

The exercises are done on your own laptop and range from easy to complex, gradually adding functionality. We explain how to do the exercises in Scala or Python; the choice is up to you.

We also offer this training as an in-house course for a minimum of 5 people from your company.

 Full Programme

9.30h - 10.00h
Registration, Coffee/Tea and Croissants
10.00h
What is Apache Spark?
  • Where does Spark come from?
  • Why has it so quickly grown into the most popular cluster computing framework?
  • What are the advantages compared to Hadoop and MapReduce?
  • What is new in Spark 2.0 (released at the end of July 2016)?
  • Data Engineering vs Data Science
 
Just Enough Scala and Python

Spark was developed in Scala, a high-level programming language that combines object-oriented and functional programming. Programming Spark applications in Scala is straightforward for anyone who is already familiar with a programming language. We look at defining variables and functions, and at the use of collections in Scala.

However, because many data science and statistical applications are currently programmed in Python, the open source community has developed a wonderful toolkit called PySpark, which exposes the Spark programming model to Python.
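
As a taste of the Scala constructs we cover, here is a minimal sketch of variables, a function and collection operations; the higher-order style (map, filter, reduce) is exactly what Spark's API builds on:

    // Variable definitions: val is immutable, var is mutable
    val greeting: String = "Hello, Spark"
    var counter: Int = 0

    // A simple function definition
    def square(x: Int): Int = x * x

    // Collections and higher-order functions
    val numbers = List(1, 2, 3, 4, 5)
    val squares = numbers.map(square)        // List(1, 4, 9, 16, 25)
    val evens   = numbers.filter(_ % 2 == 0) // List(2, 4)
    val total   = numbers.reduce(_ + _)      // 15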

11.30h
Coffee/Tea and Refreshments
 
Spark Core API

We look at the Spark Core API from the perspective of the "Data Developer": from prototyping in the Spark Shell to compiling and packaging Spark applications for a cluster, and how such an application is executed efficiently on that cluster.

The following topics will be covered (a short Spark Shell sketch follows the list):

  • Spark Shell: the shell for doing data analysis in Spark interactively
  • Spark Context: used to communicate with a cluster and to create RDDs, broadcast variables and accumulators
  • Spark Master
  • RDDs (Resilient Distributed Datasets): distributed collections of objects, the most important concept in Spark
  • DataFrames: a distributed collection of data organized into named columns
  • Transformations & Actions: operations on RDDs
  • Caching
  • Spark Applications
  • Spark Execution Model
  • Notebooks: interactive programs that allow you to do data analysis and visualise the results
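
By way of illustration, a minimal Spark Shell sketch that touches several of these topics; it assumes the shell's predefined SparkContext `sc`, and the input path is a placeholder:

    // In the Spark Shell, sc (the SparkContext) is predefined.
    // Transformations are lazy; actions trigger the actual computation.
    val lines     = sc.textFile("data/sample.txt")   // RDD[String]; placeholder path
    val words     = lines.flatMap(_.split("\\s+"))   // transformation
    val longWords = words.filter(_.length > 5)       // transformation
    longWords.cache()                                // keep the RDD in memory for reuse
    println(longWords.count())                       // action: runs the pipeline
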
13.00h
Lunch
13.45h
Getting started: Exercises with Shared Variables

Supporting general read-write shared variables across Spark tasks running on a cluster would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns, sketched after the list below: broadcast variables and accumulators.

  • Broadcast variables: allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
  • Accumulators: Spark supports numerical accumulators that can be used as counters (like in MapReduce)
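
A minimal sketch of both shared variable types, again assuming the Spark Shell's predefined `sc`; the lookup table and record values are made up for illustration:

    // Broadcast a small lookup table once to every executor,
    // instead of shipping a copy with every task.
    val countries = sc.broadcast(Map("BE" -> "Belgium", "NL" -> "Netherlands"))

    // A numerical accumulator used as a counter (like in MapReduce).
    val unknown = sc.longAccumulator("unknown codes")

    val codes = sc.parallelize(Seq("BE", "NL", "XX"))
    val names = codes.map { code =>
      countries.value.getOrElse(code, { unknown.add(1); "unknown" })
    }
    names.collect()          // action: accumulator updates happen here
    println(unknown.value)   // 1
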
18.00h
End of Day 1 of this Workshop
9.30h - 10.00h
Welcome to Day 2 with Coffee/Tea and Croissants
10.00h
More advanced data engineering exercises with Spark in Scala or Python

We gradually increase the complexity of our exercises using RDDs and DataFrames.
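
To give an idea of what these exercises look like, a minimal sketch of moving from an RDD to a DataFrame in the Spark 2.0 shell, where the SparkSession is predefined as `spark`; the Person records are made up for illustration:

    // The implicits enable rdd.toDF() and the $"column" syntax.
    import spark.implicits._

    case class Person(name: String, age: Int)

    // Build a DataFrame from an RDD of case classes...
    val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 28))).toDF()

    // ...and work with named columns instead of raw objects.
    people.filter($"age" > 30).show()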

13.00h
Lunch
13.45h
Beyond Spark's Core API: Spark SQL

Besides Spark's core module, we look at a number of modules that have been added to the Spark stack, such as Spark SQL, and do some exercises on processing structured data with SQL commands.
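
A minimal sketch of the kind of exercise we do here, assuming the predefined SparkSession `spark` and a placeholder JSON input file:

    // Load structured data, register it as a temporary view,
    // and query it with plain SQL.
    val people = spark.read.json("data/people.json")   // placeholder path
    people.createOrReplaceTempView("people")

    spark.sql("SELECT name, age FROM people WHERE age > 30").show()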

15.45h
Putting it all together

A more extensive, guided exercise in which most of the Spark modules are combined, showing the true power of Spark.
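
In spirit, this exercise combines the pieces from both days into one pipeline; a condensed sketch, with a made-up log format and placeholder path:

    import spark.implicits._

    // Core API: parse raw text into (status, 1) pairs.
    val lines  = sc.textFile("data/access.log")       // placeholder path
    val status = lines.map(_.split(" "))
                      .filter(_.length > 8)
                      .map(f => (f(8), 1))            // assume field 8 holds the HTTP status

    // DataFrames + caching + Spark SQL on the same data.
    val hits = status.toDF("status", "one").cache()
    hits.createOrReplaceTempView("hits")

    spark.sql("SELECT status, COUNT(*) AS n FROM hits GROUP BY status ORDER BY n DESC").show()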

18.00h
End of this two-day Workshop

 Speakers


Geert Van Landeghem (DataCrunchers)

Geert Van Landeghem is a Big Data consultant with 25 years of experience working for companies across industries. He worked on his first big data project in 2011, and still advises companies on how to adopt big data within their organisation.

He has worked as Head of BI for a gambling company in Belgium, where he led a team of 8 people. He has been an Apache Spark Certified Developer since November 2014, and has worked as an instructor for IBM and DataCrunchers, where he teaches Hadoop- and Spark-related courses.

He is currently examining how Artificial Intelligence can be used for business use cases, and to that end attended the first IBM Watson and O'Reilly AI conferences abroad.

Questions about this workshop? Interested but unable to attend? Send us an email!
