Apache Spark Hands-On Training (In-Company)

Want to learn Spark fast, practice it, and get off to a flying start?

ON REQUEST
Location: In-company (YOUR COMPANY)
Presented in English by Geert Van Landeghem
Price: ASK FOR PRICE QUOTE (excl. 21% VAT)

Learning Objectives

Highlights of this Workshop:

We focus on Spark 2, the latest version
We combine theory and practice through realistic and increasingly complex exercises
You develop and run these exercises in the Databricks Cloud Community Edition, a Spark Notebook environment in the browser, so nothing needs to be installed locally: a browser is all you need
The Databricks Cloud environment lets you use Scala, Python, SQL and R interactively
Presented by an expert in big data, Hadoop, Spark and the Databricks environment

Why do we organise this workshop on Apache Spark?

Big Data lays the foundation for data-driven business, the future of business
Apache Hadoop is great for storing and processing big data volumes in batch, but has its limitations
Apache Spark is a fast, more general-purpose engine for large-scale data processing
IBM called Apache Spark the "most important new open source project in a decade"
Spark programs can be up to 100 times faster than Hadoop MapReduce in memory, or up to 10 times faster on disk
Spark is easy to use, because you can write applications quickly in Java, Scala, Python or R
It is an open source data analytics cluster computing framework that supports different types of data analysis within the same technology stack: fast interactive queries, streaming analysis, graph analysis and machine learning
During this two-day hands-on workshop, we discuss the theory and practice of several data analysis applications, and make sure you understand the framework, the environment and how to successfully run your own Spark projects

Who should attend this workshop?

This workshop is mainly aimed at developers, data analysts, data scientists, architects, software engineers and IT operations staff who want to develop Apache Spark applications. It uses a hands-on approach to teach you the basics of Spark and give you a flying start.

You get an introduction to all Spark components from the perspective of "the data developer". Some experience with programming is necessary to get the most out of this course.

Please bring a laptop to the course. We'll run the exercises in a Notebook environment in the browser (no additional software needed on the laptop) via the Databricks cloud platform. Exercises vary from easy to complex, gradually adding functionality. Scala is our language of choice, but Python is possible as well.

We also offer this training as an in-house course for a minimum of 6 people from your company. The typical cost for an in-house training is 3.500 euro per day, excluding VAT, preparation, travel and hotel accommodation (if applicable).

Full Programme

WELCOME

Introducing the speaker, participants and workshop

INTRO TO SPARK

What is Apache Spark?

Where does Spark come from?
Why has it grown so quickly into the most popular cluster computing framework?
What are the advantages compared to Hadoop and MapReduce?
What is new in Spark 2.0?
Data Engineering vs Data Science

Notebooks

Notebooks: interactive programs that allow you to do data analysis and visualise the results
Writing Spark programs using notebooks (Zeppelin, Spark Notebook, Databricks Cloud)

Just Enough Scala and Python

Spark was developed in Scala, a high-level programming language that combines object-oriented and functional programming. Programming Spark applications in Scala is straightforward for anyone who is already familiar with another programming language. We look at the definition of variables, functions and the use of collections in Scala.

However, because a lot of data science and statistical applications are currently programmed in Python, the open source community has developed a wonderful toolkit called PySpark, to expose the Spark programming model to Python.

We make sure that you are very familiar with the programming environment, so that you can start solving increasingly complex exercises.
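
To give a flavour of this part, here is a minimal sketch in plain Python of the three ingredients mentioned above: variables, functions and collections used in a functional style. It is an illustration only, not course material: the sample word list and the `word_length` helper are hypothetical, and the Scala equivalents are not shown.

```python
from functools import reduce

# A variable holding a small collection (hypothetical sample data).
words = ["spark", "scala", "python", "notebook"]


def word_length(word: str) -> int:
    """Return the number of characters in a word."""
    return len(word)


# map: apply a named function to every element of the collection.
lengths = list(map(word_length, words))

# filter: keep only the elements that satisfy a predicate,
# written here as an anonymous (lambda) function.
long_words = list(filter(lambda w: len(w) > 5, words))

# reduce: combine all elements into a single value.
total_length = reduce(lambda a, b: a + b, lengths)

print(lengths)
print(long_words)
print(total_length)
```

The same map, filter and reduce vocabulary returns later in the workshop, applied to distributed collections instead of local lists.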

GETTING STARTED

Spark Basics

We look at the Spark Core API from the perspective of the "Data Developer": from prototyping in the Spark Shell to the compilation and packaging of Spark applications for a cluster, and how this application is efficiently executed on a cluster.

The following topics will be covered:

Spark Shell: the interactive shell for exploring and analysing data in Spark
RDDs (Resilient Distributed Datasets): distributed collections of objects, the most important concept in Spark
Transformations & Actions: operations on RDDs
Job Execution
Clustering
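
The difference between transformations and actions in the list above can be sketched with a toy model in plain Python. This is an analogy that runs on a single machine and is not Spark's API: the `LazyCollection` class and the `square` helper are hypothetical names invented for this sketch. The idea it illustrates is that a transformation only describes a computation, while an action actually runs it.

```python
from functools import reduce
from typing import Any, Callable, Iterable, List


class LazyCollection:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source: Callable[[], Iterable[Any]]) -> None:
        # Store a recipe for producing the data, not the data itself.
        self._source = source

    # Transformations: return a new LazyCollection and compute nothing yet.
    def map(self, fn: Callable[[Any], Any]) -> "LazyCollection":
        return LazyCollection(lambda: (fn(x) for x in self._source()))

    def filter(self, predicate: Callable[[Any], bool]) -> "LazyCollection":
        return LazyCollection(lambda: (x for x in self._source() if predicate(x)))

    # Actions: run the whole recipe and return a concrete result.
    def collect(self) -> List[Any]:
        return list(self._source())

    def reduce(self, fn: Callable[[Any, Any], Any]) -> Any:
        return reduce(fn, self._source())


evaluated = []  # records every call to square, to show when work happens


def square(x: int) -> int:
    evaluated.append(x)
    return x * x


numbers = LazyCollection(lambda: range(1, 6))
squares = numbers.map(square)                        # transformation
even_squares = squares.filter(lambda x: x % 2 == 0)  # transformation

calls_before_action = len(evaluated)  # still 0: nothing has run yet
result = even_squares.collect()       # action: triggers the computation
calls_after_action = len(evaluated)   # square has now run once per element

print(calls_before_action, result, calls_after_action)
```

In this toy model every action replays the full recipe from the source. Spark RDDs behave similarly unless you explicitly ask for intermediate results to be cached.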

END OF DAY 1

End of Day 1 of this Workshop

DAY 2: WELCOME BACK

Welcome to Day 2 with Coffee/Tea

DATAFRAMES and DATASETS

Spark SQL

SQL
DataFrames: a distributed collection of data organized into named columns
Datasets

LUNCH

MORE ADVANCED EXERCISES

Putting it all together

More extended, guided exercises in which most of the Spark modules are combined, showing the true power of Spark

FINISH

End of this two-day Workshop

Speakers

Geert Van Landeghem (DataCrunchers)

Geert Van Landeghem is a Big Data consultant with 25 years of experience working for companies across industries. He worked on his first big data project in 2011, and still advises companies on how to adopt big data within their organisation.

He has worked as the Head of BI for a gambling company in Belgium, where he led a team of 8 people. He has been an Apache Spark Certified Developer since November 2014, and has worked as an instructor for IBM and DataCrunchers, where he teaches Hadoop and Spark-related courses.

He is currently examining how Artificial Intelligence can be applied to business use cases, and to that end attended the first IBM Watson and O'Reilly AI conferences abroad.

Check out our related in-house workshops:

Google BigQuery in Practice (INHOUSE WORKSHOP - On Request)
Het Logisch Datawarehouse - Architectuur, Ontwerp en Technologie (INHOUSE WORKSHOP - On Request)
Business Intelligence en Datawarehousing Fundamentals (INHOUSE WORKSHOP - On Request)
The Hadoop Ecosystem (INHOUSE WORKSHOP - On Request)
Big Data Oplossingen voor BI (INHOUSE WORKSHOP - On Request)
Lean Business Analyse (INHOUSE WORKSHOP - On Request)
Business Analysis Agility (INHOUSE WORKSHOP - On Request)
Minimum Viable Products (MVPs) Demystified (INHOUSE WORKSHOP - On Request)
Aan de Slag met RPA, UiPath en Blue Prism (INHOUSE WORKSHOP - On Request)
Data Vault in a Day (INHOUSE WORKSHOP - On Request)
