Apache Spark Hands-On Training (In-Company)
Highlights of this Workshop:
- We focus on Spark 2, the latest version
- We combine theory and practice through realistic and increasingly complex exercises
- These exercises can be developed and run in the Databricks Cloud Community Edition through a Spark Notebook environment in the browser, so nothing has to be installed on your laptop
- The Databricks Cloud environment lets you use Scala, Python, SQL and R interactively
- Presented by an expert in big data, Hadoop, Spark and the Databricks environment
Why do we organise this workshop about Apache Spark?
- Big Data lays the foundation for data-driven business, the future of business
- Apache Hadoop is great for storing and processing big data volumes in batch, but has its limitations
- Apache Spark is a fast, more general-purpose engine for large-scale data processing
- IBM called Apache Spark the "most important new open source project in a decade"
- Spark programs can run up to 100 times faster than Hadoop MapReduce in memory, or up to 10 times faster on disk
- Spark is easy to use, because you can write applications quickly in Java, Scala, Python or R
- Spark is an open source cluster computing framework for data analytics that supports different types of data analysis within the same technology stack: fast interactive queries, streaming analysis, graph analysis and machine learning
- During this two-day hands-on workshop, we discuss the theory and practice of several data analysis applications, and make sure you understand the framework, the environment and how to successfully run your own Spark projects
Who should attend this workshop?
This workshop is mainly aimed at developers, data analysts, data scientists, architects, software engineers and IT operations who want to develop Apache Spark applications. This course uses a hands-on approach to teach you the basics of Spark and give you a flying start.
You get an introduction to all Spark components from the perspective of "the data developer". Some experience with programming is necessary to get the most out of this course.
Please bring a laptop to the course. We'll run the exercises in a Notebook environment in the browser (no additional software needed on the laptop) via the Databricks cloud platform. Exercises vary from easy to complex, gradually adding functionality. Scala is our language of choice, but Python is possible as well.
We also offer this training as an in-house course for a minimum of 6 people from your company. The typical cost for an in-house training is 3.500 euro per day, excluding VAT, preparation, travel and hotel accommodation (if applicable).
Introducing the speaker, participants and workshop
INTRO TO SPARK
What is Apache Spark?
- Where does Spark come from?
- Why has it grown so quickly into the most popular cluster computing framework?
- What are the advantages compared to Hadoop and MapReduce?
- What is new in Spark 2.0?
- Data Engineering vs Data Science
- Notebooks: interactive programs that allow you to do data analysis and visualise the results
- Writing Spark programs using notebooks (Zeppelin, Spark Notebook, Databricks Cloud)
Just Enough Scala and Python
Spark was developed in Scala, a high-level programming language that combines object-oriented and functional programming. Programming Spark applications in Scala is straightforward for anyone who already knows another programming language. We look at defining variables and functions and at using collections in Scala.
However, because a lot of data science and statistical applications are currently programmed in Python, the open source community has developed a wonderful toolkit called PySpark, to expose the Spark programming model to Python.
We make sure that you are very familiar with the programming environment, so that you can start solving increasingly complex exercises.
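To give a flavour of the building blocks mentioned above (variables, functions and collections), here is a minimal sketch of the Python side. It uses only plain Python and no Spark. The names `shout`, `upper_languages`, `short_names` and `name_lengths` are illustrative and not part of the course material.

```python
# Variables: names are bound to values; no type declaration is required.
workshop_days = 2
languages = ["Scala", "Python", "SQL", "R"]  # a list (ordered collection)


def shout(word: str) -> str:
    """A named function: return the word in upper case."""
    return word.upper()


# Functions are values, so they can be passed to other functions.
upper_languages = list(map(shout, languages))

# A lambda is an anonymous function; filter keeps the matching elements.
short_names = list(filter(lambda name: len(name) <= 3, languages))

# A dictionary maps keys to values, here each language to its name length.
name_lengths = {name: len(name) for name in languages}

print(upper_languages)
print(short_names)
print(name_lengths)
```

Passing functions such as `shout` to `map` and `filter` is the same functional style that the Spark API relies on in both Scala and Python.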
We look at the Spark Core API from the perspective of the "Data Developer": from prototyping in the Spark Shell to the compilation and packaging of Spark applications for a cluster, and how this application is efficiently executed on a cluster.
The following topics will be covered:
- Spark Shell: the interactive shell for exploratory data analysis in Spark
- RDDs (Resilient Distributed Datasets): distributed collections of objects, the most important concept in Spark
- Transformations & Actions: operations on RDDs
- Job Execution
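The difference between transformations and actions in the list above can be illustrated without a cluster. The following is a plain-Python analogy, not the Spark API: the class `LazyCollection` is hypothetical and runs on a single machine. Its method names mirror the RDD operations `map`, `filter`, `collect` and `count`, and it only mimics the idea that transformations are recorded lazily while an action triggers the actual computation.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


class LazyCollection:
    """Hypothetical stand-in for an RDD: a data source plus a list of pending steps."""

    def __init__(self, source: Iterable, steps: Tuple = ()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new LazyCollection and compute nothing yet.
    def map(self, fn: Callable) -> "LazyCollection":
        return LazyCollection(self._source, self._steps + (("map", fn),))

    def filter(self, predicate: Callable) -> "LazyCollection":
        return LazyCollection(self._source, self._steps + (("filter", predicate),))

    def _run(self) -> Iterator:
        """Apply the recorded steps, in order, to the source data."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a result to the caller.
    def collect(self) -> List:
        return list(self._run())

    def count(self) -> int:
        return sum(1 for _ in self._run())


calls = []  # records every element that square() actually processes


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = LazyCollection(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = list(calls)  # still empty: nothing has been computed
result = even_squares.collect()    # the action runs the whole pipeline
print(calls_before_action, result)
```

In this sketch `square` is not called until `collect()` runs, which is the behaviour the term "lazy evaluation" refers to. Real Spark additionally distributes the work across a cluster and plans the job before executing it, which this analogy does not attempt to show.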
END OF DAY 1
End of Day 1 of this Workshop
DAY 2: WELCOME BACK
Welcome to Day 2 with Coffee/Tea
DATAFRAMES and DATASETS
- DataFrames: a distributed collection of data organized into named columns
MORE ADVANCED EXERCISES
Putting it all together
More extended, guided exercises in which most of the Spark modules are combined, showing the true power of Spark
End of this two-day Workshop
Geert Van Landeghem is a Big Data consultant with 25 years of experience working for companies across industries. He worked on his first big data project in 2011, and still advises companies on how to adopt big data within their organisation.
He has worked as the Head of BI for a gambling company in Belgium, where he led a team of 8 people. He has been an Apache Spark Certified Developer since November 2014, and has worked as an instructor for IBM and Datacrunchers, teaching Hadoop and Spark-related courses.
He is currently examining how Artificial Intelligence can be applied to business use cases, and has attended the first IBM Watson and O'Reilly AI conferences abroad.
Questions about this workshop? Interested but unable to attend? Send us an email!