learn-spark

Project Based Learning for Apache Spark

View the Project on GitHub yaravind/learn-spark

PBL (Project Based Learning) approach to learning Apache Spark

Tutorials are great, but building projects is the best way to learn.

Projects

| # | Project | What you will learn | Status |
|---|---------|---------------------|--------|
| 1 | Word Count | Solve the same problem with different Spark APIs | Complete |
| 2 | Sales Analytics | Identify and fix data skew, salting | In progress |
| 3 | Users & Departments | Identify and fix data skew, repartition | Complete |
| 4 | Stack Overflow Analytics | Who earned the first badge? | TODO |
| 5 | Protect Data | Cardinality | TODO |

Notes

  1. Data Skews
  2. Repartition
  3. Performance Tuning
  4. Task Serialization
  5. Garbage Collection

3 Spark SQL function types

| # | Name | Input | Output | Examples |
|---|------|-------|--------|----------|
| 1 | UDFs or built-in functions | Single row | Single return value for each input row | round, substr |
| 2 | Aggregate functions | Group of rows | Single return value for every group | avg, min |
| 3 | Window functions | Group of rows | Single value for every input row | ranking functions, etc. |
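
The input/output shapes in the table can be illustrated without a Spark cluster. The sketch below is plain Python, not the Spark API: the sample rows and the helper names (`scalar_prefix`, `aggregate_avg`, `window_rank`) are made up for illustration. It only mimics how many values each function type returns for a given input.

```python
from collections import defaultdict

# Hypothetical sample data: (department, employee, salary) rows.
rows = [
    ("eng", "ann", 120),
    ("eng", "bob", 100),
    ("ops", "cat", 80),
    ("eng", "dan", 110),
]


def scalar_prefix(rows):
    """Scalar shape (like substr): one value per input row, computed from that row alone."""
    return [name[:2] for _, name, _ in rows]


def aggregate_avg(rows):
    """Aggregate shape (like avg): one value per group of rows."""
    groups = defaultdict(list)
    for dept, _, salary in rows:
        groups[dept].append(salary)
    return {dept: sum(vals) / len(vals) for dept, vals in groups.items()}


def window_rank(rows):
    """Window shape (like rank): one value per input row, computed from the row's group.

    A row's rank is 1 plus the number of rows in the same department
    with a strictly higher salary.
    """
    return [
        1 + sum(1 for d, _, s in rows if d == dept and s > salary)
        for dept, _, salary in rows
    ]


prefixes = scalar_prefix(rows)   # 4 rows in, 4 values out
averages = aggregate_avg(rows)   # 4 rows in, 2 values out (one per department)
ranks = window_rank(rows)        # 4 rows in, 4 values out

print(prefixes)
print(averages)
print(ranks)
```

The aggregate collapses each department to a single value, while the window function keeps every row and attaches a value derived from that row's group. That difference is what separates rows 2 and 3 of the table.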

Additional Joins

References

Shuffling

Windowing