Project Based Learning for Apache Spark
Tutorials are great, but building projects is the best way to learn.
| # | Project | What you will learn | Status |
|---|---|---|---|
| 1 | Word Count | Solve the same problem with different Spark APIs | Complete |
| 2 | Sales Analytics | Identify and fix data skew, salting | In progress |
| 3 | Users & Departments | Identify and fix data skew, repartitioning | Complete |
| 4 | Stack Overflow Analytics | Who earned the first badge? | TODO |
| 5 | Protect Data Cardinality | | TODO |

| # | Name | Input | Output | Examples |
|---|---|---|---|---|
| 1 | UDF or built-in functions | Single row | Single return value for each input row | round, substr |
| 2 | Aggregate functions | Group of rows | Single return value for every group | avg, min |
| 3 | Window functions | Group of rows | Single value for every input row | rank, row_number |
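
The three input/output shapes in the table above can be sketched in plain Python, without Spark, to make the difference concrete. This is only an illustration: the sample rows and the helper names (`rounded_salaries`, `avg_salary_by_dept`, `rank_within_dept`) are hypothetical and not part of any project in this repository. The comments name the Spark functions each helper imitates.

```python
from collections import defaultdict

# Hypothetical sample data: (department, employee, salary)
rows = [
    ("eng", "ana", 120.4),
    ("eng", "bo", 99.6),
    ("ops", "cy", 80.2),
    ("ops", "di", 80.2),
    ("ops", "ed", 70.0),
]


def rounded_salaries(rows):
    """Row-wise function (like Spark's round): one output per input row."""
    return [round(salary) for _, _, salary in rows]


def avg_salary_by_dept(rows):
    """Aggregate function (like groupBy + avg): one output per group."""
    groups = defaultdict(list)
    for dept, _, salary in rows:
        groups[dept].append(salary)
    return {dept: sum(vals) / len(vals) for dept, vals in groups.items()}


def rank_within_dept(rows):
    """Window function (like rank over a partition): one output per input
    row, computed from all rows in the same group. Highest salary ranks
    first; ties share a rank and leave a gap after it."""
    groups = defaultdict(list)
    for dept, _, salary in rows:
        groups[dept].append(salary)
    # rank = 1 + number of rows in the same group with a strictly higher salary
    return [
        1 + sum(1 for other in groups[dept] if other > salary)
        for dept, _, salary in rows
    ]


print(rounded_salaries(rows))    # 5 rows in, 5 values out
print(avg_salary_by_dept(rows))  # 5 rows in, 2 values out (one per department)
print(rank_within_dept(rows))    # 5 rows in, 5 values out
```

Note that the aggregate collapses each group to one value, while the window function keeps every row and only attaches a group-aware value to it.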
Shuffling
Windowing