Project Based Learning for Apache Spark
Tutorials are great, but building projects is the best way to learn.
# | Project | What you will learn | Status |
---|---|---|---|
1 | Word Count | Solve the same problem with different Spark APIs | Complete |
2 | Sales Analytics | Identify and fix data skew using salting | In progress |
3 | Users & Departments | Identify and fix data skew using repartitioning | Complete |
4 | Stack Overflow Analytics | Who earned the first badge? | TODO |
5 | Protect Data Cardinality | | TODO |
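
The salting idea named in the Sales Analytics project can be sketched without a cluster. The snippet below is plain Python, not Spark code, and only models the technique: the store names, amounts, `NUM_SALTS`, and the `#` separator are hypothetical. It appends a salt to each key so the hot key splits into several smaller groups, aggregates per salted key, then strips the salt and aggregates again.

```python
from collections import Counter

NUM_SALTS = 4  # hypothetical number of salt values per key

# Hypothetical skewed input: one hot key holds 90% of the rows.
rows = [("hot_store", 10)] * 900 + [("store_a", 10)] * 50 + [("store_b", 10)] * 50

# Step 1: append a salt to every key. Round-robin keeps this sketch
# repeatable; a random salt is the more common choice in practice.
salted = [(f"{key}#{i % NUM_SALTS}", amount) for i, (key, amount) in enumerate(rows)]

# The largest key group is what a single task would have to process.
largest_before = max(Counter(key for key, _ in rows).values())
largest_after = max(Counter(key for key, _ in salted).values())

# Step 2: partial aggregation per salted key (the work that gets spread out).
partial = Counter()
for salted_key, amount in salted:
    partial[salted_key] += amount

# Step 3: strip the salt and combine the partial results per original key.
totals = Counter()
for salted_key, subtotal in partial.items():
    original_key = salted_key.rsplit("#", 1)[0]
    totals[original_key] += subtotal

print("largest group before salting:", largest_before)
print("largest group after salting: ", largest_after)
print("totals:", dict(totals))
```

The final totals match an unsalted aggregation; the extra cost is the second aggregation step, which runs over far fewer rows.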
# | Name | Input | Output | Examples |
---|---|---|---|---|
1 | UDF or built-in functions | Single row | Single return value for every input row | round, substr |
2 | Aggregate functions | Group of rows | Single return value for every group | avg, min |
3 | Window functions | Group of rows | Single return value for every input row | ranking etc. |
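
To make the input and output shapes in the table concrete, here is a minimal sketch in plain Python. It does not use the Spark API; it only mimics what each function type returns. The sample rows and the choice of "rank by salary within a department" are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (department, employee, salary)
rows = [
    ("eng", "ana", 120.25),
    ("eng", "bo", 99.75),
    ("ops", "cy", 80.25),
]

# 1. Row-wise function (like round or substr):
#    reads one row, returns one value for that row.
rounded = [round(salary) for _, _, salary in rows]

# Collect salaries per department for the next two steps.
groups = defaultdict(list)
for dept, _, salary in rows:
    groups[dept].append(salary)

# 2. Aggregate function (like avg or min):
#    reads a group of rows, returns one value per group.
avg_by_dept = {dept: sum(vals) / len(vals) for dept, vals in groups.items()}

# 3. Window function (like a ranking):
#    reads a group of rows, but returns one value for every input row.
ranked = []
for dept, name, salary in rows:
    # Rank 1 is the highest salary within the row's own department.
    rank = 1 + sum(1 for other in groups[dept] if other > salary)
    ranked.append((dept, name, rank))

print(len(rows), "rows in")
print(len(rounded), "row-wise results:", rounded)
print(len(avg_by_dept), "aggregate results:", avg_by_dept)
print(len(ranked), "window results:", ranked)
```

Note the row counts: the row-wise and window results have as many entries as the input, while the aggregate collapses each group to a single entry.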
Shuffling
Windowing