PySpark or SparkSQL for Data Wrangling

Apache Spark is a well-established data processing engine for workflows large and/or complex enough to benefit from distributed processing across multiple compute nodes. I created this demo on a Spark instance I spun up effortlessly and free of charge in the Databricks Community Edition. While RDDs (Resilient Distributed Datasets) remain a technical… Read More

Python: What is Pandas’ equivalent to a just-slightly complex SQL query?

Our Python journey now takes us into Pandas DataFrames, whose native syntax diverges sharply from SQL, especially as queries grow more analytically complex. We will answer the following question, based on an included public list of employees and their jobs. From a list where one row represents one employee, how many employee job titles in… Read More
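As a hedged sketch of the kind of translation the post explores, a SQL `GROUP BY ... COUNT(*)` over an employee list maps to a pandas `groupby` chain. The `employees` frame and its column names are hypothetical placeholders, since the post's actual dataset is not shown in this excerpt.

```python
import pandas as pd

# Hypothetical employee list: one row per employee.
employees = pd.DataFrame({
    "employee": ["Alice", "Bob", "Cara", "Dan"],
    "job_title": ["Engineer", "Analyst", "Engineer", "Manager"],
})

# SQL equivalent:
#   SELECT job_title, COUNT(*) AS n
#   FROM employees
#   GROUP BY job_title
#   ORDER BY n DESC;
title_counts = (
    employees.groupby("job_title")
    .size()
    .reset_index(name="n")
    .sort_values("n", ascending=False)
)
print(title_counts)
```

Note the shape of the translation: SQL's declarative clauses become a chain of method calls, and `reset_index(name="n")` is needed to turn the grouped result back into a flat table with a named count column.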