Featured

Amazon QuickSight: Deep Dive

One of my goals in this series on AWS Serverless Analytics has been to demonstrate how Amazon QuickSight lets us build, share, and secure data visualizations and reports without having to manage server hardware, operating systems, or applications. In previous entries, I have explored AWS Glue, S3, Amazon Athena and, at a…

Serverless Data Engineering: Hands On with AWS Glue, Aurora, and Athena

This post follows up on my recent one, ‘AWS Serverless Analytics: The Promise…’, in which I described the value proposition for serverless analytics. In today’s update, I have a database hosted in Amazon Aurora, which we will crawl and automatically catalog with AWS Glue, load into an S3 data lake using Glue, and…

AWS Serverless Analytics: The Promise

As defined at cloudflare.com, a virtual machine is “software that imitates a complete computer system [my note: an operating system, applications, network interfaces; everything except hardware]. It is isolated from the rest of the machine that hosts it and behaves as if it were the only OS on it…” A container, which does not have…

PySpark or SparkSQL for Data Wrangling

Apache Spark is an established data processing engine for workflows that are large or complex enough to benefit from distributed processing across multiple compute nodes. I created this demo on a Spark instance that I spun up quickly and free of charge in Databricks Community Edition. While RDDs (Resilient Distributed Datasets) remain a foundation…

Python: What is Pandas’ equivalent to a slightly complex SQL query?

Our Python journey now takes us into Pandas DataFrames, whose native syntax is very unlike SQL, especially as queries become more analytically complex. We will answer the following question, based on an included public list of employees and their jobs. From a list where each row represents one employee, how many employee job titles in…