Expanding on my recent post on Serverless Data Engineering with AWS Glue, note that Athena is another AWS managed service with which we can query an S3 data lake, connected via the queryable AWS Glue data catalog, using the full set of standard SQL, including complex joins, subqueries, string manipulations, and window (aka… Read More
Serverless Data Engineering: Hands On with AWS Glue, Aurora, and Athena
This post follows up from my recent one entitled ‘AWS Serverless Analytics: The Promise…’ in which I described the value proposition for serverless analytics. In today’s update, I have a database hosted in Amazon Aurora, which we will crawl and automatically catalog with AWS Glue, load it into an S3 data lake using Glue, and… Read More
Pre-Clinical Biopharmaceutical R&D: Data Modeling Amid Scientific Complexity
The following data model diagram is a reference for the ‘Challenges and Solutions’ entry of the same title, available here. To protect intellectual property, the image is intentionally blurred. It’s not your eyes. (-;
Python: What is Pandas’ equivalent to a just-slightly complex SQL query?
Our Python journey now takes us into Pandas DataFrames, with a native syntax very unlike SQL, especially as queries become more analytically complex. We will answer the following question, based on an included public list of employees and their jobs. From a list where one row indicates one employee, how many employee job titles in… Read More
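As a rough illustration of how Pandas syntax differs from SQL, here is a minimal sketch. The excerpt above truncates the actual question, so this does not attempt to answer it; the `employees` DataFrame, its column names, and the threshold of 2 are all hypothetical. It shows one common pattern, the Pandas counterpart of `GROUP BY ... HAVING`.

```python
import pandas as pd

# Hypothetical data: one row per employee (not the post's actual list)
employees = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee", "Eli", "Fay"],
    "job_title": ["Analyst", "Analyst", "Engineer",
                  "Engineer", "Engineer", "Manager"],
})

# SQL equivalent:
#   SELECT job_title, COUNT(*) AS headcount
#   FROM employees
#   GROUP BY job_title
#   HAVING COUNT(*) >= 2;
headcount = (
    employees.groupby("job_title")   # GROUP BY job_title
    .size()                          # COUNT(*) per group
    .reset_index(name="headcount")   # back to a regular DataFrame
)

# HAVING becomes an ordinary boolean filter on the aggregated result
frequent = headcount[headcount["headcount"] >= 2]
print(frequent)
```

The aggregation and the filter are separate steps here, whereas SQL expresses both in one statement.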
NumPy: Index, Slice, and Aggregate a 2D Array
Python’s NumPy library is fun in that it’s easy to work with multi-dimensional data. For simplicity, consider a 2D array (aka matrix). I wrote some code to demonstrate the creation, simple visualization, slicing, and aggregation of data within a matrix, including totals and slice-subtotals. Source Code: It is available on GitHub: NumPy 2D Array… Read More
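The following is a minimal sketch of the operations named above, not the code from the linked repository. The 3x4 shape and the values 1 through 12 are assumptions chosen only to keep the arithmetic easy to check.

```python
import numpy as np

# Create a 3x4 matrix holding the values 1..12 (illustrative data)
matrix = np.arange(1, 13).reshape(3, 4)
print(matrix)

# Index a single element: row 1, column 2 (zero-based)
element = matrix[1, 2]

# Slice a sub-block: rows 0-1, columns 1-2
block = matrix[0:2, 1:3]

# Aggregate: grand total, per-column totals, per-row totals
grand_total = matrix.sum()
column_totals = matrix.sum(axis=0)
row_totals = matrix.sum(axis=1)

# Subtotal of just the sliced block
block_subtotal = block.sum()

print(element, grand_total, column_totals, row_totals, block_subtotal)
```

Note that `axis=0` collapses the rows and so yields one total per column, while `axis=1` yields one total per row.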
Python Object-Oriented Programming: Doing Math Just Once Beats Repetition
Although I don’t know whether OOP will be central to our exploration of NumPy, Pandas and other Python libraries for analytics, here is a simple example of what I find useful. I want to be able to perform any one of a set of related x,y matrix expressions, and do so repeatedly without re-specifying… Read More
Python Moment: Is ‘Never Odd or Even’ a Palindrome?
Quick little geek-out here: Had some initial fun with Python string manipulations in order to detect a palindrome, defined here as a word or phrase (perhaps a very long phrase) spelled the same when reversed as when forward. Had to dig just a bit deeper to accommodate any blank spaces that would otherwise violate the… Read More
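A minimal sketch of one way to run such a check, assuming that letter case and any non-alphanumeric characters (including blank spaces) should be ignored. The function name `is_palindrome` is my own, and this is not necessarily the approach taken in the full post.

```python
def is_palindrome(phrase: str) -> bool:
    """Return True if the phrase reads the same forward and reversed,
    ignoring case, spaces, and punctuation (an assumption of this sketch)."""
    # Keep only letters and digits, lowercased
    cleaned = [ch.lower() for ch in phrase if ch.isalnum()]
    # A reversed slice gives the characters in backward order
    return cleaned == cleaned[::-1]


print(is_palindrome("Never Odd or Even"))
print(is_palindrome("Never Odd or Evan"))
```

Without the filtering step, the blank spaces in the phrase would land in different positions once reversed and the comparison would fail.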
Live Presentation: Lean Data Model Storming For Project Leaders
Data Models are a-changin’! To learn about these changes, please join me Saturday, Oct 15, as I present “Lean Data Model Storming for Data Project Leaders” at the Southland Technology (SoTec) Conference 2016. To view my session abstract, click here. This premier event, underwritten by PMI, AITP, IIBA and QAI, will bring together hundreds of… Read More
Information Modeling for Biopharmaceutical R&D
I’ve just added a new entry to ‘Challenge – Solution – Impact’ based on a recent engagement. Click on the above title to have a look.
Data Preparation Is Easy with Alteryx
Without support from I.T., analysts increasingly need to perform data preparation tasks of varying complexity in order to wrangle data into shape for current analytic needs. Using Alteryx Designer, many such tasks are simple and intuitive. Let’s consider an example. For the completed Alteryx workflow sample published in Alteryx product documentation, assume that, due to… Read More