Our Python journey now takes us into Pandas DataFrames, with a native syntax very unlike SQL, especially as queries become more analytically complex.
We will answer the following question, based on an included public list of employees and their jobs. From a list where one row indicates one employee, how many employee job titles in [choose a year] are held by just one employee?
If the data resided instead in a single relational table, the following, relatively simply, SQL query would provide the answer:
SELECT COUNT(*) FROM
( SELECT JobTitle, Count(Id) from sal GROUP BY JobTitle HAVING Count(Id) = 1 WHERE Year = 2013 )
I love SQL but, alas, we won’t use it here, except as a logical reference with which to write Pandas DataFrame code. Our challenge here is to pull this answer from a CSV file using Python’s Pandas library, and I’ll volunteer something this: Although Pandas has a huge variety of built-in methods that would, no doubt, make this simpler, we are (rather, I am) just dipping my toes into Pandas, so we’ll work with what I know.
To view the solution code, you can go to GitHub, and then click on ‘Pandas Analytics Query1’. It includes an easy-to-follow set of steps to derive the solution, as well as an all-in-one solution.
However, if you prefer just to see the Pandas code that will anwer the question, here is the one-line script:
dfsal[ dfsal[‘Year’]==2013].groupby(‘JobTitle’).count()[ dfsal[ dfsal[‘Year’]==2013].groupby(‘JobTitle’).count()[‘Id’]==1 ].count()[‘Id’]
Impression: The Pandas equivalent — especially as I’m just starting in Pandas — of even a slightly complex SQL query is a bit challenging and ends up being hard to read, but breaking it down step by step, as is shown in GitHub, shows that, with some familiarity with it’s syntax, completing this task is very do-able.
Our Python journey continues. Our next topic will probably involve ipython-sql and sqlalchemy. With these two Python libraries, we can connect, via ODBC or otherwise, directly to a relational database, extract data using a standard SQL query embedded in Pandas, perform various Python analytics and/or join it to other data formats, and write our results back into a relational database. Stay tuned!