SQL for Data Science
December 15th, 2017
This coursera mooc is offered by the University of California, Davis. It uses SQLite as its teaching language.
I'll be honest: I have always put off learning databases as I have never needed them in my personal projects. But as a lifelong learner, and given SQL's ubiquity in industry, I knew it was a perspective worth learning, worth putting in the effort, and finally time to begin learning.
SQL is worth learning. As for this particular mooc itself: I don't know if I'd recommend it. Actually I'm on the fence about it. The instructor Sadie St. Lawrence is good, she puts in the personal care of any good teacher, it shows in the videos, but in terms of the course content and pace: It depends on what you're looking for and the background you're coming from. This mooc worked for me because I am self-motivated, I have enough experience working with data (in R for example), that I could piece together everything that is otherwise missing or rushed in a four week course.
If you have a strong coding background already and just want a brief introduction into the landscape and best practices, knowing you can then move onto reference manuals for practice, then yes, this course is worth it. If you're just starting out in coding though, I'd say maybe not.
course strengths: Concepts as well as grammar are introduced and explained with time and care, notably context is given when looking at comparisons and exceptions (to grammar constructs or SQL implementations for example). Best practices were frequently given.
course weaknesses: Actual code examples were lacking overall. With a four week course there's not enough time to look at broader design issues with the language, efficiency tradeoffs, optimizations, etc.
Week 1 - Selecting and Retrieving Data with SQL
December 15th, 2017
As with any first week, we got "the tour"—expectations of the course. What I found valuable in particular was the attention paid to giving an overview of "sequel" (I now know this is how it's pronounced). Although we're not exposed to too much grammar just yet, we're given enough about what it is, where and how it fits into industry that I feel I'm starting to get a feel for SQL as a technology.
By way of analogy, if you look at web browsers you have a "franchise"-like model there, where HTML/CSS/Javascript is a technology stack specification, and each actual browser (such as firefox or chrome) are vendors. If they're specification compliant, it means they predictably read webpages according to the standard.
SQL then is a specification like HTML, but it's weaker so that its vendors can reinterpret the grammar itself just a little. You end up with more variation for each database management system. You end up with different dialects. Furthermore, the prevalence of databases in industry means a lot of modern programming languages such as python or R offer support for interacting with database files directly.
Finally, the most useful content in this week was the introduction to data modelling with entities, attributes, relationships, with additional offsite links exploring ER Diagrams, not to mention the contrast with NoSQL technologies.
grammar: SELECT, CREATE TABLE, INSERT INTO
Week 2 - Filtering, Sorting and Calculating Data with SQL
December 22nd, 2017
My favorite part of this week is the Chinook Database. Another reason why I haven't learned databases until now is I've never found a nice practice database to work with. I think partly I like Chinook because it has Alberta addresses, which is local for me, so it feels a little more personal :)
Otherwise, this week really gets us started into the guts of the grammar, using WHERE as the platform for various filtering, soring, and general comparison tests. We're introduced to math operators and aggregate functions such as COUNT, not to mention we get to see the use of wildcards % _ []. I like how it's noted [] is implementation specific and not actually in sqlite (but worth knowing when working with alternative implementations of SQL). Finally I also like the attention to detail in contrasting IN with OR.
grammar: SELECT FROM WHERE, WHERE IN, WHERE OR, ORDER BY, GROUP BY (HAVING).
Week 3 - Subqueries and Joins with SQL
December 29th, 2017
Our third week is back to theory. It took me a little while to actually figure out joins.
I thought the exposition could have been better. The overview was decent, offering screenshot worthy reference material for join types such as inner joins, left joins, aliases, as well as self-joins for example. Right joins are mentioned but aren't in sqlite, and are isomorphic to left joins anyway. Regardless, I still had to go outside the course to find material introducing joins which made more sense to me, and it only really started making sense after getting some practice. If it were me explaining joins to someone else (with the understanding I now have), I would put it like this: The prerequisite concept needed to understand joins is the idea of comparison. What does it mean to compare?
We take it for granted. This lack of understanding is the cause of many misguided political debates for example, especially regarding social welfare: "This group would be better off if they adapted [this] way of existing!" The [this] being the detail which people think is comparable but don't realize is not. More simply put: If you have a group of 5 apples and a group of 5 potatoes, you've abstracted out a common pattern from each which allows them to be comparable. But if for example you've abstracted out some property of fruit and some property of vegetables not mutually shared, then from this perspective they're not comparable. The point is, you can't automatically assume things are comparable, you have to specify a measure of comparison and show it is applicable. People take this for granted.
As for joins, they're a way of merging two tables. Merging not by structure but by content. In order to do this there needs to be a means of comparing content to know what to include. That's why certain columns are given keys. Up until now I had wondered the point of keys. If two tables share a common column which allows for comparison, that's how those tables can be merged, by matching up rows. Beyond that realization then, different types of joins are just about variations on this overall theme.
Beyond that, subqueries were also discussed. I especially liked the focus (if not limited) on best practices for runtime optimizations, and their general tradeoffs with other approaches.
Week 4 - Modifying and Analyzing Data with SQL
January 5th, 2018
I recall reading "Nobody actually knows how GROUP BY works!" in one of the auxiliary links to this course. That's been bothering me, and even this mooc doesn't really go into detail about how this grammar point works, so I've been exploring it further on my own.
I can't speak to the full expressivity of it, but as best I can tell it works a lot like fold (or accumulate, or reduce; different names for the same thing) in functional languages. It takes a row, and looks for the next group by match, and replaces it. It does this until it goes through the table. It only keeps the last row. This pattern then changes by filtering which columns to keep (SELECT), and then only acts like our functional fold when you add in aggregate math operators such as SUM() or COUNT(). When looking at it this way, it makes so much more sense now!
This final week introduces strings and their core operators: concatenation, trimming, substrings, data/time strings. We finally also get to see control flow branching with CASE. It's pretty standard stuff you'll find in any library on data manipulation in modern languages.
Lastly, the week ends with a light discussion on views which I'd like to explore further. I admit I don't think I fully appreciate their value just yet. It ends with how SQL is used in data science, which is expected given the title of the course. I greatly appreciated the best practice advice on data governance, profiling and other troubleshooting strategies for debugging. Learning a language is never just about grammar after all.