Learning SQL the Badass Way

Author

Edgar J. Treischl

Learning SQL the badass way is an unpublished manuscript which I started to summarize the content of an SQL course. Please keep that in mind when reading this book.

Preface

Learning SQL the badass way shows you the basic commands to manage data with SQL. I wrote this manual because learning new things takes me a lot of time and is often designed without taking our prior knowledge about data science into account. I started to learn SQL via an online course, but instead of digging myself into the applied side of handling data with SQL, I got a lot of information about SQL in general terms, data bases, and other related topics. However, I was able to read and to write SQL even when I had no prior experience working with SQL because I have a background in R (and Python) and SQL is not rocket science. Thus, this manual is the result from taking this course, which helps me when needed and it may help you to quickly learn SQL if you have a similiar background,

The manual summarizes the content of the IBM SQL for Data Science Course (from EDx) based on own examples and data which I attained in 2022. It’s a quick and dirty introduction for R Users: I assume that you are familiar with R and I will skip many typical steps of a real SQL introduction; certainly I will not explain why we need to wrangle data in the first place. This manual only summarizes main idiosyncrasies of SQL, not concepts that you probably know from any other programming experience. Please, close this manual if you want to learn SQL in a proper way. Go and attend a real course and where are many books and other resources available. However, if you are bored from long introduction what data is, why we need to learn how to wrangle data or other common aspects that come along the data science journey, feel free to join me.

Learning SQL is beneficial but maybe we need more motivational input. ChatGTP gives us the following reasons why we should learn SQL even if you are fluent in a language such as R:

  • Efficient data management: SQL is designed to work with relational databases, which are ideal for managing large amounts of structured data. By learning SQL, R users can efficiently query, retrieve and manage data from databases using SQL commands, making their data analysis tasks more efficient.

  • Collaborative work: SQL is a common language used by data analysts and data engineers, making it easier to collaborate and share data between different teams. By learning SQL, R users can communicate more effectively with other data professionals and work collaboratively on projects.

  • Integration with R: R users often work with data that is stored in databases, and SQL provides a way to query and retrieve data from these databases directly into R. This integration allows R users to take advantage of the power of SQL for data management, while still working with their preferred R environment.

  • Advanced data manipulation: SQL provides powerful features for data manipulation, including filtering, sorting, aggregating and joining data from multiple tables. By learning SQL, R users can take advantage of these advanced features to manipulate their data in more sophisticated ways.

  • Job market demand: SQL is a widely used skill in the data analytics job market. By learning SQL, R users can broaden their skillset and increase their job market competitiveness.

ChatGTP also helps us to get an overview about the most important R packages that are commonly used for working with SQL, including:

  • dplyr: A powerful package for data manipulation that can connect to various SQL databases and perform operations such as filtering, grouping, and joining (Wickham et al. 2023).

  • DBI: An R package that provides a common interface for connecting to various SQL databases (, Wickham, and Müller 2022).

  • RMySQL and RSQLite: Packages that provide an interface for connecting to MySQL and SQLite databases, respectively (Ooms et al. 2022; Müller et al. 2023).

  • RJDBC: A package that provides a JDBC interface for connecting to various databases, including Oracle, Microsoft SQL Server, and PostgreSQL (Urbanek 2022).

  • RODBC: A package that provides an ODBC interface for connecting to various databases, including Microsoft SQL Server and PostgreSQL (Ripley and Lapsley 2022).

  • sqldf: A package that allows you to run SQL queries on data frames in R (Grothendieck 2017).

Overall, the choice of which package to use will depend on the specific database you’re working with and your preferred interface. However, these packages should provide a good starting point for working with SQL in R and I will outline more about each package soon. Before we start can start with SQL, let us set the scope of this manual:

  • Chapter 1 introduces the basics SQL commands and shows you how to run SQl queries from the R console.

  • Chapter 2 elaborates on SQL knowledge in terms of data management and introduces, for example, two different types of SQL statements.

  • Chapter 3 shows how basic calculations are done in SQL: We can sort and group data or work with multiple tables.

  • Chapter 4 is a cheat sheet. It repeats all introduced SQL commands based on a simple code snippets. By doing so, we can copy and paste the code if needed.