1  Introduction

The biggest mistake trying to learn SQL first. If you ever opened a book that introduces SQL (or any other programming language) to will find several chapters that outline what SQL (structured query language) is. An explanation from the Wikipedia page should be sufficient: “SQL a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS)”. Thus, in a nutshell we can remember that SQL is a language to manage data (tables) in databases.

The next chapters of a typical introduction shows you how to setup the software and the database. I’ll skip this step because setting up a database on your own computer does not make much sense and there are smarter ways to learn the first steps if you are a R user. For example, this website includes SQL code and output but this does not imply that I have connected to a database. Instead we can use R packages such as DBI to setup the connection and include your SQL code directly in your RMarkdown document or an R script. There is no need for a database in the beginning, we can make use of the the local memory to simulate a database, save a table (data frame) and run SQL commands directly as a code chunk from the R script. Consider the next console, it shows SQL code to select a variable from a data set as an example.

SELECT VAR FROM data;

To run this code from R without a database, we need to establish a local connection (con), writte the data into the local data base with dbWriteTable(). The next console shows how it works.

library(DBI)
library(RSQLite)

# Create in-memory RSQLite database
con <- dbConnect(drv = RSQLite::SQLite(), 
                 dbname = ":memory:")

#Write a table into the data base
dbWriteTable(conn = con, 
             name = "mtcars", 
             value = mtcars)

I hope you are familiar with the mtcars and the iris data if you want to reproduce these first SQL steps. Both are implemented in R and in this chapter we explore first SQL commands with this data. To this end, you need to insert a table into the data base. As the last console for the mtcars data highlighted. The dbListTables() functions lists all loaded tables of the used data base.

dbListTables(con)
[1] "df"     "iris"   "mtcars"

After we have made a connection and the data is available, the sqldf package let us run SQL code from R.

#Run SQL code from R
library(sqldf)
sqldf('SELECT * FROM mtcars LIMIT 5;')
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Thus, such R packages increases your learning curve substanially, since we can establish a data base and try some SQL code without any other equipment than your local machine. In Chapter XX we will establish a connection to a SQL data base, but for the moment there is no need to. The remaining part of this chapter shows first SQL commands.

1.1 Select

  • Use select to retrieve a table or a column from a table
  • You can select a single column from a table
  • Or select the entire table (data frame) with the wildcard *
SELECT * FROM mtcars;
Displaying records 1 - 10
mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
  • Limit the output by providing the number of lines
SELECT mpg, disp FROM mtcars LIMIT 5;
5 records
mpg disp
21.0 160
21.0 160
22.8 108
21.4 258
18.7 360
  • You can also insert a starting point that skips some observations. For instance, OFFSET 10 will skip the first ten table entries
  • Use quotations marks if the column contains special characters (like 'Petal.Width' from iris data)

1.2 Where

  • Define what you want to select with the Where option (SQL folks say clause)
  • For instance, the variable am is a binary indicator (0/1) and you can use where to select data only if am = 0
SELECT * FROM mtcars WHERE am = 0 LIMIT 5;
5 records
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
  • Remember to use quotation marks if you try to use where with non-numerical values from: e.g. != 'label'
SELECT * FROM iris WHERE Species = "virginica"  LIMIT 5;
5 records
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica

1.3 Count

  • Count counts cases!
SELECT COUNT(*) FROM mtcars;
1 records
COUNT(*)
32
  • We can count subgroups with the WHERE clause:
SELECT COUNT(am) FROM mtcars WHERE am != 0;
1 records
COUNT(am)
13

1.4 DISTINCT

  • Distinct can be used to find distinct values. For instance, there are three different species in the iris data:
SELECT DISTINCT Species FROM iris
3 records
Species
setosa
versicolor
virginica
  • As in other programming languages, we can combine several commands. For instance, we can COUNT how many Distinct species the iris data has:
SELECT COUNT (DISTINCT Species) FROM iris
1 records
COUNT (DISTINCT Species)
3

1.5 Insert Values

  • Next, I use a small data set (df) to illustrate how to insert values, make updates, and delete cases

  • My toy data set df has two observations with three variables: x,y,z

  • Never mind if you do not know what a tribble is, it is just a command to create data

library(tidyverse)

df <- tribble(
  ~x, ~y,  ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)
df
# A tibble: 2 × 3
  x         y     z
  <chr> <dbl> <dbl>
1 a         2   3.6
2 b         1   8.5
  • Now, we can insert new values into df by providing a list of the columns you want to fill in with values for each column:
INSERT INTO df (x, y, z) VALUES('c', 3, 1);

Let’s see whether it worked:

SELECT * FROM df;
3 records
x y z
a 2 3.6
b 1 8.5
c 3 1.0

1.6 Updates

  • Make updates for single (or multiple) values
  • For instance, we can update the variable z and set z = 77 for a certain level of another variable:
UPDATE df SET z = 77 WHERE x ='b';
  • Take care, without the WHERE clause all observation would get the new value!
SELECT * FROM df;
3 records
x y z
a 2 3.6
b 1 77.0
c 3 1.0

1.7 Delete

  • We can drop or delete observations, but of course we should take care since we probably do not want to delete the entire table, just for some implausible values

  • For this reason we use the WHERE clause again, for instance, to get rid of second row of the toy data set:

DELETE from df WHERE x = 'b';
SELECT * FROM df;
2 records
x y z
a 2 3.6
c 3 1.0

In summary, we have to select a table from the database, specify conditions with the where clause and we can use count to get a first impression of the data. We saw how to insert values, a really vague concept if you are used to work with data, but from a SQL perspective you give the database a update or imagine that a stream of new data needs an update. If SQL seems like something you have to get used to it, don’t be afraid, me too.