1 Introduction

The biggest mistake trying to learn SQL first. If you ever opened a book that introduces SQL (or any other programming language) to will find several chapters that outline what SQL (structured query language) is. An explanation from the Wikipedia page should be sufficient: “SQL a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS)”. Thus, in a nutshell we can remember that SQL is a language to manage data (tables) in databases.

The next chapters of a typical introduction shows you how to setup the software and the database. I’ll skip this step because setting up a database on your own computer does not make much sense and there are smarter ways to learn the first steps if you are a R user. For example, this website includes SQL code and output but this does not imply that I have connected to a database. Instead we can use R packages such as DBI to setup the connection and include your SQL code directly in your RMarkdown document or an R script. There is no need for a database in the beginning, we can make use of the the local memory to simulate a database, save a table (data frame) and run SQL commands directly as a code chunk from the R script. Consider the next console, it shows SQL code to select a variable from a data set as an example.

SELECT VAR FROM data;

To run this code from R without a database, we need to establish a local connection (con), writte the data into the local data base with dbWriteTable(). The next console shows how it works.

library(DBI)
library(RSQLite)

# Create in-memory RSQLite database
con <- dbConnect(drv = RSQLite::SQLite(), 
                 dbname = ":memory:")

#Write a table into the data base
dbWriteTable(conn = con, 
             name = "mtcars", 
             value = mtcars)

I hope you are familiar with the mtcars and the iris data if you want to reproduce these first SQL steps. Both are implemented in R and in this chapter we explore first SQL commands with this data. To this end, you need to insert a table into the data base. As the last console for the mtcars data highlighted. The dbListTables() functions lists all loaded tables of the used data base.

dbListTables(con)

[1] "df"     "iris"   "mtcars"

After we have made a connection and the data is available, the sqldf package let us run SQL code from R.

#Run SQL code from R
library(sqldf)
sqldf('SELECT * FROM mtcars LIMIT 5;')

   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Thus, such R packages increases your learning curve substanially, since we can establish a data base and try some SQL code without any other equipment than your local machine. In Chapter XX we will establish a connection to a SQL data base, but for the moment there is no need to. The remaining part of this chapter shows first SQL commands.

1.1 Select

Use select to retrieve a table or a column from a table
You can select a single column from a table
Or select the entire table (data frame) with the wildcard *

SELECT * FROM mtcars;

Displaying records 1 - 10
mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2
18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1
14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2
22.8	4	140.8	95	3.92	3.150	22.90	1	0	4	2
19.2	6	167.6	123	3.92	3.440	18.30	1	0	4	4

Limit the output by providing the number of lines

SELECT mpg, disp FROM mtcars LIMIT 5;

5 records
mpg	disp
21.0	160
21.0	160
22.8	108
21.4	258
18.7	360

You can also insert a starting point that skips some observations. For instance, OFFSET 10 will skip the first ten table entries
Use quotations marks if the column contains special characters (like 'Petal.Width' from iris data)

1.2 Where

Define what you want to select with the Where option (SQL folks say clause)
For instance, the variable am is a binary indicator (0/1) and you can use where to select data only if am = 0

SELECT * FROM mtcars WHERE am = 0 LIMIT 5;

5 records
mpg	cyl	disp	hp	drat	wt	qsec	vs	gear	carb
21.4	6	258.0	110	3.08	3.215	19.44	1	3	1
18.7	8	360.0	175	3.15	3.440	17.02	0	3	2
18.1	6	225.0	105	2.76	3.460	20.22	1	3	1
14.3	8	360.0	245	3.21	3.570	15.84	0	3	4
24.4	4	146.7	62	3.69	3.190	20.00	1	4	2

Remember to use quotation marks if you try to use where with non-numerical values from: e.g. != 'label'

SELECT * FROM iris WHERE Species = "virginica"  LIMIT 5;

5 records
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
6.3	3.3	6.0	2.5	virginica
5.8	2.7	5.1	1.9	virginica
7.1	3.0	5.9	2.1	virginica
6.3	2.9	5.6	1.8	virginica
6.5	3.0	5.8	2.2	virginica

1.3 Count

Count counts cases!

SELECT COUNT(*) FROM mtcars;

1 records
COUNT(*)
32

We can count subgroups with the WHERE clause:

SELECT COUNT(am) FROM mtcars WHERE am != 0;

1 records
COUNT(am)
13

1.4 DISTINCT

Distinct can be used to find distinct values. For instance, there are three different species in the iris data:

SELECT DISTINCT Species FROM iris

3 records
Species
setosa
versicolor
virginica

As in other programming languages, we can combine several commands. For instance, we can COUNT how many Distinct species the iris data has:

SELECT COUNT (DISTINCT Species) FROM iris

1 records
COUNT (DISTINCT Species)
3

1.5 Insert Values

Next, I use a small data set (df) to illustrate how to insert values, make updates, and delete cases
My toy data set df has two observations with three variables: x,y,z
Never mind if you do not know what a tribble is, it is just a command to create data

library(tidyverse)

df <- tribble(
  ~x, ~y,  ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)
df

# A tibble: 2 × 3
  x         y     z
  <chr> <dbl> <dbl>
1 a         2   3.6
2 b         1   8.5

Now, we can insert new values into df by providing a list of the columns you want to fill in with values for each column:

INSERT INTO df (x, y, z) VALUES('c', 3, 1);

Let’s see whether it worked:

SELECT * FROM df;

3 records
x	y	z
a	2	3.6
b	1	8.5
c	3	1.0

1.6 Updates

Make updates for single (or multiple) values
For instance, we can update the variable z and set z = 77 for a certain level of another variable:

UPDATE df SET z = 77 WHERE x ='b';

Take care, without the WHERE clause all observation would get the new value!

SELECT * FROM df;

3 records
x	y	z
a	2	3.6
b	1	77.0
c	3	1.0

1.7 Delete

We can drop or delete observations, but of course we should take care since we probably do not want to delete the entire table, just for some implausible values
For this reason we use the WHERE clause again, for instance, to get rid of second row of the toy data set:

DELETE from df WHERE x = 'b';

SELECT * FROM df;

2 records
x	y	z
a	2	3.6
c	3	1.0

In summary, we have to select a table from the database, specify conditions with the where clause and we can use count to get a first impression of the data. We saw how to insert values, a really vague concept if you are used to work with data, but from a SQL perspective you give the database a update or imagine that a stream of new data needs an update. If SQL seems like something you have to get used to it, don’t be afraid, me too.