SELECT VAR FROM data;
1 Introduction
The biggest mistake trying to learn SQL first. If you ever opened a book that introduces SQL (or any other programming language) to will find several chapters that outline what SQL (structured query language) is. An explanation from the Wikipedia page should be sufficient: “SQL a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS)”. Thus, in a nutshell we can remember that SQL is a language to manage data (tables) in databases.
The next chapters of a typical introduction shows you how to setup the software and the database. I’ll skip this step because setting up a database on your own computer does not make much sense and there are smarter ways to learn the first steps if you are a R user. For example, this website includes SQL code and output but this does not imply that I have connected to a database. Instead we can use R packages such as DBI
to setup the connection and include your SQL code directly in your RMarkdown document or an R script. There is no need for a database in the beginning, we can make use of the the local memory to simulate a database, save a table (data frame) and run SQL commands directly as a code chunk from the R script. Consider the next console, it shows SQL code to select a variable from a data set as an example.
To run this code from R without a database, we need to establish a local connection (con
), writte the data into the local data base with dbWriteTable()
. The next console shows how it works.
library(DBI)
library(RSQLite)
# Create in-memory RSQLite database
<- dbConnect(drv = RSQLite::SQLite(),
con dbname = ":memory:")
#Write a table into the data base
dbWriteTable(conn = con,
name = "mtcars",
value = mtcars)
I hope you are familiar with the mtcars
and the iris
data if you want to reproduce these first SQL steps. Both are implemented in R and in this chapter we explore first SQL commands with this data. To this end, you need to insert a table into the data base. As the last console for the mtcars
data highlighted. The dbListTables()
functions lists all loaded tables of the used data base.
dbListTables(con)
[1] "df" "iris" "mtcars"
After we have made a connection and the data is available, the sqldf
package let us run SQL code from R.
#Run SQL code from R
library(sqldf)
sqldf('SELECT * FROM mtcars LIMIT 5;')
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Thus, such R packages increases your learning curve substanially, since we can establish a data base and try some SQL code without any other equipment than your local machine. In Chapter XX we will establish a connection to a SQL data base, but for the moment there is no need to. The remaining part of this chapter shows first SQL commands.
1.1 Select
- Use select to retrieve a table or a column from a table
- You can select a single column from a table
- Or select the entire table (data frame) with the wildcard
*
SELECT * FROM mtcars;
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|
21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
- Limit the output by providing the number of lines
SELECT mpg, disp FROM mtcars LIMIT 5;
mpg | disp |
---|---|
21.0 | 160 |
21.0 | 160 |
22.8 | 108 |
21.4 | 258 |
18.7 | 360 |
- You can also insert a starting point that skips some observations. For instance,
OFFSET 10
will skip the first ten table entries - Use quotations marks if the column contains special characters (like
'Petal.Width'
from iris data)
1.2 Where
- Define what you want to select with the
Where
option (SQL folks say clause) - For instance, the variable
am
is a binary indicator (0/1) and you can use where to select data only ifam = 0
SELECT * FROM mtcars WHERE am = 0 LIMIT 5;
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|
21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
- Remember to use quotation marks if you try to use where with non-numerical values from: e.g.
!= 'label'
SELECT * FROM iris WHERE Species = "virginica" LIMIT 5;
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
6.3 | 3.3 | 6.0 | 2.5 | virginica |
5.8 | 2.7 | 5.1 | 1.9 | virginica |
7.1 | 3.0 | 5.9 | 2.1 | virginica |
6.3 | 2.9 | 5.6 | 1.8 | virginica |
6.5 | 3.0 | 5.8 | 2.2 | virginica |
1.3 Count
- Count counts cases!
SELECT COUNT(*) FROM mtcars;
COUNT(*) |
---|
32 |
- We can count subgroups with the
WHERE
clause:
SELECT COUNT(am) FROM mtcars WHERE am != 0;
COUNT(am) |
---|
13 |
1.4 DISTINCT
Distinct
can be used to find distinct values. For instance, there are three different species in the iris data:
SELECT DISTINCT Species FROM iris
Species |
---|
setosa |
versicolor |
virginica |
- As in other programming languages, we can combine several commands. For instance, we can
COUNT
how manyDistinct
species the iris data has:
SELECT COUNT (DISTINCT Species) FROM iris
COUNT (DISTINCT Species) |
---|
3 |
1.5 Insert Values
Next, I use a small data set (df) to illustrate how to insert values, make updates, and delete cases
My toy data set
df
has two observations with three variables:x,y,z
Never mind if you do not know what a
tribble
is, it is just a command to create data
library(tidyverse)
<- tribble(
df ~x, ~y, ~z,
"a", 2, 3.6,
"b", 1, 8.5
) df
# A tibble: 2 × 3
x y z
<chr> <dbl> <dbl>
1 a 2 3.6
2 b 1 8.5
- Now, we can insert new values into
df
by providing a list of the columns you want to fill in with values for each column:
INSERT INTO df (x, y, z) VALUES('c', 3, 1);
Let’s see whether it worked:
SELECT * FROM df;
x | y | z |
---|---|---|
a | 2 | 3.6 |
b | 1 | 8.5 |
c | 3 | 1.0 |
1.6 Updates
- Make updates for single (or multiple) values
- For instance, we can update the variable z and set
z = 77
for a certain level of another variable:
UPDATE df SET z = 77 WHERE x ='b';
- Take care, without the
WHERE
clause all observation would get the new value!
SELECT * FROM df;
x | y | z |
---|---|---|
a | 2 | 3.6 |
b | 1 | 77.0 |
c | 3 | 1.0 |
1.7 Delete
We can drop or delete observations, but of course we should take care since we probably do not want to delete the entire table, just for some implausible values
For this reason we use the
WHERE
clause again, for instance, to get rid of second row of the toy data set:
DELETE from df WHERE x = 'b';
SELECT * FROM df;
x | y | z |
---|---|---|
a | 2 | 3.6 |
c | 3 | 1.0 |
In summary, we have to select a table from the database, specify conditions with the where clause and we can use count to get a first impression of the data. We saw how to insert values, a really vague concept if you are used to work with data, but from a SQL perspective you give the database a update or imagine that a stream of new data needs an update. If SQL seems like something you have to get used to it, don’t be afraid, me too.