Lecture 20
| Task | Timing (ns) | Timing (μs) |
|---|---|---|
| L1 cache reference | 0.5 | 0.0005 |
| L2 cache reference | 7 | 0.007 |
| Main memory reference | 100 | 0.1 |
| Random seek SSD | 150,000 | 150 |
| Read 1 MB sequentially from memory | 250,000 | 250 |
| Read 1 MB sequentially from SSD | 1,000,000 | 1,000 |
| Disk seek | 10,000,000 | 10,000 |
| Read 1 MB sequentially from disk | 20,000,000 | 20,000 |
| Send packet CA->Netherlands->CA | 150,000,000 | 150,000 |
Let’s imagine we have a 10 GB flat data file and that we want to select certain rows based on a particular criterion (filtering).
This requires a sequential read across the entire data set.
| File Location | Performance | Time |
|---|---|---|
| in memory | \(10~GB \times (250~\mu s / 1~MB)\) | 2.5 seconds |
| on disk (SSD) | \(10~GB \times (1~ms / 1~MB)\) | 10 seconds |
| on disk (HD) | \(10~GB \times (20~ms / 1~MB)\) | 200 seconds |
This is just for reading sequential data; if we make any modifications (writing) or the data is fragmented, things are much worse.
Disk << SSD << Memory (the exact gaps depend on the access pattern, but each step is at least an order of magnitude).
Disk is cheap but slow; memory is fast but limited.
What do we do when our data doesn’t fit in memory?
Create blocks — group related rows and read multiple rows at a time, loading only what’s needed (skip what we don’t want).
Optimal block size depends on the task and the storage medium.
Blocks reduce I/O, but any query still requires scanning all of them — a linear search requiring \(\mathcal{O}(N)\) reads.
We can do better if we are careful about how we structure our data, specifically sorting some (or all) of the columns.
Sorting is expensive, \(\mathcal{O}(N \log N)\), but it only needs to be done once.
After sorting, we can use a binary search for any subsetting tasks, \(\mathcal{O}(\log N)\).
In databases these “sorted” columns are referred to as indexes.
Indexes require additional storage, but are usually small enough to be kept in memory even when the blocks need to stay on disk.
Implementing blocks, indexes, and query optimization correctly is hard:
The optimal strategy depends on your queries, and queries change over time
Data is modified (inserts, updates, deletes) — structures must stay consistent
Multiple users may access data simultaneously — concurrency must be managed
Failures happen — data must remain correct and recoverable
A database management system (DBMS) handles all of this for you, plus provides a standard query language (SQL) to express what you want, not how to get it.
SQLite is a lightweight, serverless, relational database engine:
Self-contained — the entire database lives in a single file (or in memory)
No separate server process — the library runs inside your application
Accessible directly via a CLI (sqlite3) or through language bindings (R, Python, etc.)
Full SQL support — joins, indexes, transactions, views, and more
Ubiquitous — embedded in browsers, phones, operating systems, and countless apps — likely the most widely deployed database engine in the world
It is not designed for high-concurrency or multi-user write workloads, but is an excellent tool for learning SQL and for single-user / embedded use cases.
The DBI package provides a consistent R interface to most relational databases. Rather than learning a different API for each database, DBI gives you one set of functions for connecting, querying, and managing tables.
RSQLite provides the backend needed to use DBI with SQLite databases. Postgres, MySQL, DuckDB, and many others follow similar DBI patterns.
RSQLite re-exports DBI functions so we do not need to load DBI as well.
Once loaded, create a connection to your database:
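A minimal sketch of creating a connection (assuming the database file is named `employees.sqlite`, as in the output below):

```r
library(RSQLite)  # re-exports the DBI interface, so library(DBI) is not needed

# Connect to the SQLite database file (created if it does not exist)
con = dbConnect(RSQLite::SQLite(), "employees.sqlite")
str(con)
```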
Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
..@ ptr :<externalptr>
..@ dbname : chr "employees.sqlite"
..@ loadable.extensions: logi TRUE
..@ flags : int 70
..@ vfs : chr ""
..@ ref :<environment: 0x10a38e720>
..@ bigint : chr "integer64"
..@ extended_types : logi FALSE
Let’s walk through DBI operations using a simple employee table.
employees = tibble(
name = c("Alice","Bob","Carol","Dave","Eve","Frank"),
email = c("alice@company.com", "bob@company.com",
"carol@company.com", "dave@company.com",
"eve@company.com", "frank@company.com"),
salary = c(52000, 40000, 30000, 33000, 44000, 37000),
dept = c("Accounting", "Accounting","Sales",
"Accounting","Sales","Sales"),
)
Database queries are transactional (see ACID) and follow these steps:
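The canonical DBI pattern looks like the following (a sketch, assuming the `con` connection and an `employees` table created from the tibble above):

```r
res = dbSendQuery(con, "SELECT * FROM employees")  # 1. send the query
df  = dbFetch(res)                                 # 2. fetch the result rows
dbClearResult(res)                                 # 3. release the result set
```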
dbGetQuery() combines all the preceding steps:
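For example (assuming the same connection and table):

```r
# Send, fetch, and clear in a single call
dbGetQuery(con, "SELECT * FROM employees")
```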
name email salary dept
1 Alice alice@company.com 52000 Accounting
2 Bob bob@company.com 40000 Accounting
3 Carol carol@company.com 30000 Sales
4 Dave dave@company.com 33000 Accounting
5 Eve eve@company.com 44000 Sales
6 Frank frank@company.com 37000 Sales
dbCreateTable() will create a new table with a schema based on an existing data.frame / tibble, but it does not populate that table with data.
Use dbAppendTable() to add data to an existing table.
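One possible sequence using the built-in `iris` data (a sketch; the table name is an assumption based on the output below):

```r
dbCreateTable(con, "iris", iris)   # creates the schema only - no rows yet
dbAppendTable(con, "iris", iris)   # inserts the 150 rows
dplyr::tbl(con, "iris") |> dplyr::collect()
```

Note that SQLite has no factor type, so `Species` comes back as a character column.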
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
We can explicitly close the connection to the database when we’re done with it, but R will automatically close it when the connection object is garbage collected.
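Explicitly closing looks like:

```r
# Close the connection and release the file handle
dbDisconnect(con)
```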
We will now create a new SQLite database,
and copy data using dplyr’s convenience functions
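A sketch of this, assuming the `nycflights13` package as the data source (the file name matches the output below):

```r
con = dbConnect(RSQLite::SQLite(), "flights.sqlite")

# copy_to() creates the table and inserts the data in one step;
# temporary = FALSE makes the table persist in the database file
flight_tbl = dplyr::copy_to(
  con, nycflights13::flights, name = "flights", temporary = FALSE
)
flight_tbl
```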
# Source: table<`flights`> [?? x 19]
# Database: sqlite 3.51.2 [flights.sqlite]
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ℹ more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dbl>
All of this data now lives in the database on the filesystem, not in memory:
List of 2
$ src :List of 2
..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
.. .. ..@ ptr :<externalptr>
.. .. ..@ dbname : chr "flights.sqlite"
.. .. ..@ loadable.extensions: logi TRUE
.. .. ..@ flags : int 70
.. .. ..@ vfs : chr ""
.. .. ..@ ref :<environment: 0x109b39c18>
.. .. ..@ bigint : chr "integer64"
.. .. ..@ extended_types : logi FALSE
..$ disco: NULL
..- attr(*, "class")= chr [1:4] "src_SQLiteConnection" "src_dbi" "src_sql" "src"
$ lazy_query:List of 5
..$ x : 'dbplyr_table_path' chr "`flights`"
..$ vars : chr [1:19] "year" "month" "day" "dep_time" ...
..$ group_vars: chr(0)
..$ order_vars: NULL
..$ frame : NULL
..- attr(*, "class")= chr [1:3] "lazy_base_remote_query" "lazy_base_query" "lazy_query"
- attr(*, "class")= chr [1:5] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
If we have a database with an existing table, we can create a reference to it using dplyr::tbl(),
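For example (assuming the `flights.sqlite` file created earlier):

```r
con = dbConnect(RSQLite::SQLite(), "flights.sqlite")

# Create a lazy reference to the existing table - no data is read yet
flight_tbl = dplyr::tbl(con, "flights")
flight_tbl
```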
# Source: table<`flights`> [?? x 19]
# Database: sqlite 3.51.2 [flights.sqlite]
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ℹ more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dbl>
Just like with local data frames, dplyr can be used to manipulate data in the database.
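A query along these lines produces output like that shown below (a sketch; the specific filter used in the original is not shown, so this one is illustrative):

```r
flight_tbl |>
  filter(dep_delay > 0) |>        # verbs are translated to SQL, not run in R
  select(origin, dest, tailnum)
```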
# Source: SQL [?? x 3]
# Database: sqlite 3.51.2 [flights.sqlite]
origin dest tailnum
<chr> <chr> <chr>
1 EWR CLT N152UW
2 EWR IAH N535UA
3 JFK MIA N5BSAA
4 JFK SJU N531JB
5 JFK BQN N827JB
6 LGA IAH N15710
7 JFK IAD N825AS
8 EWR TPA N802UA
9 LGA ATL N996DL
10 JFK FLL N627JB
# ℹ more rows
dplyr / dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.
When building a query, we don’t want the entire table; often we want just enough to check whether our query is working / makes sense.
Since we would prefer to run one complex query over many simple queries, laziness allows verbs to be strung together.
Therefore, by default dplyr
won’t connect and query the database until absolutely necessary (e.g. showing output),
and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like.
We can force evaluation via compute(), collect(), or collapse()
show_query() - dplyr to SQL
In general, dplyr / dbplyr knows how to translate basic math, logical, and summary functions from R to SQL.
dbplyr has the function, translate_sql(), that lets you experiment with how R functions are translated to SQL.
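Calls along these lines produce the translations shown below (a sketch; recent dbplyr versions expect an explicit `con` argument, with `simulate_dbi()` giving the generic ANSI translation):

```r
library(dbplyr)

translate_sql(x == 1 && (y < 2 || z > 3), con = simulate_dbi())
translate_sql(x^2 < 10,                   con = simulate_dbi())
translate_sql(x %% 2 == 10,               con = simulate_dbi())
```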
<SQL> `x` = 1.0 AND (`y` < 2.0 OR `z` > 3.0)
<SQL> (POWER(`x`, 2.0)) < 10.0
<SQL> (`x` % 2.0) = 10.0
Error in `sd()`:
! `sd()` is not available in this SQL variant.
<SQL> CONCAT_WS(' ', `x`, `y`)
Warning: Windowed expression `SUM(`x`)` does not have explicit order.
ℹ Please use `arrange()` or `window_order()` to make deterministic.
<SQL> SUM(`x`) OVER (ROWS UNBOUNDED PRECEDING)
<SQL> LAG(`x`, 1, NULL) OVER ()
By default dbplyr::translate_sql() will translate R / dplyr code into ANSI SQL; if we want to see results specific to a certain database, we can pass in a connection object.
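For example (assuming `con` is the SQLite connection from earlier; the translations below are backend-specific):

```r
# The same expressions, translated using the backend's SQL dialect
translate_sql(sd(x),       con = con)
translate_sql(paste(x, y), con = con)
```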
<SQL> STDEV(`x`) OVER ()
<SQL> `x` || ' ' || `y`
Warning: Windowed expression `SUM(`x`)` does not have explicit order.
ℹ Please use `arrange()` or `window_order()` to make deterministic.
<SQL> SUM(`x`) OVER (ROWS UNBOUNDED PRECEDING)
<SQL> LAG(`x`, 1, NULL) OVER ()
Not all R functions have a translation to SQL, and not all translations are perfect: when dbplyr encounters a function it doesn’t know how to translate, it will include the function verbatim and hope for the best.
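For example, grepl() has no SQLite translation, so a query like the following fails only when the database actually runs it (the pattern here is illustrative):

```r
flight_tbl |>
  filter(grepl("^N5", tailnum)) |>  # grepl passed through verbatim to SQLite
  collect()                          # error surfaces at evaluation time
```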
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! no such function: grepl
Join verbs (left_join(), inner_join(), etc.) work with database tables exactly as they do with data frames:
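A sketch consistent with the output and generated SQL below (assuming an `airlines` table also exists in the database):

```r
airline_tbl = tbl(con, "airlines")

flight_tbl |>
  left_join(airline_tbl, by = "carrier") |>  # join happens in the database
  count(name, origin) |>
  arrange(desc(n))
```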
# Source: SQL [?? x 3]
# Database: sqlite 3.51.2 [flights.sqlite]
# Ordered by: desc(n)
name origin n
<chr> <chr> <int>
1 United Air Lines Inc. EWR 46087
2 ExpressJet Airlines Inc. EWR 43939
3 JetBlue Airways JFK 42076
4 Delta Air Lines Inc. LGA 23067
5 Delta Air Lines Inc. JFK 20701
6 Envoy Air LGA 16928
7 American Airlines Inc. LGA 15459
8 Endeavor Air Inc. JFK 14651
9 American Airlines Inc. JFK 13783
10 US Airways Inc. LGA 13136
# ℹ more rows
<SQL>
SELECT `name`, `origin`, COUNT(*) AS `n`
FROM (
SELECT `flights`.*, `name`
FROM `flights`
LEFT JOIN `airlines`
ON (`flights`.`carrier` = `airlines`.`carrier`)
) AS `q01`
GROUP BY `name`, `origin`
ORDER BY `n` DESC
All dplyr operations on a database table are translated to SQL and executed by the database — no R computation happens until you collect().
This means you cannot mix local objects with database tables:
Error in `auto_copy()`:
! `x` and `y` must share the same src.
ℹ `x` is a <tbl_SQLiteConnection/tbl_dbi/tbl_sql/tbl_lazy/tbl>
object.
ℹ `y` is a <data.frame> object.
ℹ Set `copy = TRUE` if `y` can be copied to the same source as `x`
(may be slow).
To work around this, either collect() the database table first, or copy the local object into the database:
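A sketch of the `copy = TRUE` approach (the local data frame here is hypothetical):

```r
# A local data frame of tail numbers of interest (illustrative values)
planes_of_interest = data.frame(tailnum = c("N14228", "N24211"))

# copy = TRUE uploads the local table into a temporary database table
flight_tbl |>
  semi_join(planes_of_interest, by = "tailnum", copy = TRUE)
```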
# Source: SQL [?? x 19]
# Database: sqlite 3.51.2 [flights.sqlite]
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 542 540 2 923
2 2013 1 1 544 545 -1 1004
3 2013 1 1 557 600 -3 838
4 2013 1 1 558 600 -2 849
5 2013 1 1 558 600 -2 853
6 2013 1 1 558 600 -2 924
7 2013 1 1 559 559 0 702
8 2013 1 1 606 610 -4 837
9 2013 1 1 611 600 11 945
10 2013 1 1 613 610 3 925
# ℹ more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dbl>
Window functions compute a value for each row using values from a surrounding partition. Within a group_by() + mutate(), dplyr translates these to SQL OVER() clauses:
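One way to express this, consistent with the output and SQL below (ranking the three worst departure delays per origin):

```r
flight_tbl |>
  filter(!is.na(dep_delay)) |>
  group_by(origin) |>
  mutate(delay_rank = min_rank(desc(dep_delay))) |>  # becomes RANK() OVER (...)
  filter(delay_rank <= 3) |>
  select(origin, dest, dep_delay, delay_rank) |>
  arrange(origin, delay_rank)
```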
# Source: SQL [?? x 4]
# Database: sqlite 3.51.2 [flights.sqlite]
# Groups: origin
# Ordered by: origin, delay_rank
origin dest dep_delay delay_rank
<chr> <chr> <dbl> <int>
1 EWR ORD 1126 1
2 EWR MIA 896 2
3 EWR ORD 878 3
4 JFK HNL 1301 1
5 JFK CMH 1137 2
6 JFK SFO 1014 3
7 LGA MSP 911 1
8 LGA ATL 898 2
9 LGA DEN 853 3
<SQL>
SELECT `origin`, `dest`, `dep_delay`, `delay_rank`
FROM (
SELECT
`flights`.*,
CASE
WHEN (NOT((`dep_delay` IS NULL))) THEN RANK() OVER (PARTITION BY `origin`, (CASE WHEN ((`dep_delay` IS NULL)) THEN 1 ELSE 0 END) ORDER BY `dep_delay` DESC)
END AS `delay_rank`
FROM `flights`
WHERE (NOT((`dep_delay` IS NULL)))
) AS `q01`
WHERE (`delay_rank` <= 3.0)
ORDER BY `origin`, `delay_rank`
There are two packages that implement this in R, which take different approaches:
# A tibble: 181 × 3
origin dest n
<chr> <chr> <int>
1 EWR ATL 15
2 EWR AUS 3
3 EWR AVL 1
4 EWR BNA 7
5 EWR BOS 17
6 EWR BTV 3
7 EWR BUF 2
8 EWR BWI 1
9 EWR CHS 4
10 EWR CLE 4
# ℹ 171 more rows
# A tibble: 181 × 3
origin dest n
<chr> <chr> <int>
1 JFK LAX 32
2 LGA ORD 31
3 LGA ATL 30
4 JFK SFO 24
5 LGA CLT 22
6 EWR ORD 18
7 EWR SFO 18
8 EWR BOS 17
9 LGA MIA 17
10 EWR LAX 16
# ℹ 171 more rows
   origin dest  n
1     EWR  ATL 15
2     EWR  AUS  3
3     EWR  AVL  1
4     EWR  BNA  7
5     EWR  BOS 17
6     EWR  BTV  3
7     EWR  BUF  2
8     EWR  BWI  1
9     EWR  CHS  4
10    EWR  CLE  4
... (171 additional rows omitted)
# A tibble: 181 × 3
origin dest n
<chr> <chr> <int>
1 JFK LAX 32
2 LGA ORD 31
3 LGA ATL 30
4 JFK SFO 24
5 LGA CLT 22
6 EWR ORD 18
7 EWR SFO 18
8 EWR BOS 17
9 LGA MIA 17
10 EWR LAX 16
# ℹ 171 more rows
dplyr / dbplyr translates R expressions into SQL SELECT queries — it is a read-only interface:
| Operation | dplyr / dbplyr | DBI |
|---|---|---|
| Query / filter / summarize | ✓ | ✓ |
| Insert rows | ✗ | ✓ (dbAppendTable(), dbExecute()) |
| Update rows | ✗ | ✓ (dbExecute()) |
| Delete rows | ✗ | ✓ (dbExecute()) |
| Create / drop tables | ✗ | ✓ (dbCreateTable(), dbRemoveTable()) |
For anything beyond reading data, go through DBI directly.
The ability of dplyr to translate R expressions to SQL is an incredibly powerful tool, making your data processing workflows portable across a wide variety of data backends.
Some tools and ecosystems that are worth learning about:
Sta 323 - Spring 2026