Testing with testthat

Lecture 06

Dr. Colin Rundel

Package checking

R CMD check - What it does

R CMD check is CRAN’s comprehensive quality control system that runs dozens of checks:

  • Package structure - Correct directories, required files (DESCRIPTION, NAMESPACE)
  • Code syntax - All R code parses without errors
  • Documentation - All functions documented, all examples run
  • Dependencies - All used packages exist and are declared in DESCRIPTION
  • Tests - All test files execute without errors
  • CRAN policy compliance - Follows all submission guidelines (written and unwritten)

devtools::check() vs R CMD check

devtools::check() is a convenient wrapper around R CMD check:

devtools::check()

# which is roughly equivalent to:
R CMD build .
R CMD check packagename_1.0.0.tar.gz


Benefits of devtools::check():

  • Automatically handles building and checking
  • Better integrated with RStudio workflow
  • Cleaner output formatting
  • Automatically installs package first

Interpreting check output

R CMD check produces three levels of issues:

  • ERROR 🔴 - Must be fixed before CRAN submission
  • WARNING 🟡 - Should be fixed (CRAN may reject)
  • NOTE 📝 - Optional improvements (CRAN usually accepts)

Errors must be fixed before submission; warnings should be resolved as well, and any remaining notes need to be explained as part of the CRAN submission process.

While the check is running, issues are shown inline and are summarized at the end.

GitHub Actions for continuous checking

Set up automated checking with:

usethis::use_github_action("check-standard")

This creates .github/workflows/R-CMD-check.yaml that runs checks on:

  • Latest R on macOS, Windows, Linux (Ubuntu)
  • Previous R and R-devel on Linux (Ubuntu)

Package testing

Basic test structure

Package tests live in tests/:

  • Any R scripts found in the folder will be run when checking

  • Generally tests “fail” if an error is thrown; warnings are also tracked

  • Testing with base R alone is possible but not recommended (see Writing R Extensions)

  • There is support for comparing test output against saved expected results (.Rout.save files), but it is limited

  • Note that R CMD check also runs all documentation examples (unless explicitly wrapped in \dontrun{})
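As a minimal sketch of this base-R approach (no testthat), a script in tests/ might look like the following; the file name is hypothetical:

```r
# tests/test-basics.R (hypothetical file name)
# R CMD check runs every R script found in tests/ -- a "test" fails
# if any error is thrown while the script executes.

x = c(1, 2, 3, 4, 5)

stopifnot(
  mean(x) == 3,
  length(x) == 5,
  is.numeric(x)
)
```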

testthat

testthat fundamentals

testthat is the most widely used testing framework for R packages; it also has excellent integration with RStudio & usethis.

A project can be initialized to use testthat via:

usethis::use_testthat()


This creates the following files and directories:

  • tests/testthat.R - Entry point for R CMD check
  • tests/testthat/ - Directory for test files
  • Adds testthat to DESCRIPTION’s Suggests field

testthat project structure

mypackage/
├── R/
│   └── utils.R
├── tests/
│   ├── testthat.R 
│   └── testthat/
│       └── test-utils.R
└── DESCRIPTION

Test file naming:

  • Must start with test- or test_

  • Typically a test file test-utils.R maps to the script R/utils.R (use usethis::use_test() to create a test file for the currently open script)

  • Can also group related functions: test-data-processing.R

  • helper*.R, teardown*.R, and setup*.R all have special behavior - see Special files

  • All other files are ignored

testthat script structure

Tests are hierarchically organized:

  • File - Collection of related tests

  • Test - Group of related expectations (test_that())

  • Expectation - Single assertion (expect_equal(), expect_error(), etc.)

test_that("`+` works correctly", {
  expect_equal(`+`(2, 3), 5)
  expect_equal(`+`(0, 0), 0)
  expect_type(`+`(1, 1), "double")
  expect_type(`+`(1L, 1L), "integer")
})
Test passed with 4 successes 🥇.

Running tests

There are multiple ways to execute your package’s tests:


During development:

  • devtools::test() - Run all tests
  • devtools::test_file("tests/testthat/test-utils.R") - Run one file
  • Ctrl/Cmd+Shift+T (RStudio) - Run all tests
  • Ctrl/Cmd+T (RStudio) - Run tests for current file

From command line:

  • R CMD check - Runs tests as part of package check
  • Rscript -e "devtools::test()" - In scripts / GitHub Actions

Core expectation functions

testthat provides many expectation functions for different scenarios.

Equality and identity:

expect_equal(actual, expected)
expect_identical(actual, expected)
expect_true(x)
expect_false(x)

Types and classes:

expect_type(x, "double")
expect_s3_class(df, "data.frame")

Conditions:

expect_error(code, regexp = "...")
expect_warning(code, regexp = "...")
expect_message(code, regexp = "...")
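A small worked example combining these expectations; parse_scores() is a hypothetical helper defined here purely for illustration:

```r
library(testthat)

# Hypothetical helper: converts character scores to numeric,
# erroring on bad input types and warning on empty strings
parse_scores = function(x) {
  if (!is.character(x)) stop("x must be a character vector")
  if (any(x == "")) warning("empty strings coerced to NA")
  as.numeric(ifelse(x == "", NA, x))
}

test_that("parse_scores behaves as documented", {
  expect_equal(parse_scores(c("1", "2.5")), c(1, 2.5))
  expect_type(parse_scores("3"), "double")
  expect_error(parse_scores(1:3), "must be a character vector")
  expect_warning(parse_scores(c("1", "")), "coerced to NA")
})
```

Note that the regexp arguments check that the right condition was signaled, not just that some condition occurred.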

expect_equal() vs expect_identical()

expect_equal() compares values within a numeric tolerance (not the same as ==), while expect_identical() checks for exact equivalence.

test_that("equality", {
  expect_equal(0.2 + 0.2, 0.4)
  expect_equal(0.1 + 0.2, 0.3)
  expect_equal(1L, 1.0)
})
Test passed with 3 successes 🎊.
test_that("identity", {
  expect_identical(0.2 + 0.2, 0.4)
  expect_identical(0.1 + 0.2, 0.3)
  expect_identical(1L, 1.0)
})
── Failure: identity ─────────────────────────────
Expected `0.1 + 0.2` to be identical to 0.3.
Differences:
Objects equal but not identical
── Failure: identity ─────────────────────────────
Expected 1L to be identical to 1.
Differences:
Objects equal but not identical
Error:
! Test failed with 2 failures and 1
  success.
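The tolerance used by expect_equal() can also be adjusted explicitly (a small sketch; assumes testthat 3rd edition, where the default tolerance is sqrt(.Machine$double.eps), roughly 1.5e-8):

```r
library(testthat)

test_that("tolerance is adjustable", {
  # Passes: the difference is ~5.6e-17, well within the default tolerance
  expect_equal(0.1 + 0.2, 0.3)

  # A looser, explicit tolerance for deliberately approximate comparisons
  expect_equal(pi, 3.14, tolerance = 0.01)
})
```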

Testing function outputs

calculate_mean_ci = function(x, conf_level = 0.95) {
  if (length(x) == 0) 
    stop("Cannot calculate CI for empty vector")
  if (any(is.na(x))) 
    stop("Missing values not allowed") 
  
  n = length(x)
  mean_x = mean(x)
  se = sd(x) / sqrt(n)
  t_val = qt((1 + conf_level) / 2, df = n - 1)
  
  c(lower = mean_x - t_val * se, upper = mean_x + t_val * se)
}

Example tests

test_that("calculate_mean_ci works correctly", {
  # Test normal case
  result = calculate_mean_ci(c(1, 2, 3, 4, 5))
  expect_type(result, "double")
  expect_length(result, 2)
  expect_named(result, c("lower", "upper"))
  expect_true(result["lower"] < result["upper"])
  
  # Test with known values
  expect_equal(
    calculate_mean_ci(c(0, 0, 0)), 
    c(lower = 0, upper = 0)
  )
  
  # Test confidence level parameter  
  ci_95 = calculate_mean_ci(c(1, 2, 3), conf_level = 0.95)
  ci_99 = calculate_mean_ci(c(1, 2, 3), conf_level = 0.99)
  expect_true(ci_99["upper"] - ci_99["lower"] > ci_95["upper"] - ci_95["lower"])
})
Test passed with 6 successes 😀.

Testing error conditions

test_that("calculate_mean_ci error cases", {
  # Empty vector should error
  expect_error(
    calculate_mean_ci(numeric(0)), 
    "Cannot calculate CI for empty vector"
  )

  # Missing values should error
  expect_error(
    calculate_mean_ci(c(1, 2, NA)), 
    "Missing values not allowed"
  )
  
  # Invalid confidence level should error
  expect_error(
    calculate_mean_ci(1:5, conf_level = 1.5), 
    "conf_level must be between 0 and 1"
  )
  
  # Single value (edge case to think about)
  expect_error(
    calculate_mean_ci(5)
  )
})
── Warning: calculate_mean_ci error cases ────────
NaNs produced
Backtrace:
    ▆
 1. ├─testthat::expect_error(...)
 2. │ └─testthat:::quasi_capture(...)
 3. │   ├─testthat (local) .capture(...)
 4. │   │ └─base::withCallingHandlers(...)
 5. │   └─rlang::eval_bare(quo_get_expr(.quo), quo_get_env(.quo))
 6. └─global calculate_mean_ci(1:5, conf_level = 1.5)
 7.   └─stats::qt((1 + conf_level)/2, df = n - 1)
── Failure: calculate_mean_ci error cases ────────
Expected `calculate_mean_ci(1:5, conf_level = 1.5)` to throw an error.
── Warning: calculate_mean_ci error cases ────────
NaNs produced
Backtrace:
    ▆
 1. ├─testthat::expect_error(calculate_mean_ci(5))
 2. │ └─testthat:::quasi_capture(...)
 3. │   ├─testthat (local) .capture(...)
 4. │   │ └─base::withCallingHandlers(...)
 5. │   └─rlang::eval_bare(quo_get_expr(.quo), quo_get_env(.quo))
 6. └─global calculate_mean_ci(5)
 7.   └─stats::qt((1 + conf_level)/2, df = n - 1)
── Failure: calculate_mean_ci error cases ────────
Expected `calculate_mean_ci(5)` to throw an error.
Error:
! Test failed with 2 failures and 2
  successes.
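The two failures above point at missing input validation; one possible revision is sketched below (the exact messages and the minimum-length rule are design choices, not requirements from the original):

```r
calculate_mean_ci = function(x, conf_level = 0.95) {
  if (length(x) == 0)
    stop("Cannot calculate CI for empty vector")
  if (any(is.na(x)))
    stop("Missing values not allowed")
  # New validation suggested by the failing tests:
  if (conf_level <= 0 || conf_level >= 1)
    stop("conf_level must be between 0 and 1")
  if (length(x) < 2)
    stop("Need at least two values to compute a CI")

  n = length(x)
  mean_x = mean(x)
  se = sd(x) / sqrt(n)
  t_val = qt((1 + conf_level) / 2, df = n - 1)

  c(lower = mean_x - t_val * se, upper = mean_x + t_val * se)
}
```

With these checks in place all four expectations in the test above pass, and the NaN warnings from qt() disappear.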

Testing for errors

Testing for errors is important, but expect_error() can be misleading if you don’t check the message. All that the bare expectation tells you is that some error was thrown, not that it was the right error.

calculate_discount = function(price, discount_percent) {
  if (price < 0) stop("Price cannot be negative")
  if (discount_percent > 100) stop("Discount cannot exceed 100%")
  
  price * (1 - discount_pct / 100)  # Bug: wrong variable name
}

test_that("demonstrates why checking error messages matters", {
  # ✗ passes but for the wrong reason!
  expect_error(calculate_discount(100, -50))
  # ✓ This correctly tests the price validation
  expect_error(calculate_discount(-50, 10), "Price cannot be negative")
})
Test passed with 2 successes 🎉.

calculate_discount(100, -50)
Error in `calculate_discount()`:
! object 'discount_pct' not found
calculate_discount(-50, 10)
Error in `calculate_discount()`:
! Price cannot be negative

In this case the issue would likely be caught by other tests:

test_that("Calculation test", {
  expect_equal(calculate_discount(100, 20), 80)
})
── Error: Calculation test ───────────────────────
Error in `calculate_discount(100, 20)`: object 'discount_pct' not found
Backtrace:
    ▆
 1. ├─testthat::expect_equal(calculate_discount(100, 20), 80)
 2. │ └─testthat::quasi_label(enquo(object), label)
 3. │   └─rlang::eval_bare(expr, quo_get_env(quo))
 4. └─global calculate_discount(100, 20)
Error:
! Test failed with 1 failure and 0
  successes.
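Fixing the bug (and adding the negative-discount check that the first expect_error() was silently relying on) might look like the following sketch:

```r
calculate_discount = function(price, discount_percent) {
  if (price < 0) stop("Price cannot be negative")
  if (discount_percent < 0) stop("Discount cannot be negative")
  if (discount_percent > 100) stop("Discount cannot exceed 100%")

  price * (1 - discount_percent / 100)  # Fixed: name matches the argument
}
```

Now expect_error(calculate_discount(100, -50), "Discount cannot be negative") passes for the right reason.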

Skipping tests

Skip tests when certain conditions aren’t met:

test_that("database connection works", {
  skip_if_not_installed("RPostgreSQL")
  skip_if(Sys.getenv("TEST_DB_URL") == "", "Database URL not set")
  skip_on_cran()
  skip_on_ci()
  
  # Database tests ...
})

test_that("internet-dependent test", {
  skip_if_offline()
  
  # Test that requires internet connection
  result = download_data("https://example.com/api")
  expect_type(result, "list")
})

Snapshot tests

Snapshot tests capture the output of your functions and compare against previously saved results:

  • First run: Snapshot is created and saved
  • Subsequent runs: Current output compared against saved snapshot
  • When output changes: Test fails, you review and accept/reject the change

Snapshot tests are best for:

  • Error messages and warnings
  • Complex data structure outputs
  • Printed output from functions
  • Any output where exact specification is difficult

expect_snapshot() for output

Test printed output and messages:

print_summary = function(data) {
  cat("Data summary:\n")
  cat("Rows:", nrow(data), "\n")
  cat("Columns:", ncol(data), "\n")
  cat("Column names:", paste(names(data), collapse = ", "), "\n")
}

test_that("print_summary produces consistent output", {
  df = data.frame(x = 1:3, y = letters[1:3])
  
  expect_snapshot({
    print_summary(df)
  })
})

Snapshot output

Creates tests/testthat/_snaps/print-summary.md:

# print_summary produces consistent output

    Code
      print_summary(df)
    Output
      Data summary:
      Rows: 3 
      Columns: 2 
      Column names: x, y 

Managing snapshots

Accepting changes:

snapshot_accept()
snapshot_accept("test-myfunction.R")

Reviewing changes:

snapshot_review()
snapshot_review("test-myfunction.R")


Some best practices:

  • Review snapshot changes carefully in code review
  • Don’t commit snapshot updates without understanding why they changed
  • Use descriptive test names for easier snapshot identification

Why testing matters

Testing is a fundamental part of creating reliable, maintainable R packages (and code in general):

  • Catch bugs early - Find problems before they reach users
  • Document behavior - Tests serve as executable specifications
  • Prevent regressions - Ensure new changes don’t break existing functionality
  • Enable refactoring - Change implementation with confidence

Testing as documentation

Well-written tests serve multiple purposes:

test_that("mean() behaves as expected", {
  # Basic calculations
  expect_equal(mean(c(1, 2, 3)), 2)
  expect_equal(mean(c(3, 2, 1)), 2)
  
  # Missing values
  expect_true(is.na(mean(c(1, 2, NA))))
  expect_equal(mean(c(1, 2, NA), na.rm = TRUE), 1.5)
  result = mean(numeric(0))
  expect_true(is.na(result))
  expect_true(is.nan(result))
})

Tests make your intentions clear to future maintainers / contributors (including yourself!)

Code coverage

Code coverage measures the percentage of your code that is executed during testing.

  • Useful as a rough indicator of how well tested your code is
  • The covr package provides coverage tooling for R packages
  • Most CI services (e.g. GitHub Actions) can track coverage over time

Coverage has important limitations:

  • Measures execution, not correctness
  • Does not measure the code in packages your code depends on
  • 100% coverage does not mean bug-free code
  • Incentivizes writing tests that touch lines rather than tests that verify behavior
  • Edge cases and input validation can be missed even at high coverage

Test-Driven Development

The TDD cycle: Red-Green-Refactor

Test-Driven Development follows a simple cycle:

  1. 🔴 Red: Write a failing test for the functionality you want to implement
  2. 🟢 Green: Write the minimal code to make the test pass
  3. 🔵 Refactor: Clean up the code while keeping tests green
  4. Repeat: Move on to the next piece of functionality

This approach ensures:

  • You only write code that’s actually needed

  • Every line of code is covered by tests

  • Your design is driven by actual usage

TDD example

Let’s implement a is_palindrome() function using TDD:

Step 1 - Write the test(s) first

test_that("is_palindrome works correctly", {
  expect_true(is_palindrome(c(1, 2, 3, 2, 1)))
  expect_true(is_palindrome(c("a", "b", "a")))
  expect_false(is_palindrome(c(1, 2, 3)))
  expect_true(is_palindrome(c(5)))  # Single element
  expect_true(is_palindrome(numeric(0)))  # Empty vector
})
── Error: is_palindrome works correctly ──────────
Error in `is_palindrome(c(1, 2, 3, 2, 1))`: could not find function "is_palindrome"
Backtrace:
    ▆
 1. └─testthat::expect_true(is_palindrome(c(1, 2, 3, 2, 1)))
 2.   └─testthat::quasi_label(enquo(object), label)
 3.     └─rlang::eval_bare(expr, quo_get_env(quo))
Error:
! Test failed with 1 failure and 0
  successes.

Step 2

Write minimal code to pass:

is_palindrome = function(x) {
  all(x == rev(x))
}

Which we then check with our existing tests:

test_that("is_palindrome works correctly", {
  expect_true(is_palindrome(c(1, 2, 3, 2, 1)))
  expect_true(is_palindrome(c("a", "b", "a")))
  expect_false(is_palindrome(c(1, 2, 3)))
  expect_true(is_palindrome(c(5)))  # Single element
  expect_true(is_palindrome(numeric(0)))  # Empty vector
})
Test passed with 5 successes 🎉.

Step 3: Refactor

We can consider a slightly improved implementation:

is_palindrome = function(x) {
  identical(x, rev(x))
}

Which we again verify with the tests:

test_that("is_palindrome works correctly", {
  expect_true(is_palindrome(c(1, 2, 3, 2, 1)))
  expect_true(is_palindrome(c("a", "b", "a")))
  expect_false(is_palindrome(c(1, 2, 3)))
  expect_true(is_palindrome(c(5)))  # Single element
  expect_true(is_palindrome(numeric(0)))  # Empty vector
})
Test passed with 5 successes 😸.
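One concrete reason to prefer identical() here, beyond our existing tests (a side-by-side sketch): the == version propagates missing values, while identical() handles them cleanly.

```r
v1 = function(x) all(x == rev(x))      # original implementation
v2 = function(x) identical(x, rev(x))  # refactored implementation

x = c(1, NA, 1)
v1(x)  # NA -- all() propagates the missing comparison
v2(x)  # TRUE -- identical() treats matching NAs as equal
```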

Step 4: Repeat

We can consider additional functionality, such as input validation by expanding our tests:

test_that("is_palindrome errors for non-atomic input", {
  expect_error(is_palindrome(list(1, 2, 1)))
})
── Failure: is_palindrome errors for non-atomic input ──
Expected `is_palindrome(list(1, 2, 1))` to throw an error.
Error:
! Test failed with 1 failure and 0
  successes.
We then update the implementation to validate its input:

is_palindrome = function(x) {
  stopifnot("Input must be an atomic vector" = is.atomic(x))
  identical(x, rev(x))
}
test_that("is_palindrome errors for non-atomic input", {
  expect_error(is_palindrome(list(1, 2, 1)))
})
Test passed with 1 success 🎉.

TDD in the real world

In practice, TDD may not be followed strictly, but the principles remain valuable:

  • Tests should guide your design and implementation

  • Tests should not be an afterthought once your code is “done”

  • Refactoring is easier and safer with a solid test suite

  • Writing tests after the fact can lead to missed edge cases and faulty assumptions

Why Packages?

Benefits of packages

Organizing your projects as a package provides many advantages:

  • Benefit from the existing infrastructure for package development

  • Easier to share and distribute your code (dependencies, installation, documentation, etc.)

  • Easier to bundle and document data sets

  • Better support for testing and documentation

  • Tends to lead to better organized, modular code and overall better design

Packages and LLMs

We will go into this more on Thursday, but packages are also a great way to structure your code to work with LLMs:

  • Prescribed structure makes it easier for LLMs to understand your codebase

  • Better context management

  • Better grounding and easier iteration through tests and checks