Subsetting

Lecture 04

Dr. Colin Rundel

Matrices and Arrays

Matrices

R supports the creation of 2d matrix data structures using atomic vector types.

Generally these are formed via a call to matrix().

matrix(1:4, nrow=2, ncol=2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
matrix(c(TRUE, FALSE), 2, 2)
      [,1]  [,2]
[1,]  TRUE  TRUE
[2,] FALSE FALSE
matrix(LETTERS[1:6], 2)
     [,1] [,2] [,3]
[1,] "A"  "C"  "E" 
[2,] "B"  "D"  "F" 
matrix(6:1 / 2, ncol = 2)
     [,1] [,2]
[1,]  3.0  1.5
[2,]  2.5  1.0
[3,]  2.0  0.5

Data ordering

Matrices in R use column major ordering (data is stored by column).

(m = matrix(1:6, nrow=2, ncol=3))
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
c(m)
[1] 1 2 3 4 5 6
(n = matrix(1:6, nrow=3, ncol=2))
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
c(n)
[1] 1 2 3 4 5 6

We can populate a matrix by row, but the data is still stored by column.

(x = matrix(1:6, nrow=2, ncol=3, byrow = TRUE))
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
c(x)
[1] 1 4 2 5 3 6
(y = matrix(1:6, nrow=3, ncol=2, byrow=TRUE))
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
c(y)
[1] 1 3 5 2 4 6
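
Because storage is column major, a single (linear) index walks down the columns one after another. The following is a minimal sketch of that mapping; the helper name linear_to_rc is hypothetical and not part of base R.

```r
# Hypothetical helper (not in base R): convert a linear, column-major
# index k into a (row, column) position for a matrix with nr rows.
linear_to_rc = function(k, nr) {
  c(row = (k - 1) %% nr + 1, col = (k - 1) %/% nr + 1)
}

m = matrix(1:6, nrow = 2, ncol = 3)
pos = linear_to_rc(4, nrow(m))

m[4]                            # 4th stored value
m[pos[["row"]], pos[["col"]]]   # the same element, addressed as [2, 2]
```

Base R's arrayInd() performs a similar conversion for arrays of any dimension.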

Matrix structure

m = matrix(1:4, ncol=2, nrow=2)
typeof(m)
[1] "integer"
mode(m)
[1] "numeric"
class(m)
[1] "matrix" "array" 
attributes(m)
$dim
[1] 2 2

Matrices (and arrays) are just atomic vectors with a dim attribute attached. They do not have an explicit class attribute, but have implicit class(es).

n = letters[1:6]
dim(n) = c(2L, 3L)
n
     [,1] [,2] [,3]
[1,] "a"  "c"  "e" 
[2,] "b"  "d"  "f" 
o = letters[1:6]
attr(o,"dim") = c(2L, 3L)
o
     [,1] [,2] [,3]
[1,] "a"  "c"  "e" 
[2,] "b"  "d"  "f" 
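
The reverse direction works as well: removing the dim attribute turns a matrix back into a plain atomic vector. A small sketch of this, using only base R:

```r
m = matrix(1:6, nrow = 2)

v = m
dim(v) = NULL          # remove the dim attribute

is.matrix(m)           # TRUE
is.matrix(v)           # FALSE
identical(v, 1:6)      # TRUE, the underlying data never changed
```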

Demo - S3 w/ matrices

report = function(x) {
  UseMethod("report")
}
report.default = function(x) {
  # class(x) can have length > 1 (e.g. c("matrix", "array")), so collapse it
  paste0("Class ", paste(class(x), collapse = "/"), " not supported.")
}
report.double = function(x) {
  "I'm a double!"
}
report.numeric = function(x) {
  "I'm a numeric!"
}
report.matrix = function(x) {
  "I'm a matrix!"
}
report.array = function(x) {
  "I'm an array!"
}
#rm(report.double)
#rm(report.numeric)
#rm(report.matrix)
#rm(report.array)

report(matrix(1))
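
One way to explore the dispatch order outside of the demo is to add methods one at a time and see which one gets picked. The sketch below assumes R 4.0 or later (where a matrix also has the implicit class "array") and uses a hypothetical generic named describe so that it does not interfere with report above.

```r
# Hypothetical generic, used only for illustration.
describe = function(x) UseMethod("describe")
describe.default = function(x) "default"
describe.numeric = function(x) "numeric"

r1 = describe(matrix(1))   # no matrix / array / double method exists yet

describe.array = function(x) "array"
r2 = describe(matrix(1))   # array is tried before numeric

describe.matrix = function(x) "matrix"
r3 = describe(matrix(1))   # matrix is tried first

c(r1, r2, r3)              # "numeric" "array" "matrix"
```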

Arrays

Arrays are just an \(n\)-dimensional extension of matrices and are defined by adding the appropriate dimension sizes.

array(1:8, dim = c(2,2,2))
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8
array(letters[1:6], dim = c(1,2,3))
, , 1

     [,1] [,2]
[1,] "a"  "b" 

, , 2

     [,1] [,2]
[1,] "c"  "d" 

, , 3

     [,1] [,2]
[1,] "e"  "f" 

Arrays & class()

A 2d array has class c("matrix","array"), while a 1d array or an array with more than 2 dimensions only has class "array".

class(array(1, c(1,1)))
[1] "matrix" "array" 
class(array(1, c(1,1,1)))
[1] "array"
class(array(1, c(1)))
[1] "array"
class(array(1, c(1,1,1,1)))
[1] "array"

Data Frames

Data Frames

A data frame is how R handles heterogeneous tabular data (i.e. a table of rows and columns) and is one of the most commonly used data structures in R.

(df = data.frame(
  x = 1:3, 
  y = c("a", "b", "c"),
  z = c(TRUE)
))
  x y    z
1 1 a TRUE
2 2 b TRUE
3 3 c TRUE
str(df)
'data.frame':   3 obs. of  3 variables:
 $ x: int  1 2 3
 $ y: chr  "a" "b" "c"
 $ z: logi  TRUE TRUE TRUE

Data Frame Structure

R represents data frames using a list of equal-length vectors, one per column.

typeof(df)
[1] "list"
class(df)
[1] "data.frame"
attributes(df)
$names
[1] "x" "y" "z"

$class
[1] "data.frame"

$row.names
[1] 1 2 3
str(unclass(df))
List of 3
 $ x: int [1:3] 1 2 3
 $ y: chr [1:3] "a" "b" "c"
 $ z: logi [1:3] TRUE TRUE TRUE
 - attr(*, "row.names")= int [1:3] 1 2 3
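
Since a data frame is a list underneath, the usual list tools apply to it column by column. A short sketch using only base R functions:

```r
df = data.frame(x = 1:3, y = c("a", "b", "c"), z = TRUE)

is.list(df)          # TRUE
length(df)           # 3, the number of columns (not rows)
sapply(df, typeof)   # the storage type of each column
df[["x"]]            # a column is extracted like any other list element
```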

Build your own data.frame

df = list(x = 1:3, y = c("a", "b", "c"), z = c(TRUE, TRUE, TRUE))
attr(df,"class") = "data.frame"
df
[1] x y z
<0 rows> (or 0-length row.names)
attr(df,"row.names") = 1:3
df
  x y    z
1 1 a TRUE
2 2 b TRUE
3 3 c TRUE
str(df)
'data.frame':   3 obs. of  3 variables:
 $ x: int  1 2 3
 $ y: chr  "a" "b" "c"
 $ z: logi  TRUE TRUE TRUE
is.data.frame(df)
[1] TRUE
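
The same construction can be written in one step with structure(), which sets several attributes at once. This is only a sketch of the idea; in practice data.frame() should be preferred since it also validates its inputs.

```r
# Build the list and attach both attributes in a single call.
df2 = structure(
  list(x = 1:3, y = c("a", "b", "c"), z = c(TRUE, TRUE, TRUE)),
  class = "data.frame",
  row.names = 1:3
)

is.data.frame(df2)   # TRUE
dim(df2)             # 3 rows, 3 columns
```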

Strings (Characters) vs Factors

Prior to R 4.0, the default behavior of data frames was to convert character data into factors. Sometimes this was useful, but mostly it wasn’t.

This behavior is controlled via the stringsAsFactors argument to data.frame (and related functions like read.csv, read.table, etc.).

(df = data.frame(
  x = 1:3, y = c("a", "b", "c"), 
  stringsAsFactors = TRUE))
  x y
1 1 a
2 2 b
3 3 c
str(df)
'data.frame':   3 obs. of  2 variables:
 $ x: int  1 2 3
 $ y: Factor w/ 3 levels "a","b","c": 1 2 3
(df = data.frame(
  x = 1:3, y = c("a", "b", "c"), 
  stringsAsFactors = FALSE))
  x y
1 1 a
2 2 b
3 3 c
str(df)
'data.frame':   3 obs. of  2 variables:
 $ x: int  1 2 3
 $ y: chr  "a" "b" "c"

Length Coercion

When creating a data frame from vectors of different lengths, the shorter vectors are recycled to match the longest. However, if the longest length is not a whole multiple of a shorter one, the result is an error (the other forms of length coercion seen previously only produce a warning in this case).

data.frame(x = 1:3, y = c("a"))
  x y
1 1 a
2 2 a
3 3 a
data.frame(x = 1:3, y = c("a","b"))
Error in `data.frame()`:
! arguments imply differing number of rows: 3, 2
data.frame(x = 1:3, y = character())
Error in `data.frame()`:
! arguments imply differing number of rows: 3, 0
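
To make the contrast with ordinary vector recycling concrete, the sketch below runs the same length mismatch (3 vs. 2) through arithmetic and through data.frame(). The tryCatch() wrapper is only there so the example runs to completion.

```r
# Vector arithmetic recycles the shorter vector and only warns.
res = suppressWarnings(1:3 + 1:2)
res   # 2 4 4

# data.frame() refuses the same mismatch with an error.
err = tryCatch(
  data.frame(x = 1:3, y = c("a", "b")),
  error = function(e) conditionMessage(e)
)
err   # the error message as a character string
```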

Subsetting

Subsetting in R

R has three subsetting operators ([, [[, and $). The behavior of these operators depends on the object (class) they are being used with (S3).


In general there are 6 different types of subsetting that can be performed, based on the value passed to the operator:

  • Positive integer

  • Negative integer

  • Logical value

  • Empty / NULL

  • Zero valued

  • Character value (names)

Positive Integer subsetting

Returns elements at the given location(s)

x = c(1,4,7)
x[1]
[1] 1
x[c(1,3)]
[1] 1 7
x[c(1,1)]
[1] 1 1
x[c(1.9,2.1)]
[1] 1 4
y = list(1,4,7)
str( y[1] )
List of 1
 $ : num 1
str( y[c(1,3)] )
List of 2
 $ : num 1
 $ : num 7
str( y[c(1,1)] )
List of 2
 $ : num 1
 $ : num 1
str( y[c(1.9,2.1)] )
List of 2
 $ : num 1
 $ : num 4

Negative Integer subsetting

Excludes elements at the given location(s)

x = c(1,4,7)
x[-1]
[1] 4 7
x[-c(1,3)]
[1] 4
x[c(-1,-1)]
[1] 4 7
y = list(1,4,7)
str( y[-1] )
List of 2
 $ : num 4
 $ : num 7
str( y[-c(1,3)] )
List of 1
 $ : num 4
x[c(-1,2)]
Error in `x[c(-1, 2)]`:
! only 0's may be mixed with negative subscripts
y[c(-1,2)]
Error in `y[c(-1, 2)]`:
! only 0's may be mixed with negative subscripts

Logical Value Subsetting

Returns the elements that correspond to TRUE in the logical vector. If the logical vector is shorter than the vector being subset, it is recycled to the same length.

x = c(1,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
[1]  1  4 12
x[c(TRUE,FALSE)]
[1] 1 7
y = list(1,4,7,12)
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
List of 3
 $ : num 1
 $ : num 4
 $ : num 12
str( y[c(TRUE,FALSE)] )
List of 2
 $ : num 1
 $ : num 7
x[x %% 2 == 0]
[1]  4 12
str( y[y %% 2 == 0] )
Error in `y %% 2`:
! non-numeric argument to binary operator
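
The last call fails because %% is not defined for lists. One possible workaround, sketched here, is to build the logical vector element by element and then subset as usual; the name is_even is just illustrative.

```r
y = list(1, 4, 7, 12)

# Apply the test to each element, returning exactly one logical per element.
is_even = vapply(y, function(v) v %% 2 == 0, logical(1))
is_even            # FALSE TRUE FALSE TRUE

str( y[is_even] )  # a list holding 4 and 12
```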

Empty Subsetting

Returns the original vector. This is not the same as subsetting with NULL.

x = c(1,4,7)
x[]
[1] 1 4 7
x[NULL]
numeric(0)
y = list(1,4,7)
str(y[])
List of 3
 $ : num 1
 $ : num 4
 $ : num 7
str(y[NULL])
 list()

Zero subsetting

Returns an empty vector (of the same type). This is the same as subsetting with NULL.

x = c(1,4,7)
x[0]
numeric(0)
y = list(1,4,7)
str(y[0])
 list()

0s can be mixed with either positive or negative integers for subsetting, and are ignored in both cases.

x[c(0,1)]
[1] 1
y[c(0,1)]
[[1]]
[1] 1
x[c(0,-1)]
[1] 4 7
y[c(0,-1)]
[[1]]
[1] 4

[[2]]
[1] 7

Character subsetting

If the vector has names, selects the first element whose name matches each of the given values.

x = c(a=1, b=4, c=7)
x["a"]
a 
1 
x[c("a","a")]
a a 
1 1 
x[c("b","c")]
b c 
4 7 
y = list(a=1,b=4,c=7)
str(y["a"])
List of 1
 $ a: num 1
str(y[c("a","a")])
List of 2
 $ a: num 1
 $ a: num 1
str(y[c("b","c")])
List of 2
 $ b: num 4
 $ c: num 7

Out of bounds

x = c(1,4,7)
x[4]
[1] NA
x[-4]
[1] 1 4 7
x["a"]
[1] NA
x[c(1,4)]
[1]  1 NA
y = list(1,4,7)
str(y[4])
List of 1
 $ : NULL
str(y[-4])
List of 3
 $ : num 1
 $ : num 4
 $ : num 7
str(y["a"])
List of 1
 $ : NULL
str(y[c(1,4)])
List of 2
 $ : num 1
 $ : NULL

Missing values

x = c(1,4,7)
x[NA]
[1] NA NA NA
x[c(1,NA)]
[1]  1 NA
y = list(1,4,7)
str(y[NA])
List of 3
 $ : NULL
 $ : NULL
 $ : NULL
str(y[c(1,NA)])
List of 2
 $ : num 1
 $ : NULL
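
The difference between x[NA] and x[c(1,NA)] comes down to the type of the index: a bare NA is logical and so gets recycled, while c(1,NA) is numeric and is treated as positions. A small sketch that makes the types explicit:

```r
x = c(1, 4, 7)

typeof(NA)               # "logical", so x[NA] is logical subsetting
length(x[NA])            # 3, the single NA is recycled to length(x)

typeof(NA_integer_)      # "integer", so this is positional subsetting
length(x[NA_integer_])   # 1
```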

NULL and empty vectors (length 0)

This final type of subsetting follows the rules for length coercion with a 0-length vector: if the subsetting vector has length 0, the result also has length 0, regardless of the length of the vector being subset.

x = c(1,4,7)
x[NULL]
numeric(0)
x[integer()]
numeric(0)
x[character()]
numeric(0)
y = list(1,4,7)
y[NULL]
list()
y[integer()]
list()
y[character()]
list()

Subsetting and assignment

Subsets can also be used with assignment to update specific values within an object (in-place).

x = c(1, 4, 7, 9, 10, 15)
x[2] = 2
x
[1]  1  2  7  9 10 15
x %% 2 != 0
[1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE
x[x %% 2 != 0] = (x[x %% 2 != 0] + 1) / 2
x
[1]  1  2  4  5 10  8

More assignment

x[c(1,1)] = c(2,3)
x
[1]  3  2  4  5 10  8
x = 1:6
x[c(2,NA)] = 1
x
[1] 1 1 3 4 5 6
x = 1:6
x[c(-1,-2)] = 3
x
[1] 1 2 3 3 3 3
x = 1:6
x[c(TRUE,NA)] = 1
x
[1] 1 2 1 4 1 6
x = 1:6
x[] = 1:3
x
[1] 1 2 3 1 2 3

The other subset operators
[[ and $

Atomic vectors - [ vs. [[

[[ subsets like [ except that it can only select a single value, and it drops the name of that value.

x = c(a=1,b=4,c=7)
x[1]
a 
1 
x[[1]]
[1] 1
x[["a"]]
[1] 1
x[[1:2]]
Error in `x[[1:2]]`:
! attempt to select more than one element in vectorIndex
x[[TRUE]]
[1] 1

Generic Vectors (lists) - [ vs. [[

With lists, [[ selects a single element and returns the element itself, not a list containing it. A vector of multiple values is interpreted as nested subsetting, e.g. y[[2:1]] is treated as y[[2]][[1]].

y = list(a=1, b=4, c=7:9)
y[2]
$b
[1] 4
str( y[2] )
List of 1
 $ b: num 4
y[[2]]
[1] 4
y[["b"]]
[1] 4
y[[1:2]]
Error in `y[[1:2]]`:
! subscript out of bounds
y[[2:1]]
[1] 4

Hadley’s Analogy (1)

Hadley’s Analogy (2)

[[ vs. $

$ is equivalent to [[ but it only works for name-based subsetting of lists. It also uses partial matching for names.

x = c("abc"=1, "def"=5)
x$abc
Error in `x$abc`:
! $ operator is invalid for atomic vectors
y = list("abc"=1, "def"=5)
y[["abc"]]
[1] 1
y$abc
[1] 1
y$d
[1] 5
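
Partial matching is the one place where $ and [[ disagree by default. The sketch below shows the difference, along with the exact argument of [[ that opts in to partial matching:

```r
y = list("abc" = 1, "def" = 5)

y$d                       # 5, $ partially matches "def"
y[["d"]]                  # NULL, [[ requires an exact match by default
y[["d", exact = FALSE]]   # 5, partial matching on request
```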

A common error

Why does the following code not work?

x = list(abc = 1:10, def = 10:1)
y = "abc"
x[[y]]
 [1]  1  2  3  4  5  6  7  8  9 10
x$y
NULL

The expression x$y gets interpreted as x[["y"]] by R (note the inclusion of the quotes). This is not the same as the expression x[[y]], which looks up the value stored in y.

Subsetting Data Frames

Subsetting rows

As data frames have 2 dimensions, we can subset on either the rows or the columns - the subsetting values are separated by a comma.

(df = data.frame(x = 1:3, y = c("A","B","C"), z = TRUE))
  x y    z
1 1 A TRUE
2 2 B TRUE
3 3 C TRUE
df[1, ]
  x y    z
1 1 A TRUE
str( df[1, ] )
'data.frame':   1 obs. of  3 variables:
 $ x: int 1
 $ y: chr "A"
 $ z: logi TRUE
df[c(1,3), ]
  x y    z
1 1 A TRUE
3 3 C TRUE
str( df[c(1,3), ] )
'data.frame':   2 obs. of  3 variables:
 $ x: int  1 3
 $ y: chr  "A" "C"
 $ z: logi  TRUE TRUE

Subsetting Columns

df
  x y    z
1 1 A TRUE
2 2 B TRUE
3 3 C TRUE
df[, 1]
[1] 1 2 3
str( df[, 1] )
 int [1:3] 1 2 3
df[, 1:2]
  x y
1 1 A
2 2 B
3 3 C
str( df[, 1:2] )
'data.frame':   3 obs. of  2 variables:
 $ x: int  1 2 3
 $ y: chr  "A" "B" "C"
df[, -3]
  x y
1 1 A
2 2 B
3 3 C
str( df[, -3] )
'data.frame':   3 obs. of  2 variables:
 $ x: int  1 2 3
 $ y: chr  "A" "B" "C"

Subsetting both

df
  x y    z
1 1 A TRUE
2 2 B TRUE
3 3 C TRUE
df[1, 1]
[1] 1
str( df[1, 1] )
 int 1
df[1:2, 1:2]
  x y
1 1 A
2 2 B
str( df[1:2, 1:2] )
'data.frame':   2 obs. of  2 variables:
 $ x: int  1 2
 $ y: chr  "A" "B"
df[-1, 2:3]
  y    z
2 B TRUE
3 C TRUE
str( df[-1, 2:3] )
'data.frame':   2 obs. of  2 variables:
 $ y: chr  "B" "C"
 $ z: logi  TRUE TRUE
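
The row and column positions accept any of the subsetting types covered earlier, not just integers. As a sketch, the example below uses a logical condition for the rows and names for the columns:

```r
df = data.frame(x = 1:3, y = c("A", "B", "C"), z = TRUE)

# Logical row subset: keep the rows where x is greater than 1.
df[df$x > 1, ]

# Logical rows combined with character (name) based columns.
sub = df[df$y %in% c("A", "C"), c("x", "y")]
sub
```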

Preserving vs Simplifying

Most of the time, R’s [ subset operator is a preserving operator, in that the returned object has the same type/class as the object being subset.

Confusingly, when used with some classes (e.g. data frame, matrix or array) [ becomes a simplifying operator (it does not preserve type). This behavior is controlled by the drop argument.
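
The same simplification applies to matrices. A minimal sketch of the default behavior and of drop=FALSE for a single-row subset:

```r
m = matrix(1:6, nrow = 2)

m[1, ]                      # simplified to a plain integer vector
dim(m[1, ])                 # NULL, the dim attribute is gone

m[1, , drop = FALSE]        # still a matrix
dim(m[1, , drop = FALSE])   # 1 3
```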

Drop w/ row subset

df[1, ]
  x y    z
1 1 A TRUE
str(df[1, ])
'data.frame':   1 obs. of  3 variables:
 $ x: int 1
 $ y: chr "A"
 $ z: logi TRUE
df[1, , drop=TRUE]
$x
[1] 1

$y
[1] "A"

$z
[1] TRUE
str(df[1, , drop=TRUE])
List of 3
 $ x: int 1
 $ y: chr "A"
 $ z: logi TRUE

Drop w/ column subset

df[, 1]
[1] 1 2 3
str(df[, 1])
 int [1:3] 1 2 3
df[, 1, drop=FALSE]
  x
1 1
2 2
3 3
str(df[, 1, drop=FALSE])
'data.frame':   3 obs. of  1 variable:
 $ x: int  1 2 3

Exceptions

drop=TRUE only has an effect when the resulting value can be represented as a 1d vector (either a list or an atomic vector).

df[1:2, 1:2]
  x y
1 1 A
2 2 B
str(df[1:2, 1:2])
'data.frame':   2 obs. of  2 variables:
 $ x: int  1 2
 $ y: chr  "A" "B"
df[1:2, 1:2, drop=TRUE]
  x y
1 1 A
2 2 B
str(df[1:2, 1:2, drop=TRUE])
'data.frame':   2 obs. of  2 variables:
 $ x: int  1 2
 $ y: chr  "A" "B"

Preserving vs Simplifying Subsets


Type            | Simplifying         | Preserving
Atomic Vector   | x[[1]]              | x[1]
List            | x[[1]]              | x[1]
Matrix / Array  | x[[1]]              | x[1, , drop=FALSE]
                | x[1, ]              | x[, 1, drop=FALSE]
                | x[, 1]              |
Factor          | x[1:4, drop=TRUE]   | x[1:4]
                |                     | x[[1]]
Data frame      | x[, 1]              | x[, 1, drop=FALSE]
                | x[[1]]              | x[1]
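
For factors, "simplifying" refers to the levels rather than the dimensions: drop=TRUE removes levels that no longer appear in the result. A short sketch:

```r
f = factor(c("a", "b", "c"))

levels(f[1:2])                # "a" "b" "c", the unused level is kept
levels(f[1:2, drop = TRUE])   # "a" "b", the unused level is dropped
```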