The data.table library is an R package that provides an enhanced version of R’s builtin data.frame. There are many available resources that introduce data.table’s syntax and provide examples of typical usage scenarios. In this post, we’ll cover one aspect of the data.table API: Special symbols.
Before proceeding, we should highlight data.table’s general syntactical form, that when understood, can be used to translate data.table expressions of arbitrary complexity into 3 steps:
DT[i, j, by]
Which translates to:
Take DT, subset rows using i, then calculate j grouped by by
The following annotated diagram is included as part of data.table’s documentation:
DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
For those familiar with SQL, the three expressions that describe the general form can be thought of as components of a SQL query: “SELECT from j, WHERE i, GROUP BY by”. Although the SQL analog of the original interpretation may not hold for every possible data.table operation, it remains a valid proxy for many scenarios encountered in practice. Although this applies to data.table in general, a thorough understanding of the general form will be especially useful when exploring ways to leverage special symbols in your own code.
The data.table special symbols covered here are defined as follows:
.N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in advance and for convenience generally. When grouping by i, .N is the number of rows in x matched to, for each row of i, regardless of whether nomatch is NA or 0. It is renamed to N (no dot) in the result (otherwise a column called ‘.N’ could conflict with the .N variable, see FAQ 4.6 for more details and example), unless it is explicitly named; e.g.,
DT[,list(total=.N),by=a]
..SD is a data.table containing the Subset of x’s data for each group, excluding any columns used in by (or keyby).
.BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance. The by variables are also available to j directly by name; useful for example for titles of graphs if j is a plot command, or to branch with
if()
depending on the value of a group variable..I is an integer vector equal to
seq_len(nrow(x))
. While grouping, it holds for each item in the group, it’s row location in x. This is useful to subset in j; e.g.DT[, .I[which.max(somecol)], by=grp]
..GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc.
.N
The special symbol .N can be used to return the last row or number of rows of a data.table. Referring to the R sample dataset CO2, what follows are a few examples using .N:
> library("data.table")
> DT = CO2; setDT(DT)
> head(DT)
Plant Type Treatment conc uptake1: Qn1 Quebec nonchilled 95 16.0
2: Qn1 Quebec nonchilled 175 30.4
3: Qn1 Quebec nonchilled 250 34.8
4: Qn1 Quebec nonchilled 350 37.2
5: Qn1 Quebec nonchilled 500 35.3
6: Qn1 Quebec nonchilled 675 39.2
### Return last record of CO2 dataset.
> DT[.N]
Plant Type Treatment conc uptake1: Mc3 Mississippi chilled 1000 19.9
### Return the second-to-last record of CO2 dataset.
> DT[.N-1]
Plant Type Treatment conc uptake1: Mc3 Mississippi chilled 675 18.9
### Return the number of records in CO2 dataset.
> DT[,.N]
1] 84
[
### Return the number of records for each unique value of Treatment.
> DT[, .N, by=Treatment]
Treatment N1: nonchilled 42
2: chilled 42
### Return the number of records for each unique combination of Treatment` & Plant.
> DT[, .N, keyby="Treatment,Plant"]
Treatment Plant N1: nonchilled Qn1 7
2: nonchilled Qn2 7
3: nonchilled Qn3 7
4: nonchilled Mn3 7
5: nonchilled Mn2 7
6: nonchilled Mn1 7
7: chilled Qc1 7
8: chilled Qc3 7
9: chilled Qc2 7
10: chilled Mc2 7
11: chilled Mc3 7
12: chilled Mc1 7
.N can be used to enumerate records, creating a row counter field:
> (DT[,CNTR:=1:.N])
Plant Type Treatment conc uptake CNTR1: Qn1 Quebec nonchilled 95 16.0 1
2: Qn1 Quebec nonchilled 175 30.4 2
3: Qn1 Quebec nonchilled 250 34.8 3
4: Qn1 Quebec nonchilled 350 37.2 4
5: Qn1 Quebec nonchilled 500 35.3 5
.
.
.79: Mc3 Mississippi chilled 175 18.0 79
80: Mc3 Mississippi chilled 250 17.9 80
81: Mc3 Mississippi chilled 350 17.9 81
82: Mc3 Mississippi chilled 500 17.9 82
83: Mc3 Mississippi chilled 675 18.9 83
84: Mc3 Mississippi chilled 1000 19.9 84
Similiarly, enumeration can be applied by group:
CNTR:=1:.N, by=Plant])
(DT[,
Plant Type Treatment conc uptake CNTR1: Qn1 Quebec nonchilled 95 16.0 1
2: Qn1 Quebec nonchilled 175 30.4 2
3: Qn1 Quebec nonchilled 250 34.8 3
4: Qn1 Quebec nonchilled 350 37.2 4
5: Qn1 Quebec nonchilled 500 35.3 5
6: Qn1 Quebec nonchilled 675 39.2 6
7: Qn1 Quebec nonchilled 1000 39.7 7
8: Qn2 Quebec nonchilled 95 13.6 1
9: Qn2 Quebec nonchilled 175 27.3 2
10: Qn2 Quebec nonchilled 250 37.1 3
11: Qn2 Quebec nonchilled 350 41.8 4
12: Qn2 Quebec nonchilled 500 40.6 5
13: Qn2 Quebec nonchilled 675 41.4 6
14: Qn2 Quebec nonchilled 1000 44.3 7
.SD
.SD (subset of x’s data) contains all columns of the data.table except the grouping columns by default. Referring again to the CO2 dataset, we can use .SD to print the contents of a data.table:
> library("data.table")
> DT = CO2; setDT(DT)
> DT[,.SD]
Plant Type Treatment conc uptake1: Qn1 Quebec nonchilled 95 16.0
2: Qn1 Quebec nonchilled 175 30.4
3: Qn1 Quebec nonchilled 250 34.8
4: Qn1 Quebec nonchilled 350 37.2
.
.
.83: Mc3 Mississippi chilled 675 18.9
84: Mc3 Mississippi chilled 1000 19.9
Although uncommon, .SD can be used for sub-setting columns. Next, we retrieve Treatment and conc from the CO2 dataset using three approaches, demonstrating each method returns an identical subset:
# Using .SD to retrieve Treatment & conc.
> DT1 = DT[, .SD, .SDcols=c("Treatment", "conc")]
> head(DT1)
Treatment conc1: nonchilled 95
2: nonchilled 175
3: nonchilled 250
4: nonchilled 350
5: nonchilled 500
6: nonchilled 675
> DT2 = DT[, .(Treatment, conc)]
# If column names are available only as a character vector,
# i.e., c("Treatment", "conc") include `with=FALSE`.
> DT3 = DT[, c("Treatment", "conc"), with=FALSE]
# Test for equality amongst DT1, DT2 & DT3.
> identical(DT1, DT3); identical(DT2, DT3)
1] TRUE
[1] TRUE [
.SD can be used to retrieve the first 3 rows for all fields partitioned by Plant:
> (DT1 = DT[,.SD[1:3], by=Plant])
Plant Type Treatment conc uptake1: Qn1 Quebec nonchilled 95 16.0
2: Qn1 Quebec nonchilled 175 30.4
3: Qn1 Quebec nonchilled 250 34.8
4: Qn2 Quebec nonchilled 95 13.6
5: Qn2 Quebec nonchilled 175 27.3
6: Qn2 Quebec nonchilled 250 37.1
7: Qn3 Quebec nonchilled 95 16.2
8: Qn3 Quebec nonchilled 175 32.4
9: Qn3 Quebec nonchilled 250 40.3
.
.
.28: Mc1 Mississippi chilled 95 10.5
29: Mc1 Mississippi chilled 175 14.9
30: Mc1 Mississippi chilled 250 18.1
31: Mc2 Mississippi chilled 95 7.7
32: Mc2 Mississippi chilled 175 11.4
33: Mc2 Mississippi chilled 250 12.3
34: Mc3 Mississippi chilled 95 10.6
35: Mc3 Mississippi chilled 175 18.0
36: Mc3 Mississippi chilled 250 17.9
It is possible to retrieve the 1st, 2nd, second-to-last and last records for each unique value present in Type:
> DT[, .SD[c(1, 2, .N-1, .N)], by=Type]
Type Plant Treatment conc uptake1: Quebec Qn1 nonchilled 95 16.0
2: Quebec Qn1 nonchilled 175 30.4
3: Quebec Qc3 chilled 675 39.6
4: Quebec Qc3 chilled 1000 41.4
5: Mississippi Mn1 nonchilled 95 10.6
6: Mississippi Mn1 nonchilled 175 19.2
7: Mississippi Mc3 chilled 675 18.9
8: Mississippi Mc3 chilled 1000 19.9
.SD can be used along with .SDcols to convert fields with a factor datatype to character type. .SDcols is frequently used with .SD to specify a subset of the columns of .SD to be used in j. If .SDcols is present, it is generally bound to a vector of field names, and the operation in j will be applied only to the fields associated with the names specified in .SDcols:
### First, list original field datatypes.
> lapply(DT, class)
$Plant
1] "ordered" "factor"
[
$Type
1] "factor"
[
$Treatment
1] "factor"
[
$conc
1] "numeric"
[
$uptake
1] "numeric"
[
### Convert fields with factor datatype (Plant, Type & Treatment)
### to character type. Isolate factor fieldnames.
> factorFieldnames = names(Filter(is.factor, DT))
### Convert factor datatypes to character types.
> DT[,(factorFieldnames):=lapply(.SD, as.character), .SDcols=factorFieldnames]
### Verify datatypes have been successfully coerced.
> lapply(DT, class)
$Plant
1] "character"
[
$Type
1] "character"
[
$Treatment
1] "character"
[
$conc
1] "numeric"
[
$uptake
1] "numeric" [
Two things to note: First, because we used the assignment by reference operator, :=
, to recast data types, it is not necessary to re-assign the result of the type-coercion back to DT. This is because when using :=
, operations are performed in-place and by reference, therefore the result of a particular action are visible immediately in DT. Second, since factorFieldnames
is enclosed by parentheses preceding :=
, the result is assigned to the columns specified in factorFieldnames
. If parentheses we left out, a new field named factorFieldnames would be added to DT, which is not the desired behavior.
.SD can be used to calculate the average of any fields of interest overall or by group:
### Calculate the overall average of conc & uptake.
> (avgDT1 = DT[, lapply(.SD, mean), .SDcols=c("conc", "uptake")])
conc uptake1: 435 27.2131
### Calculate the average of conc & uptake by Plant.
> (avgDT2 = DT[, lapply(.SD, mean), .SDcols=c("conc", "uptake"), by=Plant])
Plant conc uptake1: Qn1 435 33.22857
2: Qn2 435 35.15714
3: Qn3 435 37.61429
4: Qc1 435 29.97143
5: Qc2 435 32.70000
6: Qc3 435 32.58571
7: Mn1 435 26.40000
8: Mn2 435 27.34286
9: Mn3 435 24.11429
10: Mc1 435 18.00000
11: Mc2 435 12.14286
12: Mc3 435 17.30000
One final example using .SD: As mentioned earlier in the post, .SDcols is used to specify the columns of the data.table that .SD will operate on. Alternatively, we can use .SDcols to exclude fields from the operation performed by .SD. Referring to the example in which we calculated the average of conc & uptake in the CO2 dataset, we can instead specify the fields .SD should not average over. Here’s an alternative approach that results in excluding Plant, Type and Treatment from the average calculation:
= CO2; setDT(DT)
DT
# Calculate the average of conc & uptake by excluding Plant, Type & Treatment.
> (avgDT = DT[, lapply(.SD, mean), .SDcols=!c("Plant", "Type", "Treatment")]))
conc uptake1: 435 27.2131
Clearly in this example it is not advantageous to specify the fields not included in the average, but situations do arise in which this functionality turns out to be very convenient.
.BY
.BY is a list containing a length 1 vector for each item in by. There are few examples in which the .BY special symbol is used, so we’ll reproduce the example included in the data.table vignette:
library("data.table")
> DT = data.table(
x=rep(c("b", "a", "c"), each=3),
v=c(1, 1, 1, 2, 2, 1, 1, 2, 2),
y=c(1, 3, 6),
a=1:9,
b=9:1
)
> head(DT)
x v y a b1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
> X = data.table(
x=c("c","b"),
v=8:7,
foo=c(4,2)
)
> head(X)
x v foo1: c 8 4
2: b 7 2
# Join within each group.
> X[, DT[.BY, y, on="x"], by=x]
x V11: c 1
2: c 3
3: c 6
4: b 1
5: b 3
6: b 6
.I
.I is an integer vector equal to seq_len(nrow(x))
. But we can instead use .I to return a vector containing the row indices of the records in a data.table that meet the condition specified in i. In the next example, we demonstrate how to return the row indices from CO2 having conc>500
using .I:
### Return row indicies in which conc > 500.
> DT[,.I[conc > 500]]
1] 6 7 13 14 20 21 27 28 34 35 41 42 48 49 55 56 62 63 69 70 76 77 83 84 [
Using .I after the condition specified in i returns a vector of length equal to the number of records meeting the i’s criteria:
> DT[conc > 500, .I]
1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 [
Note that use of .I may have been supplanted in favor of data.table’s which parameter. Instead of using .I, we can use which=TRUE
to return the row indices of records meeting the condition specified in i:
> DT[conc > 500, which=TRUE]
1] 6 7 13 14 20 21 27 28 34 35 41 42 48 49 55 56 62 63 69 70 76 77 83 84 [
.GRP
.GRP is an integer, length 1, containing a simple group counter. If used without assignment to a new field, .GRP will return an enumerated data.table containing the unique combination of fields specified in by. If we specify the key of the CO2 dataset as “Plant”, “Type”, “Treatment”, we can return the unique combinations of these fields by referencing the data.table’s key in the by clause:
### Set key on DT.
> setkeyv(DT, c("Plant", "Type", "Treatment"))
### Return enumerated data.table listing unique combinations of fields specified in by.
> DT[, .GRP, by=key(DT)])
Plant Type Treatment GRP1: Qn1 Quebec nonchilled 1
2: Qn2 Quebec nonchilled 2
3: Qn3 Quebec nonchilled 3
4: Qc1 Quebec chilled 4
5: Qc3 Quebec chilled 5
6: Qc2 Quebec chilled 6
7: Mn3 Mississippi nonchilled 7
8: Mn2 Mississippi nonchilled 8
9: Mn1 Mississippi nonchilled 9
10: Mc2 Mississippi chilled 10
11: Mc3 Mississippi chilled 11
12: Mc1 Mississippi chilled 12
Notice that fields not specified in by are omitted from the output.
We can also use .GRP to indicate which unique group a record belongs to for each record in the parent data.table. We use the :=
operator to define a new column (by reference) indicating each record’s associated group:
> (DT[,GROUP_ID:=.GRP, by=key(DT)])
Plant Type Treatment conc uptake GROUP_ID1: Qn1 Quebec nonchilled 95 16.0 1
2: Qn1 Quebec nonchilled 175 30.4 1
3: Qn1 Quebec nonchilled 250 34.8 1
4: Qn1 Quebec nonchilled 350 37.2 1
5: Qn1 Quebec nonchilled 500 35.3 1
6: Qn1 Quebec nonchilled 675 39.2 1
7: Qn1 Quebec nonchilled 1000 39.7 1
8: Qn2 Quebec nonchilled 95 13.6 2
9: Qn2 Quebec nonchilled 175 27.3 2
10: Qn2 Quebec nonchilled 250 37.1 2
.
.
.79: Mc1 Mississippi chilled 175 14.9 12
80: Mc1 Mississippi chilled 250 18.1 12
81: Mc1 Mississippi chilled 350 18.9 12
82: Mc1 Mississippi chilled 500 19.5 12
83: Mc1 Mississippi chilled 675 22.2 12
84: Mc1 Mississippi chilled 1000 21.9 12