A rolling join refers to a situation in which two or more tables need to be associated, but there isn’t a direct correspondence of values in each table’s key column(s). For example, assume we have a table of thresholds used to group claims by loss amount:
library("data.table")
DF1 = data.table(
    group=c("A", "B", "C", "D", "E"),
    loss=c(0, 10000, 20000, 30000, 40000),
    stringsAsFactors=FALSE
)
Reviewing the contents of DF1:
group loss
1: A 0
2: B 10000
3: C 20000
4: D 30000
5: E 40000
Let’s also assume we have a table of claims, DF2:
DF2 = data.table(
    claimno=paste0("000", 10:20),
    loss=c(8101, 15700, 64140, 20000, 11655, 31850, 23680, 41440, 16161, 77000, 4564),
    stringsAsFactors=FALSE
)
Reviewing the contents of DF2:
claimno loss
1: 00010 8101
2: 00011 15700
3: 00012 64140
4: 00013 20000
5: 00014 11655
6: 00015 31850
7: 00016 23680
8: 00017 41440
9: 00018 16161
10: 00019 77000
11: 00020 4564
The goal is, for each claimno in DF2, to assign the corresponding group from DF1 such that the loss threshold (from DF1) is the maximum value less than or equal to loss in DF2. For example, a loss amount of 15700 should be assigned to group B, since 10000 is the maximum loss threshold less than or equal to 15700.
The way many people first attack this problem is with a deeply nested sequence of ifelse statements, something akin to:
DF2[, group := ifelse(loss>=0 & loss<10000, "A",
        ifelse(loss>=10000 & loss<20000, "B",
            ifelse(loss>=20000 & loss<30000, "C",
                ifelse(loss>=30000 & loss<40000, "D",
                    "E"
                )
            )
        )
    )]
Which results in:
claimno loss group
1: 00010 8101 A
2: 00011 15700 B
3: 00012 64140 E
4: 00013 20000 C
5: 00014 11655 B
6: 00015 31850 D
7: 00016 23680 C
8: 00017 41440 E
9: 00018 16161 B
10: 00019 77000 E
11: 00020 4564 A
This solution works, but it is suboptimal for a number of reasons. First, it is overly verbose and brittle: if the number of groups grows from 5 to 10 or 15, the nesting of ifelse calls must be extended by the number of new groups. One should always try to avoid writing code that requires updates in proportion to the size of the input. Perhaps more importantly, this approach has poor runtime performance, as we demonstrate later on.
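As an aside, one vectorized alternative that avoids the nesting entirely is base R's cut(), which builds the breaks directly from DF1, so adding a new group requires no code changes. A minimal sketch, assuming DF1's thresholds are sorted in ascending order:

DF2[, group := as.character(
    cut(loss, breaks=c(DF1$loss, Inf), labels=DF1$group, right=FALSE)
)]

That said, data.table offers an even more natural tool for this pattern: the rolling join.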
Rolling Joins
Performing a rolling join in data.table is straightforward. Simply add the roll argument to the join expression, specifying either +Inf (or TRUE) to roll forward or -Inf to roll backward. Sticking with the same DF1 and DF2 from before, we create a new table DF, which represents DF2 along with the target group associated with each claimno:
DF = DF1[DF2, on="loss", roll=+Inf]
Resulting in:
group loss claimno
1: A 8101 00010
2: B 15700 00011
3: E 64140 00012
4: C 20000 00013
5: B 11655 00014
6: D 31850 00015
7: C 23680 00016
8: E 41440 00017
9: B 16161 00018
10: E 77000 00019
11: A 4564 00020
Note that in this example the key column loss has the same name in both tables. If this were not the case, say, threshold in DF1 vs. loss in DF2, one would specify on=c("threshold"="loss").
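One way to stage that scenario (used in the next example) is to rename the column in place with data.table's setnames():

setnames(DF1, "loss", "threshold")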
For completeness, let’s see what happens if we switch to roll=-Inf (with loss renamed to threshold in DF1, as above):
DF = DF1[DF2, on=c("threshold"="loss"), roll=-Inf]
Resulting in:
group threshold claimno
1: B 8101 00010
2: C 15700 00011
3: <NA> 64140 00012
4: C 20000 00013
5: C 11655 00014
6: E 31850 00015
7: D 23680 00016
8: <NA> 41440 00017
9: C 16161 00018
10: <NA> 77000 00019
11: B 4564 00020
Any loss in excess of the largest threshold gets set to NA, and every other claim gets assigned the minimum threshold from DF1 greater than or equal to its loss in DF2.
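Besides +Inf and -Inf, roll also accepts "nearest", which assigns each claim to whichever threshold is closest in either direction. A quick sketch using the renamed threshold column:

DF = DF1[DF2, on=c("threshold"="loss"), roll="nearest"]

Under this rule, a loss of 8101 would map to the 10000 threshold (group B), since 10000 is closer to 8101 than 0 is.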
Performance Comparison
To demonstrate the difference in performance, we generate a new DF2 with one million random claim amounts, then compare the naive nested-ifelse implementation against the rolling join implementation. To make it easier to profile with microbenchmark, each implementation is encapsulated within a separate function:
library("data.table")
library("microbenchmark")
DF1 = data.table(
    group=c("A", "B", "C", "D", "E"),
    loss=c(0, 10000, 20000, 30000, 40000),
    stringsAsFactors=FALSE
)
DF2 = data.table(
    claimno=formatC(1:1000000, format="d", width=7, flag="0"),
    loss=rgamma(n=1000000, shape=1, scale=25000),
    stringsAsFactors=FALSE
)
# Create copies to operate on for each implementation.
method1DF = data.table::copy(DF2)
method2DF = data.table::copy(DF2)
fmethod1 = function() {
    # First method.
    method1DF[, group := ifelse(loss>=0 & loss<10000, "A",
            ifelse(loss>=10000 & loss<20000, "B",
                ifelse(loss>=20000 & loss<30000, "C",
                    ifelse(loss>=30000 & loss<40000, "D",
                        "E"
                    )
                )
            )
        )]
}
fmethod2 = function() {
    # Second method.
    DF = DF1[method2DF, on=c("loss"), roll=+Inf]
    DF
}
# Run comparison 10 times, compare max result from each.
microbenchmark(
    fmethod1(),
    fmethod2(),
    times=10
)
The results from microbenchmark are provided below:
Unit: milliseconds
expr min lq mean median uq max neval
fmethod1() 2116.3529 2212.5053 2518.0355 2558.7205 2779.6253 3061.8588 10
fmethod2() 494.8094 536.9963 622.7095 586.1551 677.1586 825.6939 10
In the worst case, the rolling join approach is almost 4 times faster, and as the number of records increases, so does the relative performance advantage of the rolling join over the nested-ifelse approach.
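As a quick sanity check, we can also confirm that the two implementations assign identical groups; a brief sketch, run after the setup above:

# Both methods should agree row for row on the one-million-claim sample.
fmethod1()
res = fmethod2()
stopifnot(identical(method1DF$group, res$group))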