Almost all joins between 2 data.tables use a notation
where one of them is used as i in a frame applied to the
other, and the joining columns are specified with the on
parameter. However, in addition to the “basic” joins,
data.table allows for special cases like rolling joins,
summarizing while joining, non-equi joins, etc. This vignette will
describe the notation to apply these joins with verbs defined in
table.express, which, like the single-table
verbs, build data.table expressions.
We’ll consider most of the dplyr joining verbs in this
section:
inner_joinleft_joinright_joinanti_joinsemi_joinfull_joinA <- data.table::data.table(x = rep(c("b", "a", "c"), each = 3),
y = c(1, 3, 6),
v = 1:9)
B <- data.table::data.table(x = c("c", "b"),
v2 = 8:7,
foo = c(4, 2))
A#> x y v
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: a 1 4
#> 5: a 3 5
#> 6: a 6 6
#> 7: c 1 7
#> 8: c 3 8
#> 9: c 6 9
#> x v2 foo
#> 1: c 8 4
#> 2: b 7 2
The methods defined in table.express accept the
on part of the expression in their ellipsis:
#> x y v v2 foo
#> 1: c 1 7 8 4
#> 2: c 3 8 8 4
#> 3: c 6 9 8 4
#> 4: b 1 1 7 2
#> 5: b 3 2 7 2
#> 6: b 6 3 7 2
#> x y v foo
#> 1: c 3 8 4
An important thing to note in the second example above is the order
in which the columns are given, i.e. that v is written
before v2, since the order is relevant for
data.table. We can remember the correct order simply by
looking at which data.table appears first in the
expression, and knowing that said data.table’s columns must
appear first in the on expressions. In this case,
A appears before B, so writing
v2 = v would not work.
In order to maintain consistency in the ordering just described,
left_join automatically swaps expression elements
internally in order to build the expression:
#> .DT_[.DT_0_, on = list(x, v2 = v), nomatch = , mult = , roll = ,
#> rollends = ]
#> x v2 foo y
#> 1: b 1 NA 1
#> 2: b 2 NA 3
#> 3: b 3 NA 6
#> 4: a 4 NA 1
#> 5: a 5 NA 3
#> 6: a 6 NA 6
#> 7: c 7 NA 1
#> 8: c 8 4 3
#> 9: c 9 NA 6
We can also see an extra .DT_0_ pronoun in the
expression. These special .DT_*_ pronouns hold the
different data.tables that have entered the expression so
far, and are automatically assigned to the evaluation’s environment. In
this case, .DT_ holds B and
.DT_0_ holds A.
No additional considerations are required to use
right_join or anti_join:
#> x y v foo
#> 1: c 3 8 4
#> 2: b NA 7 2
#> x y v
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: a 1 4
#> 5: a 3 5
#> 6: a 6 6
#> 7: c 1 7
#> 8: c 6 9
A semi_join is essentially a right_join
with nomatch = NULL, and where j is set to
unique(.SD):
#> x y v
#> 1: c 1 7
#> 2: c 3 8
#> 3: c 6 9
#> 4: b 1 1
#> 5: b 3 2
#> 6: b 6 3
Finally, full_join is basically a wrapper for
merge specifying all = TRUE:
#> x y v.x v.y foo
#> 1: a 1 4 NA NA
#> 2: a 3 5 NA NA
#> 3: a 6 6 NA NA
#> 4: b 1 1 7 2
#> 5: b 3 2 7 2
#> 6: b 6 3 7 2
#> 7: c 1 7 8 4
#> 8: c 3 8 8 4
#> 9: c 6 9 8 4
Something to keep in mind is that most joins specify i
and on inside the frame, so any subsequent verbs that
specify j, by, or keyby would
still be possible. In order to enable this, many joining verbs have an
.expr parameter that indicates that the expression should
be kept, delaying evaluation until a verb that sets j is
used. This can be useful if only a subset of the joined columns is
desired:
#> x y foo
#> 1: b 1 2
#> 2: b 3 2
#> 3: b 6 2
#> 4: a 1 NA
#> 5: a 3 NA
#> 6: a 6 NA
#> 7: c 1 4
#> 8: c 3 4
#> 9: c 6 4
But, when working lazily, this would require explicit chaining for expressions that should be applied to the joined table:
A %>%
start_expr %>%
left_join(B, x) %>%
chain %>%
group_by(x) %>%
mutate(y = cumsum(y)) %>%
end_expr %>% {
invisible(print(.))
}#> x v foo y i.v
#> 1: b 7 2 1 1
#> 2: b 7 2 4 2
#> 3: b 7 2 10 3
#> 4: a NA NA 1 4
#> 5: a NA NA 4 5
#> 6: a NA NA 10 6
#> 7: c 8 4 1 7
#> 8: c 8 4 4 8
#> 9: c 8 4 10 9
This is particularly important if the selection expressions call any
function, e.g. tidyselect helpers or even :
with non-numerics, because data.table does not support that
kind of syntax for j in the same frame as a join.
A special data.table idiom is described here as an update
join. In order to highlight the fact that it modifies the left-hand
side table by reference, the mutate_join verb is defined in
table.express. Said verb accepts the columns to be added in
its .SDcols parameter, possibly with new names:
#> x y v foo v_from_B
#> 1: b 1 1 2 7
#> 2: b 3 2 2 7
#> 3: b 6 3 2 7
#> 4: a 1 4 NA NA
#> 5: a 3 5 NA NA
#> 6: a 6 6 NA NA
#> 7: c 1 7 4 8
#> 8: c 3 8 4 8
#> 9: c 6 9 4 8
A particularity of this idiom is that the number of rows from the resulting join must match the left-hand side exactly or not at all, so this won’t work:
In these cases, we must either use mult if appropriate,
or specify a summarizing expression in .SDcols:
#> x v foo y
#> 1: c 8 4 1
#> 2: b 7 2 1
#> x v foo y
#> 1: c 8 4 3.333333
#> 2: b 7 2 3.333333
The last example specifies by = .EACHI in the joining
expression.
A nice blog post describing rolling joins can be found at R-bloggers,
so almost the same website and paypal tables
will be used for the examples below. Another short description with
animated depictions can also be found here.
#> name session_start_time session_id
#> 1: Erica 2016-01-04 19:12:00 1
#> 2: Erica 2016-01-04 21:05:00 2
#> 3: Francis 2016-01-02 13:09:00 3
#> 4: Francis 2016-01-03 19:22:00 4
#> 5: Francis 2016-01-08 08:44:00 5
#> 6: Francis 2016-01-08 20:22:00 6
#> 7: Francis 2016-01-10 17:36:00 7
#> 8: Francis 2016-01-15 16:56:00 8
#> 9: Isabel 2016-01-01 11:01:00 9
#> 10: Isabel 2016-01-02 08:59:00 10
#> 11: Isabel 2016-01-05 18:18:00 11
#> 12: Isabel 2016-01-07 19:03:00 12
#> 13: Isabel 2016-01-08 19:01:00 13
#> 14: Sally 2016-01-03 10:00:00 14
#> 15: Vivian 2016-01-01 09:10:00 15
#> 16: Vivian 2016-01-09 02:15:00 16
#> name purchase_time payment_id
#> 1: Erica 2016-01-03 08:02:00 1
#> 2: Francis 2016-01-03 19:28:00 2
#> 3: Francis 2016-01-08 20:33:00 3
#> 4: Francis 2016-01-10 17:46:00 4
#> 5: Isabel 2016-01-08 19:10:00 5
#> 6: Mom 2015-12-02 17:58:00 6
#> 7: Sally 2016-01-03 10:06:00 7
#> 8: Sally 2016-01-03 10:15:00 8
In contrast to the blog post, no join_time is added to
the tables. This is done on purpose in order to show what happens with
the columns that are rolled.
Let’s use a left rolling join to obtain the session_id
that immediately preceded a purchase, if any:
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 4 2
#> 3: Francis 2016-01-08 20:33:00 6 3
#> 4: Francis 2016-01-10 17:46:00 7 4
#> 5: Isabel 2016-01-08 19:10:00 13 5
#> 6: Mom 2015-12-02 17:58:00 NA 6
#> 7: Sally 2016-01-03 10:06:00 14 7
#> 8: Sally 2016-01-03 10:15:00 14 8
We can see that the rows returned are from the left-hand side
(paypal), and since neither Mom nor Erica visited the
website before their purchases, their session_id ended as
NA.
The order of the columns in the on expressions is the
same as above. The tricky part is that the
rolled column ended up with the name from the right-hand side, but
keeping the values from the left-hand side. If we “invert” the join, the
result is the same, but the rolled column’s name is now from the
expression’s left-hand side.
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 4 2
#> 3: Francis 2016-01-08 20:33:00 6 3
#> 4: Francis 2016-01-10 17:46:00 7 4
#> 5: Isabel 2016-01-08 19:10:00 13 5
#> 6: Mom 2015-12-02 17:58:00 NA 6
#> 7: Sally 2016-01-03 10:06:00 14 7
#> 8: Sally 2016-01-03 10:15:00 14 8
Note, however, that roll stayed equal to
Inf. This is because even though the column order in the
expressions changed, we could understand the rolling expressions as
follows:
left_join, the rolling column on the left is
purchase_time, so with roll = Inf, the values
from session_start_time are rolled forward onto
purchase_time to find a match while joining.right_join, the rolling column on the right is
purchase_time, so roll must stay as
Inf to keep the same semantics.Now let’s say we want to keep all the rows from website
and find the closest payment_id that occurred
after the visit. This could be expressed as:
#> name purchase_time payment_id session_id
#> 1: Erica 2016-01-04 19:12:00 NA 1
#> 2: Erica 2016-01-04 21:05:00 NA 2
#> 3: Francis 2016-01-02 13:09:00 2 3
#> 4: Francis 2016-01-03 19:22:00 2 4
#> 5: Francis 2016-01-08 08:44:00 3 5
#> 6: Francis 2016-01-08 20:22:00 3 6
#> 7: Francis 2016-01-10 17:36:00 4 7
#> 8: Francis 2016-01-15 16:56:00 NA 8
#> 9: Isabel 2016-01-01 11:01:00 5 9
#> 10: Isabel 2016-01-02 08:59:00 5 10
#> 11: Isabel 2016-01-05 18:18:00 5 11
#> 12: Isabel 2016-01-07 19:03:00 5 12
#> 13: Isabel 2016-01-08 19:01:00 5 13
#> 14: Sally 2016-01-03 10:00:00 7 14
#> 15: Vivian 2016-01-01 09:10:00 NA 15
#> 16: Vivian 2016-01-09 02:15:00 NA 16
In order to simplify the meaning of rollends a bit, we
could think of it as missing or being a single
TRUE/FALSE. If it’s missing, rolling works
according to the value of roll, otherwise:
rollends = TRUE, the value of roll is
inverted only for those rows that would have no match
otherwise.rollends = FALSE, a matching roll will only occur
if the column’s value falls in a gap with values both before
and after.website %>%
left_join(paypal, name, session_start_time = purchase_time, roll = -Inf, rollends = TRUE)#> name purchase_time payment_id session_id
#> 1: Erica 2016-01-04 19:12:00 1 1
#> 2: Erica 2016-01-04 21:05:00 1 2
#> 3: Francis 2016-01-02 13:09:00 2 3
#> 4: Francis 2016-01-03 19:22:00 2 4
#> 5: Francis 2016-01-08 08:44:00 3 5
#> 6: Francis 2016-01-08 20:22:00 3 6
#> 7: Francis 2016-01-10 17:36:00 4 7
#> 8: Francis 2016-01-15 16:56:00 4 8
#> 9: Isabel 2016-01-01 11:01:00 5 9
#> 10: Isabel 2016-01-02 08:59:00 5 10
#> 11: Isabel 2016-01-05 18:18:00 5 11
#> 12: Isabel 2016-01-07 19:03:00 5 12
#> 13: Isabel 2016-01-08 19:01:00 5 13
#> 14: Sally 2016-01-03 10:00:00 7 14
#> 15: Vivian 2016-01-01 09:10:00 NA 15
#> 16: Vivian 2016-01-09 02:15:00 NA 16
website %>%
left_join(paypal, name, session_start_time = purchase_time, roll = -Inf, rollends = FALSE)#> name purchase_time payment_id session_id
#> 1: Erica 2016-01-04 19:12:00 NA 1
#> 2: Erica 2016-01-04 21:05:00 NA 2
#> 3: Francis 2016-01-02 13:09:00 NA 3
#> 4: Francis 2016-01-03 19:22:00 NA 4
#> 5: Francis 2016-01-08 08:44:00 3 5
#> 6: Francis 2016-01-08 20:22:00 3 6
#> 7: Francis 2016-01-10 17:36:00 4 7
#> 8: Francis 2016-01-15 16:56:00 NA 8
#> 9: Isabel 2016-01-01 11:01:00 NA 9
#> 10: Isabel 2016-01-02 08:59:00 NA 10
#> 11: Isabel 2016-01-05 18:18:00 NA 11
#> 12: Isabel 2016-01-07 19:03:00 NA 12
#> 13: Isabel 2016-01-08 19:01:00 NA 13
#> 14: Sally 2016-01-03 10:00:00 NA 14
#> 15: Vivian 2016-01-01 09:10:00 NA 15
#> 16: Vivian 2016-01-09 02:15:00 NA 16
Vivian’s payment_ids are always NA because
she has never purchased anything. On the other hand, no one except
Francis has visited the website both before and after a purchase.
Non-equi joins are similar to rolling joins, but instead of rolling a single row’s value, they can return several values per row.
Using the same data as before, we could find all the
session_ids that preceded a payment_id, giving
“priority” to paypal’s rows:
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 3 2
#> 3: Francis 2016-01-03 19:28:00 4 2
#> 4: Francis 2016-01-08 20:33:00 3 3
#> 5: Francis 2016-01-08 20:33:00 4 3
#> 6: Francis 2016-01-08 20:33:00 5 3
#> 7: Francis 2016-01-08 20:33:00 6 3
#> 8: Francis 2016-01-10 17:46:00 3 4
#> 9: Francis 2016-01-10 17:46:00 4 4
#> 10: Francis 2016-01-10 17:46:00 5 4
#> 11: Francis 2016-01-10 17:46:00 6 4
#> 12: Francis 2016-01-10 17:46:00 7 4
#> 13: Isabel 2016-01-08 19:10:00 9 5
#> 14: Isabel 2016-01-08 19:10:00 10 5
#> 15: Isabel 2016-01-08 19:10:00 11 5
#> 16: Isabel 2016-01-08 19:10:00 12 5
#> 17: Isabel 2016-01-08 19:10:00 13 5
#> 18: Mom 2015-12-02 17:58:00 NA 6
#> 19: Sally 2016-01-03 10:06:00 14 7
#> 20: Sally 2016-01-03 10:15:00 14 8
Priority above simply means that all rows from paypal
are returned, even if they don’t have a match in website.
Even though a column session_start_time appears in the
result, the values contained therein are from paypal’s
purchase_time.
A corresponding right non-equi join would yield the same result, expecting only a different order in the columns that are part of the comparisons:
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 3 2
#> 3: Francis 2016-01-03 19:28:00 4 2
#> 4: Francis 2016-01-08 20:33:00 3 3
#> 5: Francis 2016-01-08 20:33:00 4 3
#> 6: Francis 2016-01-08 20:33:00 5 3
#> 7: Francis 2016-01-08 20:33:00 6 3
#> 8: Francis 2016-01-10 17:46:00 3 4
#> 9: Francis 2016-01-10 17:46:00 4 4
#> 10: Francis 2016-01-10 17:46:00 5 4
#> 11: Francis 2016-01-10 17:46:00 6 4
#> 12: Francis 2016-01-10 17:46:00 7 4
#> 13: Isabel 2016-01-08 19:10:00 9 5
#> 14: Isabel 2016-01-08 19:10:00 10 5
#> 15: Isabel 2016-01-08 19:10:00 11 5
#> 16: Isabel 2016-01-08 19:10:00 12 5
#> 17: Isabel 2016-01-08 19:10:00 13 5
#> 18: Mom 2015-12-02 17:58:00 NA 6
#> 19: Sally 2016-01-03 10:06:00 14 7
#> 20: Sally 2016-01-03 10:15:00 14 8
In case a self join were necessary, perhaps while using a rolling or
non-equi join, the way magrittr’s pipe handles the
. outside of nested calls wouldn’t allow calling a joining
verb with . both in x and y. To
work around this, the following verbs default to an eager self join when
y is missing:
full_joinleft_joinmutate_joinAs a somewhat contrived example, we could add a rolling count of
weekly visits per user to the website data introduced above:
website %>%
mutate(window_start = session_start_time - as.difftime(7, units = "days")) %>%
mutate_join(, name, window_start <= session_start_time, session_start_time >= session_start_time,
.SDcols = .(weekly_visits = .N),
.by_each = TRUE) %>%
mutate(window_start = NULL)
print(website)#> name session_start_time session_id weekly_visits
#> 1: Erica 2016-01-04 19:12:00 1 1
#> 2: Erica 2016-01-04 21:05:00 2 2
#> 3: Francis 2016-01-02 13:09:00 3 1
#> 4: Francis 2016-01-03 19:22:00 4 2
#> 5: Francis 2016-01-08 08:44:00 5 3
#> 6: Francis 2016-01-08 20:22:00 6 4
#> 7: Francis 2016-01-10 17:36:00 7 4
#> 8: Francis 2016-01-15 16:56:00 8 3
#> 9: Isabel 2016-01-01 11:01:00 9 1
#> 10: Isabel 2016-01-02 08:59:00 10 2
#> 11: Isabel 2016-01-05 18:18:00 11 3
#> 12: Isabel 2016-01-07 19:03:00 12 4
#> 13: Isabel 2016-01-08 19:01:00 13 4
#> 14: Sally 2016-01-03 10:00:00 14 1
#> 15: Vivian 2016-01-01 09:10:00 15 1
#> 16: Vivian 2016-01-09 02:15:00 16 1