# PyTorch TensorIterator Internals - 2021 Update

For contributors to the PyTorch codebase, one of the most commonly encountered
C++ classes is `TensorIterator`. `TensorIterator` offers a standardized way to
iterate over the elements of a tensor, automatically parallelizing operations
while abstracting device and data type details.

In April 2020, Sameer Deshmukh wrote a blog article discussing PyTorch TensorIterator Internals. Recently, however, the interface has changed significantly. This post describes how to use the current interface as of April 2021. Much of the information from the previous article is directly copied here, but with updated API calls and some extra details.

All of the code examples below can be compiled and run in this GitHub repo.

## Basics of TensorIterator and TensorIteratorConfig

In order to create a `TensorIterator`, a `TensorIteratorConfig` must be created
first. `TensorIteratorConfig` specifies the input and output tensors that will
be iterated over, whether all tensors are expected to share the same data type
and device, and a handful of other settings. After setting up the
configuration, we can call `TensorIteratorConfig::build()` to obtain
a `TensorIterator` that has the specified settings. `TensorIterator` is
immutable, so once it is created, its configuration cannot be changed.

In the following example, a tensor named `out` is configured as the output
tensor and `a` and `b` are the input tensors. Calling `build` creates the
`TensorIterator` object from the specified configuration.

```cpp
at::TensorIteratorConfig iter_config;
iter_config
    .add_output(out)
    .add_input(a)
    .add_input(b);
auto iter = iter_config.build();
```

## Performing iterations

Iterations using `TensorIterator` can be classified as point-wise iterations or
reduction iterations. This distinction plays a fundamental role in how iterations using
`TensorIterator` are parallelized. Point-wise iterations can be freely
parallelized along any dimension and grain size, while reduction operations have
to be parallelized either along dimensions that you're not iterating over or by
performing bisect-and-reduce operations along the dimension being iterated over.
Note that for CUDA, it is possible to parallelize along the reduction
dimension, but synchronizations are needed to avoid race conditions.
Parallelization with vectorized operations can also be implemented.
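The difference can be sketched in plain C++ (illustrative only, not the ATen
API): a point-wise operation lets each thread write a disjoint slice of the
output with no coordination, while a reduction must accumulate into per-thread
partial results that are combined serially afterwards to avoid races.

```cpp
#include <thread>
#include <vector>

// Point-wise: split the output range between threads; each thread owns a
// disjoint slice of `out`, so no synchronization is needed.
std::vector<float> pointwise_add(const std::vector<float>& a,
                                 const std::vector<float>& b) {
  std::vector<float> out(a.size());
  auto chunk = [&](size_t begin, size_t end) {
    for (size_t i = begin; i < end; i++) out[i] = a[i] + b[i];
  };
  size_t mid = a.size() / 2;
  std::thread t(chunk, (size_t)0, mid);  // first half on another thread
  chunk(mid, a.size());                  // second half on this thread
  t.join();
  return out;
}

// Reduction: two threads summing into one accumulator would race. Instead,
// each thread reduces its own chunk ("bisect and reduce") and the partial
// sums are combined serially at the end.
float parallel_sum(const std::vector<float>& v) {
  size_t mid = v.size() / 2;
  float lo = 0.0f, hi = 0.0f;
  std::thread t([&] { for (size_t i = 0; i < mid; i++) lo += v[i]; });
  for (size_t i = mid; i < v.size(); i++) hi += v[i];
  t.join();
  return lo + hi;
}
```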

### Iteration details

The simplest iteration operation can be performed using the `for_each`
function. This function has two overloads: one takes a function object which
iterates over a single dimension (`loop_t`); the other takes a function object
which iterates over two dimensions simultaneously (`loop2d_t`). Find their
definitions here. The simplest way of using `for_each` is to pass it a lambda
of type `loop_t` (or `loop2d_t`).

In the example below, the `char** data` argument of the `copy_loop` function
(which is an instance of the `loop_t` lambda) contains a `char*` pointer for
each of the tensors, in the order that they are specified in the
`TensorIteratorConfig`. To make the implementation agnostic of any particular
data type, each pointer is typecast to `char*`, so we can access the underlying
data as an array of bytes.

The second argument is `const int64_t* strides`, which is an array containing
the strides of each tensor in the dimension that you're iterating over. We can
add this stride to the pointer received in order to reach the next element in
the tensor. The last argument is `int64_t n`, which is the size of the
dimension being iterated over.

`for_each` implicitly parallelizes the operation by executing `copy_loop` in
parallel if the number of iterations is more than the value of
`internal::GRAIN_SIZE`, which is a value that is determined as the 'right
amount' of data to iterate over in order to gain a significant speedup using
multi-threaded execution. If you want to explicitly specify that your operation
*must* run in serial, then use the `serial_for_each` loop.
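The grain-size decision can be sketched in plain C++ (illustrative only; the
real splitting logic lives in ATen's `at::parallel_for`, and
`at::internal::GRAIN_SIZE` is 32768 at the time of writing; the `num_chunks`
helper below is a made-up name for illustration):

```cpp
#include <cstdint>

// Stand-in for at::internal::GRAIN_SIZE (32768 in current PyTorch).
constexpr int64_t GRAIN_SIZE = 32768;

// Returns how many chunks the work would be split into. A result of 1 means
// the loop runs serially, which is what serial_for_each always does.
int64_t num_chunks(int64_t numel, int64_t max_threads) {
  if (numel < GRAIN_SIZE) return 1;  // too little work: parallel overhead wins
  // Otherwise split into chunks of at least GRAIN_SIZE elements, capped at
  // the number of available threads.
  int64_t chunks = (numel + GRAIN_SIZE - 1) / GRAIN_SIZE;
  return chunks < max_threads ? chunks : max_threads;
}
```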

```cpp
at::TensorIteratorConfig iter_config;
iter_config
    .add_output(out)
    .add_input(a)
    // call if output was already allocated
    .resize_outputs(false)
    // call if inputs/outputs have different types
    .check_all_same_dtype(false);
auto iter = iter_config.build();

// Copies data from input into output
auto copy_loop = [](char** data, const int64_t* strides, int64_t n) {
  auto* out_data = data[0];
  auto* in_data = data[1];
  for (int64_t i = 0; i < n; i++) {
    // assume float data type for this example
    *reinterpret_cast<float*>(out_data) = *reinterpret_cast<float*>(in_data);
    out_data += strides[0];
    in_data += strides[1];
  }
};

iter.for_each(copy_loop);
```

#### Using kernels for iterations

Frequently we want to create a kernel that applies a simple point-wise function
to entire tensors. `TensorIterator` provides various such generic kernels
that can be used for iterating over the elements of a tensor without having to
worry about the strides, the data types of the operands, or the details of the
parallelism.

For example, say we want to build a function that performs the point-wise
addition of two tensors and stores the result in a third tensor. We can use the
`cpu_kernel` function. Note that in this example we assume a tensor of `float`,
but you can use one of the `AT_DISPATCH_ALL_TYPES*` macros to support multiple
data types.

```cpp
at::TensorIteratorConfig iter_config;
iter_config
    .add_output(c)
    .add_input(a)
    .add_input(b);
auto iter = iter_config.build();

// Element-wise add
at::native::cpu_kernel(iter, [] (float a, float b) -> float {
  return a + b;
});
```

Writing the kernel in this way ensures that the value returned by the lambda
passed to `cpu_kernel` will populate the corresponding position in the target
output tensor, as long as the inputs strictly broadcast over the output--that
is, if the output's shape is equal to or greater than the input shape in all
dimensions.

#### Setting tensor iteration dimensions

The values of the sizes and strides determine which dimension of the tensor
you will iterate over. `TensorIterator` performs optimizations to make sure
that at least most of the iterations happen on contiguous data, to take
advantage of hierarchical cache-based memory architectures (think dimension
coalescing and reordering for maximum data locality).

A multi-dimensional tensor has a stride value for each dimension, so the stride
that `TensorIterator` needs to use will be different depending on which
dimension you want to iterate over. `TensorIterator` directly computes the
strides that get passed into the loop by itself within the `build()` function.
How exactly it computes the dimension to iterate over is something that should
be properly understood in order to use `TensorIterator` effectively.

When performing a reduction operation (see the `sum_out` code in
ReduceOps.cpp), `TensorIterator` will figure out the dimensions to reduce
based on the shapes of the input and output tensors, which determine how the
input will be broadcast over the output. If you're performing a simple
point-wise operation between two tensors (like an `addcmul` from
PointwiseOps.cpp), the iteration will happen over the entire tensor, without
providing a choice of the dimension. This allows `TensorIterator` to freely
parallelize the computation, without guaranteeing the order of execution, since
the order does not matter anyway.

For something like a cumulative sum operation, where you want to be able to choose the dimension to reduce but iterate over multiple non-reduced dimensions (possibly in parallel), you must be careful to take into account two different strides--one for the dimension being reduced and one for all other dimensions. Take a look at the following example of a somewhat simplified version of the cumsum kernel.

For a 1-D input, torch.cumsum calculates the sum of all elements from the
beginning of the vector up to and including each position in the input. A 2-D
input is treated as a list of vectors, and the cumulative sum is calculated for
each vector. Higher dimensional inputs follow the same logic--everything is
just a list of 1-D vectors. So to implement a cumulative sum, we must take into
account two different strides: the stride between elements in a vector
(`result_dim_stride` and `self_dim_stride` in the example below) and the stride
between each vector (`strides[0]` and `strides[1]` in the example below).

```cpp
// A cumulative sum's output is the same size as the input
at::Tensor result = at::empty_like(self);

at::TensorIteratorConfig iter_config;
auto iter = iter_config
    .check_all_same_dtype(false)
    .resize_outputs(false)
    .declare_static_shape(self.sizes(), /*squash_dim=*/dim)
    .add_output(result)
    .add_input(self)
    .build();

// Size of the dimension to calculate the cumulative sum across
int64_t self_dim_size = at::native::ensure_nonempty_size(self, dim);

// These strides indicate the number of memory-contiguous elements, not bytes,
// between each successive element in dimension `dim`
auto result_dim_stride = at::native::ensure_nonempty_stride(result, dim);
auto self_dim_stride = at::native::ensure_nonempty_stride(self, dim);

auto loop = [&](char** data, const int64_t* strides, int64_t n) {
  // There are `n` individual vectors that span across dimension `dim`, so
  // `n` is equal to the number of elements in `self` divided by the size of
  // dimension `dim`
  auto* result_data_bytes = data[0];
  const auto* self_data_bytes = data[1];

  for (int64_t vector_idx = 0; vector_idx < n; ++vector_idx) {
    // Calculate the cumulative sum for each element of the vector
    auto cumulative_sum = (at::acc_type<float, false>)0;
    for (int64_t elem_idx = 0; elem_idx < self_dim_size; ++elem_idx) {
      const auto* self_data = reinterpret_cast<const float*>(self_data_bytes);
      auto* result_data = reinterpret_cast<float*>(result_data_bytes);
      cumulative_sum += self_data[elem_idx * self_dim_stride];
      result_data[elem_idx * result_dim_stride] = (float)cumulative_sum;
    }
    // Advance to the next vector; `strides[0]` and `strides[1]` are the byte
    // strides that separate each vector that spans across dimension `dim`
    result_data_bytes += strides[0];
    self_data_bytes += strides[1];
  }
};

iter.for_each(loop);
```

#### Helper functions

There are many helper functions within PyTorch that can simplify the creation
and execution of a `TensorIterator`. We cannot cover all of them in this blog
post, so you will need to discover them on your own. However, let's discuss
one of the most common ones: `make_reduction`.

`make_reduction` creates a `TensorIterator` specifically for a reduction
operation with one input and one output. It handles all of the
`TensorIteratorConfig` setup internally, so we don't need to write as much
boilerplate code.

The following example uses `make_reduction` to create a `TensorIterator` which
is used to calculate the sum reduction of a 3-D input across dimension 1. This
is equivalent to `torch.sum(self, dim=1)` in Python. If we didn't use
`make_reduction`, this code would be a bit more complex and more difficult to
write.

In this example, as opposed to the previous examples, we do not need to advance
the `out_data` pointer. In fact, the value of `strides[0]` in this case is `0`.
The reason for this is that the `TensorIterator` generated by `make_reduction`
was initialized with `is_reduction(true)`, and when `for_each` is called,
`sum_reduce_loop` is executed once per element of the output tensor. Thus,
`sum_reduce_loop` only needs to iterate across the input data, adding each
input element to the corresponding reduced output element. The operation is
thread-safe as well, so the `for_each` call is free to split up individual
`sum_reduce_loop` executions across multiple threads to parallelize the
calculation.

```cpp
at::Tensor self = at::randn({10, 10, 10});
int64_t dim = 1;
bool keepdim = false;

// `make_reduction` will resize the result tensor for us, so we
// can set its size to (0)
at::Tensor result = at::empty({0}, self.options());

auto iter = at::native::make_reduction(
    "sum_reduce",
    result,
    self,
    dim,
    keepdim,
    self.scalar_type());

// Sum-reduce data from input into output
auto sum_reduce_loop = [](char** data, const int64_t* strides, int64_t n) {
  auto* out_data = data[0];
  auto* in_data = data[1];
  assert(strides[0] == 0);
  // assume float data type for this example
  *reinterpret_cast<float*>(out_data) = 0;
  for (int64_t i = 0; i < n; i++) {
    *reinterpret_cast<float*>(out_data) += *reinterpret_cast<float*>(in_data);
    in_data += strides[1];
  }
};

iter.for_each(sum_reduce_loop);
```

## Conclusion

This post was a very short introduction to what `TensorIterator` is actually
capable of. If you want to learn more about how it works and what goes into
things like collapsing the tensor size for optimizing memory access, a good
place to start would be the `build()` function in TensorIterator.cpp.
Also have a look at this wiki page from the PyTorch team on using
`TensorIterator`.