In theory, tasks and regions are sufficient for parallel execution. If you create `N` regions and `N` tasks, those will all be able to run in parallel. But it’s more pleasant (and more efficient too) to use data partitioning. Partitioning provides a way to describe the subsets of a region’s elements needed for a given task, without needing to grab the entire thing. This enables parallelism (tasks using different subsets can run in parallel) and also opens the door to more advanced techniques that we’ll see in future tutorials.
Partitions divide regions into a number of subregions. There are many partitioning operators, but the most basic is `equal`. The `equal` operator creates a specified number of roughly equal-sized subregions. (The subregions are only roughly equal-sized because the number of subregions may not evenly divide the number of elements in the region being divided.)
The following code divides `r` into `N` subregions:
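A minimal sketch, assuming `r` is a region over a one-dimensional (`int1d`) index space and `N` is an integer defined earlier:

```
-- Creates N roughly equal subregions of r, indexed by colors 0 through N-1.
var p = partition(equal, r, ispace(int1d, N))
```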
Having partitioned `r` as `p`, we can then access the `i`th subregion as `p[i]`. This enables us to write, for example, a loop of tasks over the subregions of a partition.
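For instance, with a hypothetical task `do_work` that takes a single subregion as its argument, such a loop might look like:

```
-- do_work is a placeholder for any task that takes a region argument.
for i = 0, N do
  do_work(p[i])
end
```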
The `equal` operator is always guaranteed to produce subregions that satisfy certain relationships. In particular:
- The subregions `p[i]` and `p[j]` are guaranteed to be disjoint (that is, they do not contain any common elements), for all `i != j`.
- The subregions of `p`, taken as a whole, are guaranteed to be complete. That is, the union of `p[i]` for all `i` is equal to the original parent region.
These properties are not guaranteed to hold for all possible partitions. There are partitions in Regent that are aliased (`p[i]` overlaps `p[j]` for some `i != j`), and ones that are incomplete (the union of the subregions does not cover the parent region). We’ll see more of these partitions in future tutorials. But for now, we’ll stick with the `equal` operator and its disjoint and complete partitions.
One of the most important properties of subregions is that they are views onto their parent regions. That is, they do not copy their elements from the parent, but refer to the parent’s data directly. This means that changes in the subregion are visible in the parent, and vice versa. (While we haven’t seen any aliased partitions yet, it also means that when two subregions overlap, changes in one are visible in the other.)
This is easiest to see if we create a simple program with one partition:
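A minimal, self-contained sketch of such a program (the element count of 10 and the two-way split are chosen to match the description that follows):

```
import "regent"

task main()
  var r = region(ispace(int1d, 10), int)        -- parent region with 10 elements
  var p = partition(equal, r, ispace(int1d, 2)) -- split into 2 equal subregions
  var s = p[0]                                  -- first subregion: elements 0 through 4
end
regentlib.start(main)
```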
Here, `s` is the first subregion of `r` via `p`. The parent region `r` contains elements `0` through `9`, while `s` contains elements `0` through `4`.
If we write to `s`, we’ll see those changes reflected in `r` as well:
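Continuing the sketch above inside `main` (the value 5 is arbitrary):

```
  for i in s.ispace do
    s[i] = 5  -- write through the subregion
  end
  regentlib.assert(r[0] == 5, "changes made through s are visible in r")
```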
Similarly, we can go the other direction:
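Again continuing the sketch (the value 7 is arbitrary):

```
  for i in r.ispace do
    r[i] = 7  -- write through the parent region
  end
  regentlib.assert(s[0] == 7, "changes made through r are visible in s")
```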
We’ll return to this in the next tutorial, where we’ll see how it interacts with multiple partitions of a region. For now, remember that subregions are views and refer directly to their parent’s elements.
Partitions can be passed to tasks and returned from tasks. One thing to keep in mind is that the type of a partition includes the parent region it was derived from. Therefore, when passing a partition to a task, it is necessary to pass the parent region as well.
This task:
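(A sketch; the task name `sum_all` and its body are just for illustration.)

```
-- The partition's type names its parent region r, so r is a parameter too.
task sum_all(r : region(ispace(int1d), int),
             p : partition(disjoint, r, ispace(int1d)))
where reads(r)
do
  var t = 0
  for i in r.ispace do
    t = t + r[i]
  end
  return t
end
```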
Can be called as:
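(Assuming `r` and `p` from the sketch above.)

```
  var total = sum_all(r, p)  -- pass both the parent region and the partition
```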
Partitions, unlike regions, cannot take privileges. If you need privileges on the subregions of a partition, specify them on the corresponding parent region.
We can now use partitions to build a data-parallel version of DAXPY. Fortunately, we’ve already done most of the work in the previous version of the code. The tasks, in particular, do not need to be modified at all. While originally intended to operate on the entire input and output regions defined in the program, they’ll work fine on subsets of the data as well. Thus, all we need to do is change `main` to set up the partitions, and create new loops to launch the tasks.
(It is not always the case that tasks written without partitioning in mind will work seamlessly with partitioning, but it is true to a surprising degree.)
Recall, from before, that we have two regions we need to be concerned with: `input_lr` and `output_lr`. The code for these regions is reproduced below.
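A sketch of that setup inside `main` (the element count and the field space `input`, with fields `x` and `y`, may differ from the earlier tutorial):

```
  var num_elements = 1024
  var is = ispace(int1d, num_elements)
  var input_lr = region(is, input)    -- input is an fspace with fields x and y
  var output_lr = region(is, double)
```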
To introduce data parallelism, we need to partition these regions. We’ll create a basic four-way partition to start. (In practice, one would normally want this to be configurable so that it can easily scale with the machine, but we’ll leave it fixed for now.) The `equal` operator makes it easy to set these partitions up.
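For example (the names `num_subregions` and `ps` are assumptions here):

```
  var num_subregions = 4
  var ps = ispace(int1d, num_subregions)          -- color space for the partitions
  var input_lp = partition(equal, input_lr, ps)
  var output_lp = partition(equal, output_lr, ps)
```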
At this point, we also want to change the task launches to use the partitions. For example, where we used to have one call to `daxpy` on the entire `input_lr` and `output_lr`, we’ll now have four calls, one for each piece of `input_lp` and `output_lp`.
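A sketch of the new launch loop, assuming the `daxpy` task from the earlier version takes an input region, an output region, and the scalar `alpha`:

```
  __demand(__index_launch)
  for i in ps do
    daxpy(input_lp[i], output_lp[i], alpha)
  end
```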
We also took the opportunity to mark the loop with `__demand(__index_launch)`. This instructs the compiler to verify that the loop is eligible to be executed in parallel.
In fact, because each of the partitions is disjoint, the tasks do indeed run in parallel. This gives us the parallel implementation of DAXPY.
The final code (shown below) is now written in a form that is parallel and would run distributed if we had an appropriate machine to run on. For DAXPY, there’s not much more to do. However, many applications involve data access patterns that are more complicated than DAXPY (for example, halo or ghost cell exchanges). We’ll consider those applications in a future tutorial.
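A sketch of the complete program, assembled from the pieces above (the `init` and `check` tasks and the specific constants are illustrative, and details may differ from the earlier tutorial’s code):

```
import "regent"

local c = regentlib.c

fspace input {
  x : double,
  y : double
}

task init(input_lr : region(ispace(int1d), input))
where writes(input_lr)
do
  for i in input_lr.ispace do
    input_lr[i].x = c.drand48()
    input_lr[i].y = c.drand48()
  end
end

task daxpy(input_lr : region(ispace(int1d), input),
           output_lr : region(ispace(int1d), double),
           alpha : double)
where reads(input_lr), writes(output_lr)
do
  for i in input_lr.ispace do
    output_lr[i] = alpha*input_lr[i].x + input_lr[i].y
  end
end

task check(input_lr : region(ispace(int1d), input),
           output_lr : region(ispace(int1d), double),
           alpha : double)
where reads(input_lr, output_lr)
do
  for i in input_lr.ispace do
    -- Recompute the expected value and compare.
    var expected = alpha*input_lr[i].x + input_lr[i].y
    regentlib.assert(output_lr[i] == expected, "check failed")
  end
end

task main()
  var num_elements = 1024
  var num_subregions = 4

  var is = ispace(int1d, num_elements)
  var input_lr = region(is, input)
  var output_lr = region(is, double)

  -- Four-way equal partitions of the input and output regions.
  var ps = ispace(int1d, num_subregions)
  var input_lp = partition(equal, input_lr, ps)
  var output_lp = partition(equal, output_lr, ps)

  __demand(__index_launch)
  for i in ps do
    init(input_lp[i])
  end

  var alpha = c.drand48()

  __demand(__index_launch)
  for i in ps do
    daxpy(input_lp[i], output_lp[i], alpha)
  end

  check(input_lr, output_lr, alpha)
end

regentlib.start(main)
```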