.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_tutorial_06_memory.py:

Memory Customization
====================

**Author**: Yi-Hsiang Lai (seanlatias@github)

In this tutorial, we demonstrate how memory customization works in HeteroCL.

.. code-block:: default


    import heterocl as hcl
    import numpy as np


Memory Customization in HeteroCL
--------------------------------

There are two types of memory customization in HeteroCL. The first type is
similar to what we have seen in
:ref:`sphx_glr_tutorials_tutorial_04_compute.py`, where the primitives are
synthesized as pragmas. One such primitive is ``partition``, shown in the
example below. Note that the primitive is applied directly to the schedule
instead of a stage, because we are modifying a property of a tensor.

.. code-block:: default


    hcl.init()

    A = hcl.placeholder((10, 10), "A")

    def kernel(A):
        return hcl.compute(A.shape, lambda x, y: A[x][y]+1, "B")

    s = hcl.create_schedule(A, kernel)
    s.partition(A)
    print(hcl.lower(s))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    array partition variable=A complete factor=0 dim=0
    // attr [_top] storage_scope = "global"
    allocate _top[int32 * 1]
    produce _top {
      // attr [0] extern_scope = 0
      produce B {
        // attr [0] extern_scope = 0
        for (x, 0, 10) {
          for (y, 0, 10) {
            B[(y + (x*10))] = (A[(y + (x*10))] + 1)
          }
        }
      }
    }


In the IR, we should see a line annotating that tensor ``A`` is partitioned
completely.

.. note::

   For more information, please see
   :obj:`heterocl.schedule.Schedule.partition`

Another example is reshaping a tensor, which is helpful when we combine
partitioning with loop tiling. In this example, we split the inner axis ``y``
and also reshape the output tensor ``B``. After that, we pipeline the middle
axis ``yo`` and partition the output tensor accordingly. **Note that the**
``reshape`` **primitive cannot be applied to the input tensors.**

.. code-block:: default


    hcl.init()

    s = hcl.create_schedule(A, kernel)
    yo, yi = s[kernel.B].split(kernel.B.axis[1], 5)
    s[kernel.B].pipeline(yo)
    s.reshape(kernel.B, (10, 2, 5))
    s.partition(kernel.B, dim=3)
    print(hcl.build(s, target="vhls"))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    #include
    #include
    #include
    #include
    #include
    void default_function(ap_int<32> A[10*10], ap_int<32> B[10*2*5]) {
      ap_int<32> _top;
      #pragma HLS array_partition variable=B complete dim=3
      for (ap_int<32> x = 0; x < 10; ++x) {
        for (ap_int<32> y_outer = 0; y_outer < 2; ++y_outer) {
        #pragma HLS pipeline
          for (ap_int<32> y_inner = 0; y_inner < 5; ++y_inner) {
            B[((y_inner + (y_outer * 5)) + (x * 10))] = (A[((y_inner + (y_outer * 5)) + (x * 10))] + 1);
          }
        }
      }
    }


Data Reuse in HeteroCL
----------------------

The other type of memory customization primitive involves the allocation of
new memory buffers. One example is data reuse. The idea of data reuse is to
reduce the number of accesses to a tensor by introducing an intermediate
buffer that holds the values being reused across different iterations, which
ultimately leads to better performance in hardware.

Example: 2D Convolution
-----------------------

To demonstrate this, we use the computation of 2D convolution as an example.
Let's see how we can define 2D convolution in HeteroCL.
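As a point of reference, the same computation written directly in NumPy looks
as follows. This is a minimal illustrative sketch, independent of the HeteroCL
flow, and the helper name ``conv2d_valid`` is ours.

.. code-block:: python

    import numpy as np

    def conv2d_valid(A, F):
        """Reference 2D convolution as used in this tutorial (no filter flip):
        B[y, x] = sum over r, c of A[y+r, x+c] * F[r, c]."""
        H, W = A.shape
        k = F.shape[0]
        B = np.zeros((H - k + 1, W - k + 1), dtype=A.dtype)
        for y in range(B.shape[0]):
            for x in range(B.shape[1]):
                # each output pixel reads a full k x k window of the input
                B[y, x] = np.sum(A[y:y+k, x:x+k] * F)
        return B

The HeteroCL kernel below expresses exactly this computation, with the
reduction written using ``hcl.reduce_axis`` and ``hcl.sum``.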
.. code-block:: default


    hcl.init()

    A = hcl.placeholder((6, 6), "A")
    F = hcl.placeholder((3, 3), "F")

    def kernel(A, F):
        r = hcl.reduce_axis(0, 3)
        c = hcl.reduce_axis(0, 3)
        return hcl.compute((4, 4), lambda y, x: hcl.sum(A[y+r, x+c]*F[r, c], axis=[r, c]), "B")

    s = hcl.create_schedule([A, F], kernel)
    print(hcl.lower(s))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    // attr [_top] storage_scope = "global"
    allocate _top[int32 * 1]
    produce _top {
      // attr [0] extern_scope = 0
      produce B {
        // attr [0] extern_scope = 0
        for (y, 0, 4) {
          for (x, 0, 4) {
            // attr [sum] storage_scope = "global"
            allocate sum[int32 * 1]
            produce sum {
              // attr [0] extern_scope = 0
              sum[0] = 0
            }
            for (ra0, 0, 3) {
              for (ra1, 0, 3) {
                sum[0] = int32((int65((int64(A[((x + ra1) + ((y + ra0)*6))])*int64(F[(ra1 + (ra0*3))]))) + int65(sum[0])))
              }
            }
            B[(x + (y*4))] = sum[0]
          }
        }
      }
    }


In the above example, we convolve the input tensor ``A`` with a filter ``F``
and store the output in tensor ``B``. Note that the output shape is different
from the shape of the input tensor. Now let's give it some real inputs.

.. code-block:: default


    hcl_A = hcl.asarray(np.random.randint(0, 10, A.shape))
    hcl_F = hcl.asarray(np.random.randint(0, 10, F.shape))
    hcl_B = hcl.asarray(np.zeros((4, 4)))
    f = hcl.build(s)
    f(hcl_A, hcl_F, hcl_B)
    print('Input:')
    print(hcl_A)
    print('Filter:')
    print(hcl_F)
    print('Output:')
    print(hcl_B)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Input:
    [[4 7 7 0 1 7]
     [8 8 5 6 2 2]
     [9 7 5 0 4 6]
     [8 7 6 9 1 5]
     [4 9 6 6 3 2]
     [6 8 6 3 2 0]]
    Filter:
    [[1 8 9]
     [6 9 1]
     [9 1 5]]
    Output:
    [[361 230 167 161]
     [348 303 173 189]
     [302 269 234 221]
     [321 343 247 156]]


To analyze the data reuse, let's take a closer look at the generated IR. To
begin with, we can see that in two consecutive iterations of ``x`` (i.e., the
inner loop), 6 of the accessed pixels overlap, as illustrated in the figure
below. Without any optimization, **we read 9 values from the input in each
iteration**.

.. figure:: ../../../tutorials/moving_x/Slide1.png
   :scale: 60 %

Introduce Data Reuse: Window Buffer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To reuse the overlapped pixels, we can introduce a reuse buffer. Since the
filter moves like a window, we call this buffer a window buffer ``WB``. The
window buffer stores the reused pixels as well as the new pixels used in the
current iteration. To update the window buffer in each iteration, the last
two columns (in this case) shift left, and the last column is then replaced
with the pixels read from the input. Now we only **read 3 values from the
input in each iteration**.

.. figure:: ../../../tutorials/moving_x/Slide2.png
   :scale: 60 %
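The following pure-Python sketch (illustrative only, not HeteroCL code) makes
the window-buffer update explicit: shift the 3x3 window one column to the
left, then fill its last column with the three new input pixels.

.. code-block:: python

    import numpy as np

    def update_window(WB, A, y, x):
        """Shift the 3x3 window buffer left and load one new input column."""
        WB[:, 0:2] = WB[:, 1:3]   # the last two columns shift left
        WB[:, 2] = A[y:y+3, x]    # read only 3 new pixels from the input
        return WB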
To introduce such reuse buffers in HeteroCL, we use the ``reuse_at`` API. The
first argument is the **tensor** whose values will be reused. The second
argument is the output **stage** that reuses the values of the tensor; we
need to specify this because multiple stages may reuse values from the same
input tensor. The third argument is the axis along which to reuse, which must
be an axis of the output. Finally, we can specify the name of the reuse
buffer. The API returns a new tensor.

.. code-block:: default


    s_x = hcl.create_schedule([A, F], kernel)
    WB = s_x.reuse_at(A, s_x[kernel.B], kernel.B.axis[1], "WB")
    print(hcl.lower(s_x))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    // attr [_top] storage_scope = "global"
    allocate _top[int32 * 1]
    produce _top {
      // attr [0] extern_scope = 0
      produce B {
        // attr [WB] storage_scope = "global"
        allocate WB[int32 * 3 * 3]
        // attr [0] extern_scope = 0
        for (y, 0, 4) {
          for (x.reuse, 0, 6) {
            produce WB {
              for (A.1, 0, 3) {
                for (A.0, 0, 2) {
                  WB[(A.0 + (A.1*3))] = WB[((A.0 + (A.1*3)) + 1)]
                }
                WB[((A.1*3) + 2)] = A[(x.reuse + ((y + A.1)*6))]
              }
            }
            if ((2 <= x.reuse)) {
              // attr [sum] storage_scope = "global"
              allocate sum[int32 * 1]
              produce sum {
                // attr [0] extern_scope = 0
                sum[0] = 0
              }
              for (ra2, 0, 3) {
                for (ra3, 0, 3) {
                  sum[0] = int32((int65((int64(WB[(ra3 + (ra2*3))])*int64(F[(ra3 + (ra2*3))]))) + int65(sum[0])))
                }
              }
              B[((x.reuse + (y*4)) + -2)] = sum[0]
            }
          }
        }
      }
    }


In the printed IR, you should see a buffer ``WB`` of size (3, 3) being
allocated. Moreover, in the ``produce WB`` scope, you can see the update
algorithm described above. Now let's test the function again.

.. code-block:: default


    hcl_Bx = hcl.asarray(np.zeros((4, 4)))
    f = hcl.build(s_x)
    f(hcl_A, hcl_F, hcl_Bx)
    print('Output without WB:')
    print(hcl_B)
    print('Output with WB:')
    print(hcl_Bx)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Output without WB:
    [[361 230 167 161]
     [348 303 173 189]
     [302 269 234 221]
     [321 343 247 156]]
    Output with WB:
    [[361 230 167 161]
     [348 303 173 189]
     [302 269 234 221]
     [321 343 247 156]]


You should see the same results with and without the window buffer.

Reuse at a Different Dimension: Linebuffer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Similarly, we can create a reuse buffer for two consecutive iterations of
``y``. In this case, in each iteration of ``y`` we read an entire row from
input ``A`` and update the reuse buffer by shifting it up. Since it reads an
entire line at a time, we call it a linebuffer ``LB``. The operation is
illustrated in the figure below.

.. figure:: ../../../tutorials/moving_x/Slide3.png
   :scale: 60 %

Similar to the window buffer, we can introduce the linebuffer in HeteroCL
with a single ``reuse_at`` call. The code is shown below.

.. code-block:: default


    s_y = hcl.create_schedule([A, F], kernel)
    LB = s_y.reuse_at(A, s_y[kernel.B], kernel.B.axis[0], "LB")
    print(hcl.lower(s_y))

    hcl_By = hcl.asarray(np.zeros((4, 4)))
    f = hcl.build(s_y)
    f(hcl_A, hcl_F, hcl_By)
    print('Output without LB:')
    print(hcl_B)
    print('Output with LB:')
    print(hcl_By)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    // attr [_top] storage_scope = "global"
    allocate _top[int32 * 1]
    produce _top {
      // attr [0] extern_scope = 0
      produce B {
        // attr [LB] storage_scope = "global"
        allocate LB[int32 * 3 * 6]
        // attr [0] extern_scope = 0
        for (y.reuse, 0, 6) {
          produce LB {
            for (A.0, 0, 6) {
              for (A.1, 0, 2) {
                LB[(A.0 + (A.1*6))] = LB[((A.0 + (A.1*6)) + 6)]
              }
              LB[(A.0 + 12)] = A[(A.0 + (y.reuse*6))]
            }
          }
          for (x, 0, 4) {
            if ((2 <= y.reuse)) {
              // attr [sum] storage_scope = "global"
              allocate sum[int32 * 1]
              produce sum {
                // attr [0] extern_scope = 0
                sum[0] = 0
              }
              for (ra4, 0, 3) {
                for (ra5, 0, 3) {
                  sum[0] = int32((int65((int64(LB[((x + ra5) + (ra4*6))])*int64(F[(ra5 + (ra4*3))]))) + int65(sum[0])))
                }
              }
              B[((x + (y.reuse*4)) + -8)] = sum[0]
            }
          }
        }
      }
    }
    Output without LB:
    [[361 230 167 161]
     [348 303 173 189]
     [302 269 234 221]
     [321 343 247 156]]
    Output with LB:
    [[361 230 167 161]
     [348 303 173 189]
     [302 269 234 221]
     [321 343 247 156]]


Note that the difference between WB and LB is that we reuse along different
axes.
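As an illustrative Python sketch (again, not HeteroCL code), the linebuffer
update shifts whole rows instead of columns:

.. code-block:: python

    import numpy as np

    def update_linebuffer(LB, A, y):
        """Shift the 3x6 linebuffer up and load one new input row."""
        LB[0:2, :] = LB[1:3, :]   # the last two rows shift up
        LB[2, :] = A[y, :]        # read an entire row (6 pixels) from the input
        return LB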
We can also see from the printed IR that the allocated size of ``LB`` is
larger, which matches the figure above. In this case, we read 6 pixels from
the input in each iteration of ``y``, which amounts to 1 pixel per iteration
of ``x`` **on average**; it is not literally true during hardware execution,
since the entire row is read at once. Can we do even better?

Combine Window Buffer and Linebuffer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We do not need to restrict ourselves to reusing along a single dimension.
Since we have data reuse in both dimensions, we can reuse both. In this case,
we generate both a linebuffer and a window buffer. Let's take a look at the
figure first.

.. figure:: ../../../tutorials/moving_x/Slide4.png
   :scale: 60 %

What happens here is that we first update the linebuffer (blue arrows), then
we update the window buffer (purple arrows). More precisely, for each
iteration of ``x``, we **read 1 pixel** from input ``A`` and shift the
linebuffer up at the same time. After the linebuffer is updated, we update
the window buffer by shifting it and reading the pixels just written to the
linebuffer.

Describing this behavior in HeteroCL is easy: we apply ``reuse_at`` twice,
specifying the corresponding reuse tensors and reuse axes. In this case, the
linebuffer reuses the pixels from the input ``A``, while the window buffer
reuses from the linebuffer. The code and its IR are shown below.

.. code-block:: default


    s_xy = hcl.create_schedule([A, F], kernel)
    LB = s_xy.reuse_at(A, s_xy[kernel.B], kernel.B.axis[0], "LB")
    WB = s_xy.reuse_at(LB, s_xy[kernel.B], kernel.B.axis[1], "WB")
    print(hcl.lower(s_xy))

    hcl_Bxy = hcl.asarray(np.zeros((4, 4)))
    f = hcl.build(s_xy)
    f(hcl_A, hcl_F, hcl_Bxy)
    print('Output without reuse buffers:')
    print(hcl_B)
    print('Output with reuse buffers:')
    print(hcl_Bxy)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    // attr [_top] storage_scope = "global"
    allocate _top[int32 * 1]
    produce _top {
      // attr [0] extern_scope = 0
      produce B {
        // attr [LB] storage_scope = "global"
        allocate LB[int32 * 3 * 6]
        // attr [WB] storage_scope = "global"
        allocate WB[int32 * 3 * 3]
        // attr [0] extern_scope = 0
        for (y.reuse, 0, 6) {
          for (x.reuse, 0, 6) {
            produce LB {
              for (A.1, 0, 2) {
                LB[(x.reuse + (A.1*6))] = LB[((x.reuse + (A.1*6)) + 6)]
              }
              LB[(x.reuse + 12)] = A[(x.reuse + (y.reuse*6))]
            }
            if ((2 <= y.reuse)) {
              produce WB {
                for (LB.1, 0, 3) {
                  for (LB.0, 0, 2) {
                    WB[(LB.0 + (LB.1*3))] = WB[((LB.0 + (LB.1*3)) + 1)]
                  }
                  WB[((LB.1*3) + 2)] = LB[(x.reuse + (LB.1*6))]
                }
              }
              if ((2 <= x.reuse)) {
                // attr [sum] storage_scope = "global"
                allocate sum[int32 * 1]
                produce sum {
                  // attr [0] extern_scope = 0
                  sum[0] = 0
                }
                for (ra6, 0, 3) {
                  for (ra7, 0, 3) {
                    sum[0] = int32((int65((int64(WB[(ra7 + (ra6*3))])*int64(F[(ra7 + (ra6*3))]))) + int65(sum[0])))
                  }
                }
                B[((x.reuse + (y.reuse*4)) + -10)] = sum[0]
              }
            }
          }
        }
      }
    }
    Output without reuse buffers:
    [[361 230 167 161]
     [348 303 173 189]
     [302 269 234 221]
     [321 343 247 156]]
    Output with reuse buffers:
    [[361 230 167 161]
     [348 303 173 189]
     [302 269 234 221]
     [321 343 247 156]]


We can see from the IR that the allocation sizes are indeed as expected.
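As a sanity check on the update order described above, here is a pure-Python
simulation of the combined scheme (illustrative only, independent of
HeteroCL; the helper name ``conv2d_with_reuse`` is ours). It reads exactly
one input pixel per ``(y, x)`` iteration.

.. code-block:: python

    import numpy as np

    def conv2d_with_reuse(A, F):
        """Simulate the linebuffer + window buffer scheme for a k x k filter."""
        H, W = A.shape
        k = F.shape[0]
        B = np.zeros((H - k + 1, W - k + 1), dtype=A.dtype)
        LB = np.zeros((k, W), dtype=A.dtype)   # linebuffer: k rows of the input
        WB = np.zeros((k, k), dtype=A.dtype)   # window buffer: k x k window
        for y in range(H):
            for x in range(W):
                # update one linebuffer column: shift up, read 1 new pixel
                LB[0:k-1, x] = LB[1:k, x]
                LB[k-1, x] = A[y, x]
                if y >= k - 1:
                    # update the window buffer: shift left, read one LB column
                    WB[:, 0:k-1] = WB[:, 1:k]
                    WB[:, k-1] = LB[:, x]
                    if x >= k - 1:
                        B[y - (k - 1), x - (k - 1)] = np.sum(WB * F)
        return B

With the 6x6 input and 3x3 filter used above, this produces the same (4, 4)
output as the direct computation.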
Further Optimization
~~~~~~~~~~~~~~~~~~~~

To further optimize the design, we need to think more carefully. For each
iteration of ``x``, three pixels in LB are read and written simultaneously.
Thus, to maximize the memory bandwidth, we need to partition LB in the row
direction. For WB, all pixels are updated at the same time, so we need to
partition the whole WB completely. Finally, we can pipeline the whole design.
Don't forget that we also need to partition the filter ``F``.

.. code-block:: default


    s_final = hcl.create_schedule([A, F], kernel)
    LB = s_final.reuse_at(A, s_final[kernel.B], kernel.B.axis[0], "LB")
    WB = s_final.reuse_at(LB, s_final[kernel.B], kernel.B.axis[1], "WB")
    s_final.partition(LB, dim=1)
    s_final.partition(WB)
    s_final.partition(F)
    s_final[kernel.B].pipeline(kernel.B.axis[1])
    print(hcl.lower(s_final))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    array partition variable=F complete factor=0 dim=0
    // attr [_top] storage_scope = "global"
    allocate _top[int32 * 1]
    produce _top {
      // attr [0] extern_scope = 0
      produce B {
        // attr [LB] storage_scope = "global"
        allocate LB[int32 * 3 * 6]
        array partition variable=LB complete factor=0 dim=1
        // attr [WB] storage_scope = "global"
        allocate WB[int32 * 3 * 3]
        array partition variable=WB complete factor=0 dim=0
        // attr [0] extern_scope = 0
        for (y.reuse, 0, 6) {
          pipelined "initiation_interval"=1 (x.reuse, 0, 6) {
            produce LB {
              for (A.1, 0, 2) {
                LB[(x.reuse + (A.1*6))] = LB[((x.reuse + (A.1*6)) + 6)]
              }
              LB[(x.reuse + 12)] = A[(x.reuse + (y.reuse*6))]
            }
            if ((2 <= y.reuse)) {
              produce WB {
                for (LB.1, 0, 3) {
                  for (LB.0, 0, 2) {
                    WB[(LB.0 + (LB.1*3))] = WB[((LB.0 + (LB.1*3)) + 1)]
                  }
                  WB[((LB.1*3) + 2)] = LB[(x.reuse + (LB.1*6))]
                }
              }
              if ((2 <= x.reuse)) {
                // attr [sum] storage_scope = "global"
                allocate sum[int32 * 1]
                produce sum {
                  // attr [0] extern_scope = 0
                  sum[0] = 0
                }
                for (ra8, 0, 3) {
                  for (ra9, 0, 3) {
                    sum[0] = int32((int65((int64(WB[(ra9 + (ra8*3))])*int64(F[(ra9 + (ra8*3))]))) + int65(sum[0])))
                  }
                }
                B[((x.reuse + (y.reuse*4)) + -10)] = sum[0]
              }
            }
          }
        }
      }
    }


Finally, we can generate the HLS code and see if the II is indeed 1.

.. code-block:: default


    f = hcl.build(s_final, target="vhls")


Following is a sample report from Vivado HLS.

.. code::

    + Latency (clock cycles):
        * Summary:
        +-----+-----+-----+-----+---------+
        |  Latency  |  Interval | Pipeline|
        | min | max | min | max |   Type  |
        +-----+-----+-----+-----+---------+
        |   42|   42|   43|   43|   none  |
        +-----+-----+-----+-----+---------+

        + Detail:
            * Instance:
            N/A

            * Loop:
            +----------+-----+-----+----------+-----------+-----------+------+----------+
            |          |  Latency  | Iteration|  Initiation Interval  | Trip |          |
            | Loop Name| min | max |  Latency |  achieved |   target  | Count| Pipelined|
            +----------+-----+-----+----------+-----------+-----------+------+----------+
            |- Loop 1  |   40|   40|         6|          1|          1|    36|   yes    |
            +----------+-----+-----+----------+-----------+-----------+------+----------+

Limitations
-----------

The following cases are not supported when using reuse buffers in HeteroCL:

1. Non-linear index patterns, e.g., ``y*y+c``, ``y*(y+c)``.
2. Strides other than one, e.g., ``2*y+c``.
3. Access patterns with no overlapped pixels between two consecutive
   iterations of the specified axis, e.g., ``[x+r, y]`` when reusing along
   ``y``.

A sketch of the second case is shown below.
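The following hypothetical kernel (illustrative only; ``kernel_strided`` is
not part of this tutorial) accesses the input with stride two, so ``reuse_at``
cannot be applied to it.

.. code-block:: python

    import heterocl as hcl

    hcl.init()
    A = hcl.placeholder((10,), "A")

    def kernel_strided(A):
        # The index pattern 2*x + r has stride two along x (limitation 2),
        # so no reuse buffer can be inferred for this stage.
        r = hcl.reduce_axis(0, 3)
        return hcl.compute((4,), lambda x: hcl.sum(A[2*x + r], axis=r), "B")

    s = hcl.create_schedule(A, kernel_strided)
    # s.reuse_at(A, s[kernel_strided.B], kernel_strided.B.axis[0], "RB")  # not supported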
More Examples: 2D Image Blur
----------------------------

HeteroCL is also able to infer reuse buffers for explicitly expanded
reduction operations. Namely, instead of using ``hcl.sum``, we can write out
the compute pattern directly. Following is an example of a 2D blur.

.. code-block:: default


    hcl.init()

    A = hcl.placeholder((10, 10), "A")

    def kernel_blur(A):
        return hcl.compute((8, 8), lambda y, x: A[y, x] + A[y+1, x+1] + A[y+2, x+2], "B")

    s_blur = hcl.create_schedule(A, kernel_blur)
    B = kernel_blur.B
    RB_y = s_blur.reuse_at(A, s_blur[B], B.axis[0], "RB_y")
    RB_x = s_blur.reuse_at(RB_y, s_blur[B], B.axis[1], "RB_x")
    print(hcl.lower(s_blur))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    // attr [_top] storage_scope = "global"
    allocate _top[int32 * 1]
    produce _top {
      // attr [0] extern_scope = 0
      produce B {
        // attr [RB_y] storage_scope = "global"
        allocate RB_y[int32 * 3 * 10]
        // attr [RB_x] storage_scope = "global"
        allocate RB_x[int32 * 3 * 3]
        // attr [0] extern_scope = 0
        for (y.reuse, 0, 10) {
          for (x.reuse, 0, 10) {
            produce RB_y {
              for (A.1, 0, 2) {
                RB_y[(x.reuse + (A.1*10))] = RB_y[((x.reuse + (A.1*10)) + 10)]
              }
              RB_y[(x.reuse + 20)] = A[(x.reuse + (y.reuse*10))]
            }
            if ((2 <= y.reuse)) {
              produce RB_x {
                for (RB_y.1, 0, 3) {
                  for (RB_y.0, 0, 2) {
                    RB_x[(RB_y.0 + (RB_y.1*3))] = RB_x[((RB_y.0 + (RB_y.1*3)) + 1)]
                  }
                  RB_x[((RB_y.1*3) + 2)] = RB_y[(x.reuse + (RB_y.1*10))]
                }
              }
              if ((2 <= x.reuse)) {
                B[((x.reuse + (y.reuse*8)) + -18)] = int32((int34((int33(RB_x[0]) + int33(RB_x[4]))) + int34(RB_x[8])))
              }
            }
          }
        }
      }
    }


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 0.431 seconds)