1D dilation performance comparison #198

elgw · 2025-03-20T16:16:03Z

elgw
Mar 20, 2025

Hello!

I lazily tested the 1D dilation performance using convex binary structuring elements, like [1] (w=1), [1,1,1] (w=3), etc., and thought that I should share what I found. Included were

imdilate from MATLAB
dip::Dilation
and some code of my own.

implementation details

In MATLAB I timed with:

x = rand(n, 1, 'single');
tic
y = imdilate(x, ones(w, 1));
toc

DIPlib was used like this:

dip::Image img( { n, 1 }, 1, dip::DT_SFLOAT );
dip::GaussianNoise( img, img, rndGen, 400.0 );
dip::StructuringElement se = dip::StructuringElement({w, 0}, "fast line");
tic
dip::Dilation( img, out, se);
toc

Results

Observations

The imdilate behavior of increasing performance with larger structuring elements was a surprise :) Now that my code also has that characteristic I can see why.

It looks like w=3 is faster than w=1, most likely it is a benchmarking artefact (should have done it properly from the beginning).

For small sizes, especially w=3, I would guess that brute force is used. The results indicate that both MATLAB/imdilate and DIPlib/"discrete line" can benefit from delaying the switch from brute force to something more clever. My implementation found the sweet spot around 19 (most likely machine dependent).

My current implementation uses a circular buffer with a pointer to the max element, which I update when a larger value is inserted or when the current maximum goes out of scope (which triggers w comparisons). The average performance is good, but the worst case (monotonically ascending input values) degrades the performance to O(n*w). I guess that I can find the theoretical performance estimates of the methods in DIPlib if I follow the references.

Cheers,
Erik

crisluengo · 2025-03-20T17:31:55Z

crisluengo
Mar 20, 2025
Maintainer

1 reply

elgw Mar 20, 2025
Author

Thanks! The performance from "rectangular" and "fast line" seems identical, there must be a separate code path when the "fast line" is aligned to the axis.

crisluengo · 2025-03-20T21:17:27Z

crisluengo
Mar 20, 2025
Maintainer

I was wrong, 'fast line' defers to 'rectangular' if the line goes along an image axis.

'rectangular' and 'fast line' use an O(1) algorithm along each axis. This algorithm requires two buffers of the same size as the image line. Having a very long image line, as in your case, increases the cost. The break point for switching from brute force to this algorithm is much lower for "normal" image sizes. I don't know how to pick a value that is always right.

If you set n = 22e3 instead of 22e6, you will see a very different plot.
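For readers who have not seen it, the two-buffer idea can be sketched roughly as below. This is a minimal illustration of the van Herk / Gil-Werman scheme under my own assumptions, not DIPlib's actual code: the function name `hgw_dilate_1d`, the helper names, the odd-width restriction and the "ignore samples outside the line" border handling are all hypothetical choices made for the example.

```c
#include <assert.h>
#include <float.h>
#include <stdlib.h>

/* Local max helper, so the sketch does not depend on libm. */
static float maxf(float a, float b) { return a > b ? a : b; }

/* Read the line as if it were padded with r "minus infinity" samples on
 * each side (border handling is an assumption of this sketch). */
static float padded(const float *x, size_t n, size_t r, size_t j)
{
    return (j < r || j >= r + n) ? -FLT_MAX : x[j - r];
}

/* Running max of x[0..n-1] with a centered flat window of odd width w.
 * Returns 0 on success, -1 on bad arguments or allocation failure. */
int hgw_dilate_1d(const float *x, float *y, size_t n, size_t w)
{
    assert(x != NULL && y != NULL);
    if (n == 0 || w % 2 == 0) return -1;      /* also rejects w == 0 */
    const size_t r = w / 2;
    const size_t m = n + 2 * r;               /* padded line length */

    /* The two buffers, each as long as the (padded) image line. */
    float *g = malloc(m * sizeof *g);
    float *h = malloc(m * sizeof *h);
    if (g == NULL || h == NULL) { free(g); free(h); return -1; }

    /* g: running max from the start of each block of w samples. */
    for (size_t j = 0; j < m; j++) {
        const float v = padded(x, n, r, j);
        g[j] = (j % w == 0) ? v : maxf(g[j - 1], v);
    }
    /* h: running max towards the start, restarted at each block end. */
    for (size_t j = m; j-- > 0; ) {
        const float v = padded(x, n, r, j);
        h[j] = (j == m - 1 || (j + 1) % w == 0) ? v : maxf(h[j + 1], v);
    }
    /* A window [i, i+w-1] in padded coordinates touches at most two
     * blocks: h covers the part in the first, g the part in the second. */
    for (size_t i = 0; i < n; i++) {
        y[i] = maxf(h[i], g[i + w - 1]);
    }
    free(g);
    free(h);
    return 0;
}
```

The cost per sample is three comparisons regardless of w, but both buffers are streamed over the full line, which is where a 22e6-sample line starts to hurt.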

The 'discrete line' algorithm is the same as used for 'elliptic' and custom shapes. This algorithm splits the SE into runs along the axis where you get fewest runs (a run is a contiguous set of pixels). For your use of this method you get one run. It then keeps a number of steps for the max value within each run. Each time the SE is shifted right, the number of steps is reduced by one. When this value reaches zero, the max value has exited the SE, and the new max for the run is found by brute force. Also, for every step, the new value that comes in is examined to see if it will be the new max (which is the case also if it's equal in value to the current max).
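For the single-run case, the bookkeeping described above could look roughly like this. It is a simplified sketch and not the DIPlib source: the name `countdown_dilate_1d`, the odd-width restriction and the clipping of the window at the line ends are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Running max with a centered window of odd width w, tracking only the
 * current max and the number of shifts it remains inside the window. */
void countdown_dilate_1d(const float *x, float *y, long n, long w)
{
    assert(x != NULL && y != NULL && w >= 1 && w % 2 == 1);
    const long r = w / 2;
    float maxval = 0.0f;
    long steps = 0;   /* 0 forces a brute-force scan for the first output */

    for (long i = 0; i < n; i++) {
        const long lo = (i - r < 0) ? 0 : i - r;
        const long hi = (i + r > n - 1) ? n - 1 : i + r;

        if (steps == 0) {
            /* The max has left the window: rescan it by brute force,
             * keeping the rightmost max so that it survives longest. */
            long pos = lo;
            maxval = x[lo];
            for (long j = lo + 1; j <= hi; j++) {
                if (x[j] >= maxval) { maxval = x[j]; pos = j; }
            }
            steps = pos - (i - r) + 1;
        } else if (i + r < n && x[i + r] >= maxval) {
            /* The incoming sample is the new max (ties count too) and
             * stays inside for a full window width. */
            maxval = x[i + r];
            steps = 2 * r + 1;
        }
        y[i] = maxval;
        steps--;      /* shifting right brings the max one step closer to the exit */
    }
}
```

A rescan happens whenever the tracked max leaves the window, so inputs where that occurs at every shift (strictly decreasing values, for this left-to-right scan) cost O(n*w), while most other inputs stay close to one or two comparisons per sample.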

This is thus very similar to your method, but without keeping a buffer of the values. I don't know why your implementation is so much faster though. Maybe it's just that the DIPlib code works for any number of runs, not just one? Wouldn't mind comparing.

6 replies

crisluengo Mar 21, 2025
Maintainer

Thanks for the code! Very neat implementation.

I guess the DIPlib 'rectangular' code should split very long image lines into chunks.

elgw Mar 26, 2025
Author

Just when I thought that I was done, I felt the urge to implement yet another method.

First I picked

Petr Dokládal, Eva Dokladalova. Computationally efficient, one-pass algorithm for morphological
filters. Journal of Visual Communication and Image Representation, 2011, 22 (5), pp.411–420.
10.1016/j.jvcir.2011.03.005.

Algorithmically it should be very efficient, but the performance that I could get out of it was poor. The while loop (I could not figure out how to eliminate it) is probably detrimental for modern CPUs -- or my implementation was particularly sloppy.

Since I got a little disappointed, I wanted to try yet another one. This time I picked a method with constant computation time regardless of input (no while loops):

Coltuc, Dinu, and Philippe Bolon. "Very efficient implementation of max/min filters." NSIP. 1999

It uses 3 comparisons per pixel element regardless of the size of the kernel, $w$. The trend with increasing time with $w$ that you can see in the plot could be due to my code that handles the edges with brute force in combination with more memory to access with larger $w$.

The results indicate that this is useful! This time I used vectors of size 2024 for the benchmark, which is probably more relevant than what I used before.

The "vanilla" inner loop looks like this and updates $s=w-1$ elements:

static __attribute__((always_inline)) inline
void C3(const flo * restrict X, flo * restrict Y,
               flo * restrict A, flo * restrict D, const i64 s)
{
    A[0] = X[s];

    for(i64 k = 1; k < s; k++)
    {
        A[k] = sfmax(A[k-1], X[s+k]);
    }
    D[s-1] = X[s-1];
    for(i64 k = s-2; k >= 0; k--)
    {
        D[k] = sfmax(D[k+1], X[k]);
    }

    for(i64 k = 0; k < s; k++)
    {
        Y[k] = sfmax(A[k], D[k]);
    }
}

The array A does not need to be stored as the values can be generated on the fly (that is how I implemented it). As before there is a significant decrease in computation time if the inner loop is compiled with $s$ as a constant (not shown in the plot).
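Written out, the inner loop above splits every window of $w = s+1$ samples at the boundary between $X[s-1]$ and $X[s]$. This is only a restatement of the `C3` code, with $A$ and $D$ as defined there and $0 \le k < s$:

$$
D[k] = \max_{k \le j \le s-1} X[j], \qquad
A[k] = \max_{s \le j \le s+k} X[j], \qquad
Y[k] = \max\big(A[k], D[k]\big) = \max_{k \le j \le k+s} X[j].
$$

Each of $D$ and $A$ is built as a running max with about one comparison per element, and the merge adds one more, which is where the roughly 3 comparisons per element come from.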

The code implicitly assumes that the size of the structuring element is much smaller than the input data (in progress).

Full code

static __attribute__((always_inline)) inline
void C3_opt(const flo * restrict X, flo * restrict Y,
        flo * restrict D, const i64 s)
{

    D[s-1] = X[s-1];
    for(i64 k = s-2; k >= 0; k--)
    {
        D[k] = sfmax(D[k+1], X[k]);
    }
    flo AA = X[s]; /* seed with the first right-half sample; 0 would give wrong results for all-negative inputs */
    for(i64 k = 0; k < s; k++)
    {
        AA = sfmax(AA, X[s+k]);
        Y[k] = sfmax(AA, D[k]);
    }
}


flo *
FNAME(maxfilter1d_c3)(const flo * restrict X,
                      const i64 n,
                      const i64 w,
                      flo * restrict _Y)
{
    flo * restrict Y = _Y;
    if(Y == NULL)
    {
        Y = calloc(n, sizeof(flo));
    }
    if(Y == NULL)
    {
        return NULL;
    }

    if(w == 1)
    {
        for(i64 kk = 0; kk < n; kk++)
        {
            Y[kk] = X[kk];
        }
        return Y;
    }

    /* first part where the strel overlaps the boundary */
    if(1){
    for(i64 k = 0; k < (w-1)/2; k++)
    {
        i64 nuse = k+(w-1)/2+1;
        if(nuse > n)
        {
            nuse = n;
        }
        Y[k] = flo_array_max(X, nuse);
    }
    }

    const i64 s = w-1;
    flo D[s];
    flo A[s];

    for(i64 k = 0; k+2*s < n; k+=s)
    {
        //C3(X+k, Y+k + s/2 , A, D, s);
        C3_opt(X+k, Y+k + s/2, D, s);
    }

    {
        i64 last = n-2*s-1;
        C3(X+last, Y+last + s/2 , A, D, s);
    }

    /* last part */
    if(1)
    {
        for(i64 k = n - (w-1)/2-1; k < n; k++)
    {
        i64 startpos = k - (w-1)/2;
        Y[k] = flo_array_max(X+startpos, n-startpos);
    }
    }

    return Y;
}

crisluengo Mar 26, 2025
Maintainer

I think what that is doing is like the O(1) van Herk algorithm that is in DIPlib, but it creates only one section of the two buffers at a time. That must be good for cache usage. Really interesting! I didn't know about that paper! (Also, it's a paper from 1999, but it doesn't reference van Herk's 1992 paper, from which it borrows heavily.)

I have implemented a 1D opening and closing that was supposed to be much more efficient than doing separate dilation and erosion passes, but my implementation didn't match the performance shown in the paper:

diplib/src/morphology/one_dimensional.cpp

Line 697 in 21bb1ac

    
                       // Vincent Morard's algorithm: build a max-tree, which is pruned as it's build, what is left is the opening.

elgw Mar 28, 2025
Author

Yes, I agree. It seems identical to the van Herk algorithm, which is the same as the Gil-Werman algorithm from the following year? The difference is an implementation detail.

I briefly looked at the Gil, J Y and Kimmel, R paper as well. Although it shaves off a few comparisons I doubt that it can be faster than HGW in practice, the fact that there are no benchmarks in the paper contributes to that guess. The Dokládal & Dokladalova algorithm also looks good in theory but in practice it used about 5x the time compared to HGW.

I'm quite happy with the final performance once I fixed the handling of the edge cases; here again for inputs of size 2048.

Of course the comparison is not fair since I've hard coded for a stride of 1 while DIPlib has that as a parameter.

And I agree, it is more or less impossible to transfer benchmarks from old architectures to new machines. Surprises go in both directions :)

Hoping to find some time to look into the ND case in the future, although I have no needs beyond a 3D sphere at the moment.

By the way I think that the constructor of dip::Image should switch to 0-initialized memory, that is what I assumed until I got bitten. What would be the use of non-initialized memory?

crisluengo Mar 28, 2025
Maintainer

Nice! That’s quite a win!

I’m not going to have time to update the DIPlib implementation in the next two months or so, but will do this when I have time again. If you want to do it, that would be awesome.

A lot of functions in DIPlib create an image to fill it with data (all filters, file reading, the Create…() functions, etc.). There actually are not many cases where we need a zero-initialized image. Search for “Fill” in the library sources, you’ll see most uses are in testing. There are some cases of Fill(0) in the code, but also uses with a different value.

I guess we could have an equivalent of the MATLAB zeros(), which would be trivial to implement.

elgw · 2025-04-01T08:35:26Z

elgw
Apr 1, 2025
Author

That would be a nice addition, following the path of least surprise. As a bonus, calloc should be faster than malloc + memset.

The literature on structuring element decomposition is quite rich, so I picked something that I could implement quickly.

Urbach, E. R., & Wilkinson, M. H. F. (2008). Efficient 2-D
Grayscale Morphological Transformations With Arbitrary Flat
Structuring Elements. IEEE Transactions on Image Processing,
17(1), 1-8. https://doi.org/10.1109/TIP.2007.912582

I guess that you've seen it. In summary: it performs a line decomposition and has some tricks to keep the memory usage low. My implementation can only handle structuring elements where the lines are centered and of odd size. Against MATLAB it is quite performant, especially given the few lines of code needed:

In this case disk-0 is the "true" disk (`strel('disk', r, 0)`) while the 4- and 8- are the standard approximations. It is a big mystery why the disk-4 becomes slower than disk-8 around r=35.

If you implemented something from scratch for 2D + 3D, what would you pick? (I know that you always pick from the batch of ND methods).

When I've added a parameter for stride to my 1d erosion/dilation filter I'll see how to make it fit with DIPlib, but don't count on it. The table is quite full until summer.

1 reply

crisluengo Apr 1, 2025
Maintainer

I was first thinking of the decomposition of the SE into a small set of line segments, like the disk-4 and disk-8 in MATLAB do. In DIPlib we have the octagonal SE, which is an approximation to the disk using 4 lines at 4 different angles. There is a paper that describes how to decompose a disk into lines to obtain an arbitrarily accurate approximation. I never got around to implementing that; it has not been a priority. It is hard to know at what size the approximation is faster than the full SE (this depends on hardware).

I remember reading the Urbach paper when it came out. It requires a lot of intermediate memory. DIPlib’s implementation needs only the input and output images, no additional memory is used. This implementation also splits the SE into horizontal lines like Urbach does, but doesn’t use the O(1) algorithm on each line, instead it does as I explained in an earlier comment here. This works for arbitrarily complex SEs in an arbitrary number of dimensions. Best case it is O(H), worst case it is O(N) (N pixels total in the SE, split into H runs), in practice it is closer to O(H) than to O(N). Urbach’s approach should be fully O(H), but with O(H) additional memory.
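To make H and N concrete, here is a small sketch that splits a binary 2D SE mask into horizontal runs. It is an illustration only: the `Run` struct and `mask_to_runs` are hypothetical names, and unlike the description above it always splits along rows instead of choosing the axis that gives the fewest runs.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous set of SE pixels within a row (hypothetical layout). */
typedef struct { int row; int start; int length; } Run;

/* Split a row-major rows x cols binary mask into horizontal runs.
 * At most max_runs runs are stored; the return value is the total
 * number of runs found (H), which may exceed max_runs. */
size_t mask_to_runs(const unsigned char *mask, int rows, int cols,
                    Run *runs, size_t max_runs)
{
    assert(mask != NULL && rows >= 0 && cols >= 0);
    size_t count = 0;
    for (int r = 0; r < rows; r++) {
        int c = 0;
        while (c < cols) {
            if (!mask[(size_t)r * cols + c]) { c++; continue; }
            const int start = c;               /* a run begins here */
            while (c < cols && mask[(size_t)r * cols + c]) c++;
            if (count < max_runs) {
                runs[count] = (Run){ r, start, c - start };
            }
            count++;
        }
    }
    return count;
}
```

For a 5 x 5 disk of radius 2 this gives H = 5 runs covering N = 13 pixels; the gap between H and N grows with the radius, which is the difference between the best and worst case mentioned above.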

elgw · 2025-04-01T19:37:15Z

elgw
Apr 1, 2025
Author

The memory requirement sure is a downside. And the overall situation with one method per class of structuring element is not ideal (possibly multiplied by the number of hardware configurations as well). Too much code to maintain and test!

If the strel is H x H and the image is N x N, then one dilation per unique line length in the strel will be stored for the H "hot" lines of the input image. Worst case extra storage will be H x H x N. In best case, like for a rectangle the additional memory usage is down to H x N.

There is always the option to only store a logarithmic number of dilations, 2^1+1, 2^2+1, 2^3+1, ... and then calculate the intermediate dilations by a single comparison per pixel "on the fly", so there is a possible memory to performance tradeoff to be made.
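One way to read the "single comparison per pixel" step, as a sketch under my own assumptions: if a line has already been dilated with a centered window of radius rp, the dilation with any radius rw between rp and 2*rp is the max of two shifted copies, because the two shorter windows together cover the longer one. The function name `widen_dilation_1d` is hypothetical, and the stored dilation is assumed to have been computed with windows clipped at the line ends.

```c
#include <assert.h>
#include <stddef.h>

/* dp:  dilation of some line with a centered window of radius rp
 * out: dilation of the same line with radius rw, rp <= rw <= 2*rp
 * Costs one comparison per sample. */
void widen_dilation_1d(const float *dp, float *out, long n, long rp, long rw)
{
    assert(dp != NULL && out != NULL);
    assert(rp >= 0 && rw >= rp && rw <= 2 * rp);
    const long d = rw - rp;   /* shift of the two shorter windows */
    for (long i = 0; i < n; i++) {
        /* Clamp at the ends; the clipped windows still cover the target. */
        const long a = (i - d < 0) ? 0 : i - d;
        const long b = (i + d > n - 1) ? n - 1 : i + d;
        out[i] = dp[a] > dp[b] ? dp[a] : dp[b];
    }
}
```

With stored widths 3, 5, 9, 17, ... (radii 1, 2, 4, 8, ...) every intermediate width can be derived from the stored dilation just below it, since each stored radius is twice the previous one.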

Two comparisons vs DIPlib. Input is a 2048 x 2048 float image. w is the side length of the structuring element.

For squares my implementation requires H comparisons per pixel to assemble along the strided axis. So the method is no silver bullet.

For spheres my implementation is a little faster, which is nice since it can handle a larger class of structuring elements (any -- if one follows the paper closely).

My implementation was not very faithful to the original, so it is possible that I got a thing or two wrong -- I jumped directly to the implementation phase once I (thought) that I got the idea.

Thanks for all the advice! It is great fun to get some first hand experience with these things that I've been using for such a long time now :)

0 replies
