Since the beginning, the numpysane library has provided a broadcast_define() function that decorates existing Python routines to make them broadcasting-aware. This was very useful, but slow. I just did lots of typing, and now I have a flavor of this in C (the numpysane_pywrap module; new in numpysane 0.22). As expected, you get fast C loops! And like the rest of this library, this is a port of something in PDL: PDL::PP.

Full documentation lives here:

https://github.com/dkogan/numpysane/blob/master/README-pywrap.org

After writing this I realized that there was something similar available in numpy this whole time: https://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html

I haven't looked too deeply into this yet, but two things are clear:

There's a design difference: the numpy implementation uses function callbacks, while I generate C code. Code generation is what PDL::PP does, and when I thought about it earlier, it seemed like doing this with function pointers would be too painful. I guess it's doable, though.

And at least in one case, the gufuncs aren't doing the right broadcasting thing:

>>> a = np.arange(5).reshape(5,1)
>>> b = np.arange(3)

>>> np.matmul(a,b)
ValueError: matmul: Input operand 1 has a mismatch in
   its core dimension 0, with gufunc signature
   (n?,k),(k,m?)->(n?,m?) (size 3 is different from 1)

This should work. And if you do this with numpysane.broadcast_define() or with numpysane_pywrap, it does work. I'll look at np.matmul() later to figure out what it's doing.
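
To make the expected result concrete, here is a minimal sketch that uses only numpy. It assumes that "the right broadcasting thing" means promoting b to a 1x3 matrix by prepending a length-1 dimension (the usual broadcasting rule) before the core (n,k),(k,m) dimensions are matched. The name b_row is mine, purely for illustration. With that promotion done by hand, np.matmul() is happy:

```python
import numpy as np

a = np.arange(5).reshape(5, 1)   # shape (5,1)
b = np.arange(3)                 # shape (3,)

# As in the session above, matmul treats the 1-D b as a length-3 vector,
# so its core dimension (3) does not match a's trailing dimension (1).
try:
    np.matmul(a, b)
    raised = False
except ValueError:
    raised = True

# Assumption: the intended interpretation prepends a length-1 dimension
# to b, making it a 1x3 matrix, before matching the core dimensions.
b_row = b[np.newaxis, :]         # shape (1,3)
c = np.matmul(a, b_row)          # (5,1) x (1,3) -> (5,3)

print(raised)
print(c.shape)
```

Under that assumption the result is just the outer product of the two ranges, so c[i,j] is i*j.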