> Can different iterations of this loop run independently?
I'm not a compiler guy, but vectorisation algorithms typically analyze loop-carried dependencies and can vectorise loops that are not trivially data parallel as is the case in the post. Allen & Kennedy (MK, 2002) discusses the classical methods.
Here's an example, I'm not sure whether the post's algorithm would handle it:
phase = 0.0
inc = 0.12345678899
for i in [0, n):
out[i] = table[phase % TABLE_LEN]
phase = phase + inc
Assuming n%4 == 0, this can be trivially 4 lane vectorised as:
phasev = [phase, phase+inc, phase+inc+inc, phase+inc+inc+inc]
incv = [inc, inc, inc, inc]
for i in [0, n) step 4:
out[i..i+3] = table[phasev % TABLE_LEN]
phasev = phasev + incv
When I read Allan and Kennedy my impression was that vectorising arbitrary imperative code is a much harder problem than designing a language that only allows for vectorisable constructs to be expressed in the first place. For example maybe it's better to express trivially data parallel kernels as pure functions over buffers and buffer indices. That's how shader programs work isn't it? In my example that would produce different code, requiring a multiplication:
> Can different iterations of this loop run independently?
I'm not a compiler guy, but vectorisation algorithms typically analyze loop-carried dependencies and can vectorise loops that are not trivially data parallel as is the case in the post. Allen & Kennedy (MK, 2002) discusses the classical methods.
Here's an example, I'm not sure whether the post's algorithm would handle it:
Assuming n%4 == 0, this can be trivially 4 lane vectorised as: When I read Allan and Kennedy my impression was that vectorising arbitrary imperative code is a much harder problem than designing a language that only allows for vectorisable constructs to be expressed in the first place. For example maybe it's better to express trivially data parallel kernels as pure functions over buffers and buffer indices. That's how shader programs work isn't it? In my example that would produce different code, requiring a multiplication: