Hi, I just wrote a small bench program that approximates the number of ticks a given operation takes, and ran all the VFPU ops listed at
http://hitmen.c02.at/files/yapspd/psp_doc/chap4.html#sec4.9 through it. Since I found the results rather informative, I decided to share them here. Maybe some discussion will come of this, so if you have questions about the results, ask, and if you have other information, share it.
Everything was benched with the PSP at its default speed, i.e. 222 MHz, so the ops/µs will increase by 50% when the PSP is set to 333 MHz (333/222 = 1.5). Tick counts won't change though (tested), so they are reliable. The results also include any latencies incurred, so interleaving costly ops with other, independent ops might decrease the real tick cost somewhat.
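For anyone who wants to redo the conversion between ticks/op and ops/µs, here is a minimal sketch of the arithmetic. The helper names (ops_per_us, speedup) are made up for illustration and are not part of the bench program; the only assumption is that a clock of N MHz delivers N ticks per microsecond. The measured values in the table come out slightly below these ideal numbers, presumably because of measurement overhead.

```python
# Hypothetical helpers, not taken from the bench program itself.

def ops_per_us(ticks_per_op, clock_mhz=222):
    """Ideal throughput: a clock of N MHz delivers N ticks per microsecond."""
    return clock_mhz / ticks_per_op

def speedup(old_mhz, new_mhz):
    """Throughput ratio between two clocks when ticks/op stays the same."""
    return new_mhz / old_mhz

if __name__ == "__main__":
    # Tick counts taken from the results table.
    for name, ticks in [("vadd.q", 1), ("vrcp.q", 4), ("vdiv.q", 56)]:
        at_222 = round(ops_per_us(ticks, 222), 1)
        at_333 = round(ops_per_us(ticks, 333), 1)
        print(name, at_222, at_333)
    print("speedup 222 -> 333 MHz:", speedup(222, 333))
```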
UPDATE: I added ulv.q and usv.q ops for unaligned loads/stores
OP ops/µs ticks/op
vadd.q ~220 1
vsub.q ~220 1
vdot.q ~220 1
vmul.q ~220 1
vhdp.q ~220 1
vdiv.q ~4 56
vmmul.q ~14 16
vmin.q ~220 1
vmax.q ~220 1
vabs.q ~220 1
vneg.q ~220 1
vidt.q ~77 3
vzero.q ~77 3
vone.q ~77 3
vrcp.q ~56 4
vrsq.q ~56 4
vsin.q ~56 4
vcos.q ~56 4
vexp2.q ~56 4
vlog2.q ~56 4
vsqrt.q ~56 4
vasin.q ~56 4
vnrcp.q ~56 4
vnsin.q ~56 4
vrexp2.q ~56 4
vi2uc.q ~220 1
vi2s.q ~220 1
vsgn.q ~220 1
vcst.q ~220 1
vf2in.q ~220 1
vi2f.q ~220 1
vhtfm4.q ~56 4
vtfm4.q ~56 4
vmidt.q ~19 12
vmzero.q ~19 12
lv.q(cache) ~219 1
lv.q(mem) ~4 68
ulv.q(cache) ~109 2
ulv.q(mem) ~4 68
sv.q(cache) ~32 7
sv.q(mem) ~2 111
usv.q(cache) ~16 14
usv.q(mem) ~2 111
What I can say after this is that, apart from memory reads/writes, vector division is the most costly op, so avoid it whenever possible. Loads and stores that hit the cache are also far cheaper than ones that go out to memory, so watch your data structures and access patterns.
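To put rough numbers on that, here is a small cost-model sketch that just adds up tick counts from the table. Everything in it (the TICKS dict, sequence_ticks) is hypothetical illustration code, not something from the bench. Two assumptions to keep in mind: replacing vdiv.q with vrcp.q followed by vmul.q only makes sense if the precision of vrcp.q is good enough for your use, which I have not checked here, and simply summing costs is probably an upper bound, since the measured ticks already include latencies.

```python
# Tick costs copied from the results table (quad ops).
TICKS = {
    "vdiv.q": 56, "vrcp.q": 4, "vmul.q": 1,
    "lv.q(cache)": 1, "lv.q(mem)": 68,
    "sv.q(cache)": 7, "sv.q(mem)": 111,
}

def sequence_ticks(ops):
    """Sum the table cost of a list of op names (likely an upper bound)."""
    return sum(TICKS[op] for op in ops)

# a / b done directly vs. as a * (1 / b); the second form assumes
# the precision of vrcp.q is acceptable.
divide = sequence_ticks(["vdiv.q"])
recip_mul = sequence_ticks(["vrcp.q", "vmul.q"])

# Loading 100 quads that all hit the cache vs. all going to memory.
cached_loads = sequence_ticks(["lv.q(cache)"] * 100)
memory_loads = sequence_ticks(["lv.q(mem)"] * 100)

if __name__ == "__main__":
    print("vdiv.q:", divide, "ticks; vrcp.q + vmul.q:", recip_mul, "ticks")
    print("100 loads:", cached_loads, "ticks cached,", memory_loads, "ticks from memory")
```

Real code will land somewhere between the two load cases depending on how much of the data is already in cache.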
If I find time, I may also bench the triple, pair and single ops for comparison. A comparison with the MIPS counterpart ops might be useful too (esp. for vdiv and vmmul, where it's not clear whether the VFPU is really faster).
NOTE: If I missed something important, please LMK. I'm basing these results on my current knowledge of op tick costs and latencies, which might not be 100% correct. So these results are also not warranted for
