Flatmush
« Reply #15 on: September 03, 2008, 02:25:21 AM »
Wow, can't even think how you'd be able to improve on yours, Raph, nice job.
After reading up on MIPS assembly I kinda realised half of my cases were absolute rubbish anyway, as it can't read or write more than a word at a time.
Firmware History: 2.60 -> 2.71 -> 1.50 -> 3.03oe-c    Hehe I'm a "Hero Member" because I bought posts back when they were in the shop. Creator of FlatEditPSP, funcLib and flAstro
Raphael
« Reply #16 on: September 03, 2008, 04:18:43 AM »
Well, as I said, the weak spot is source-unaligned copies in the 8-128 byte range, so adding special cases for that would help get closer to daniel's implementation in that range and make it pretty much the ultimate version. There's still that while (size>=16) loop in the vfpu code that writeback-invalidates a whole cache line (64 bytes) on each iteration, which is more than necessary, as it would only need to do so every fourth iteration. I didn't make the effort to implement it differently and bench it, just as I didn't really bench the benefit of that loop over a straight C version. And yeah, I already wondered why you'd create such *fake* types like u128 and u256 as u32 arrays and try to copy those, seeing how the PSP doesn't even support u64 natively. With good compiler optimization it should come out pretty much the same as the normal *dst++ = *src++ loops though, so it at least makes the code a bit more straightforward.
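The "every fourth iteration" point can be checked without PSP hardware. The following is only a sketch that counts how many cache operations a 16-bytes-per-iteration loop would issue. The helper names count_cache_ops_naive and count_cache_ops_per_line are made up for illustration, and the 64-byte line size is the one quoted in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* line size quoted in the post */
#define STEP       16u  /* bytes stored per loop iteration */

/* One cache op per 16-byte iteration, as in the loop described above. */
static size_t count_cache_ops_naive(uintptr_t dst, size_t size)
{
    size_t ops = 0;
    assert((dst % STEP) == 0);      /* dst is 16-byte aligned in that loop */
    while (size >= STEP) {
        ops++;                      /* the "cache" instruction would go here */
        dst += STEP;
        size -= STEP;
    }
    return ops;
}

/* One cache op only when a store enters a new 64-byte line. */
static size_t count_cache_ops_per_line(uintptr_t dst, size_t size)
{
    size_t ops = 0;
    uintptr_t last_line = (uintptr_t)-1;  /* never equals a masked line address */
    assert((dst % STEP) == 0);
    while (size >= STEP) {
        uintptr_t line = dst & ~(uintptr_t)(CACHE_LINE - 1);
        if (line != last_line) {
            ops++;
            last_line = line;
        }
        dst += STEP;
        size -= STEP;
    }
    return ops;
}
```

For a 256-byte copy starting on a line boundary, the first helper counts 16 operations and the second counts 4, which is the 4:1 ratio mentioned above. Whether the saving is measurable on real hardware was not benchmarked in the thread.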
« Last Edit: September 03, 2008, 04:33:11 AM by Raphael »
Raphael
« Reply #17 on: September 03, 2008, 07:10:42 AM »
I just found that my previous version of vfpu_memcpy was faulty (at least for copies where dst is not 64-byte aligned). The reason was that I was aligning dst on 16 bytes (for the sv.q stores) but at the same time used that address for the cache invalidate, which would then only invalidate the line at the next lower 64-byte aligned address, leaving out bytes at the end, which would then not get written correctly. The solution is to do an additional line invalidate before the actual copy loop on the next lower 64-byte aligned address, and then keep invalidating on the following 64-byte aligned addresses in the loop. This leads to an additional unneeded invalidate in cases where dst is already aligned on 64 bytes, but the performance drop is not really noticeable. Also, I remembered why "ulv.q was bugged with sv.q, wb" - it's not really bugged, it's just that ulv.q only reads qword-unaligned addresses that are still word aligned. So I put in another special case for when src is only word aligned and not qword aligned and used ulv.q there, which works out pretty nicely.
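The alignment arithmetic behind that fix can be sketched in portable C. This is only an illustration of the address math, not the VFPU code itself. The names line_floor, lines_touched and classify_src are hypothetical, and the mapping of source alignment to lv.q / ulv.q follows the description in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* data cache line size quoted in the thread */

/* Next lower 64-byte aligned address (the line a cache op actually hits). */
static uintptr_t line_floor(uintptr_t addr)
{
    assert((CACHE_LINE & (CACHE_LINE - 1)) == 0);  /* must be a power of two */
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Number of cache lines covered by [dst, dst + size). */
static size_t lines_touched(uintptr_t dst, size_t size)
{
    if (size == 0)
        return 0;
    return (size_t)((line_floor(dst + size - 1) - line_floor(dst)) / CACHE_LINE) + 1;
}

typedef enum {
    SRC_QWORD_ALIGNED,  /* 16-byte aligned: plain lv.q loads */
    SRC_WORD_ALIGNED,   /* 4-byte aligned only: ulv.q, per the post */
    SRC_BYTE_ALIGNED    /* anything else: needs another path */
} src_alignment;

static src_alignment classify_src(uintptr_t src)
{
    if ((src & 0xF) == 0) return SRC_QWORD_ALIGNED;
    if ((src & 0x3) == 0) return SRC_WORD_ALIGNED;
    return SRC_BYTE_ALIGNED;
}
```

A 64-byte store block starting at 0x1010 is 16-byte aligned but covers two lines (0x1000 and 0x1040), so one invalidate per block on the floored address misses the tail. That is the case the extra invalidate before the loop is there to cover.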
Here's the updated test run results:
1 bytes (4 bytes aligned), 100.0%, 91.7%, 69.8%, 100.8%
2 bytes (4 bytes aligned), 100.0%, 92.6%, 76.5%, 100.6%
4 bytes (4 bytes aligned), 100.0%, 135.3%, 111.3%, 69.5%
8 bytes (4 bytes aligned), 100.0%, 125.5%, 170.2%, 150.6%
16 bytes (4 bytes aligned), 100.0%, 63.5%, 102.2%, 84.9%
32 bytes (4 bytes aligned), 100.0%, 87.6%, 116.0%, 88.8%
64 bytes (4 bytes aligned), 100.0%, 90.2%, 116.5%, 149.6%
128 bytes (4 bytes aligned), 100.0%, 108.6%, 121.7%, 210.3%
256 bytes (4 bytes aligned), 100.0%, 102.9%, 119.1%, 269.0%
512 bytes (4 bytes aligned), 100.0%, 109.2%, 118.0%, 319.2%
1024 bytes (4 bytes aligned), 100.0%, 112.3%, 117.4%, 354.1%
2048 bytes (4 bytes aligned), 100.0%, 117.4%, 116.4%, 380.9%
4096 bytes (4 bytes aligned), 100.0%, 116.0%, 117.2%, 391.5%
8192 bytes (4 bytes aligned), 100.0%, 117.0%, 116.8%, 399.4%
16384 bytes (4 bytes aligned), 100.0%, 103.9%, 103.9%, 1053.8%
32768 bytes (4 bytes aligned), 100.0%, 103.8%, 103.9%, 182.3%
65536 bytes (4 bytes aligned), 100.0%, 103.8%, 103.9%, 182.4%
1 bytes (source + 1), 100.0%, 72.9%, 62.0%, 100.8%
2 bytes (source + 1), 100.0%, 92.1%, 72.1%, 100.6%
4 bytes (source + 1), 100.0%, 83.5%, 67.4%, 100.9%
8 bytes (source + 1), 100.0%, 70.4%, 71.7%, 84.5%
16 bytes (source + 1), 100.0%, 118.2%, 84.2%, 106.4%
32 bytes (source + 1), 100.0%, 152.0%, 85.6%, 124.0%
64 bytes (source + 1), 100.0%, 207.8%, 80.6%, 130.5%
128 bytes (source + 1), 100.0%, 253.1%, 87.9%, 158.2%
256 bytes (source + 1), 100.0%, 292.3%, 85.8%, 226.1%
512 bytes (source + 1), 100.0%, 317.2%, 88.0%, 297.1%
1024 bytes (source + 1), 100.0%, 328.8%, 87.6%, 342.4%
2048 bytes (source + 1), 100.0%, 332.9%, 87.6%, 373.6%
4096 bytes (source + 1), 100.0%, 336.2%, 87.6%, 393.3%
8192 bytes (source + 1), 100.0%, 332.5%, 87.6%, 405.3%
16384 bytes (source + 1), 100.0%, 228.7%, 90.8%, 556.3%
32768 bytes (source + 1), 100.0%, 229.0%, 90.7%, 323.0%
65536 bytes (source + 1), 100.0%, 229.2%, 90.7%, 323.7%
1 bytes (source + 2), 100.0%, 91.0%, 69.5%, 100.0%
2 bytes (source + 2), 100.0%, 93.2%, 69.5%, 100.6%
4 bytes (source + 2), 100.0%, 83.8%, 91.2%, 100.4%
8 bytes (source + 2), 100.0%, 89.6%, 110.2%, 89.0%
16 bytes (source + 2), 100.0%, 122.0%, 132.2%, 109.6%
32 bytes (source + 2), 100.0%, 156.7%, 148.2%, 133.2%
64 bytes (source + 2), 100.0%, 214.7%, 148.3%, 133.7%
128 bytes (source + 2), 100.0%, 258.6%, 156.6%, 157.0%
256 bytes (source + 2), 100.0%, 298.8%, 174.3%, 227.0%
512 bytes (source + 2), 100.0%, 318.1%, 172.7%, 297.1%
1024 bytes (source + 2), 100.0%, 325.6%, 174.5%, 339.9%
2048 bytes (source + 2), 100.0%, 333.1%, 174.8%, 372.6%
4096 bytes (source + 2), 100.0%, 335.4%, 175.1%, 393.6%
8192 bytes (source + 2), 100.0%, 333.3%, 174.1%, 405.2%
16384 bytes (source + 2), 100.0%, 227.7%, 144.1%, 552.9%
32768 bytes (source + 2), 100.0%, 228.1%, 144.1%, 321.6%
65536 bytes (source + 2), 100.0%, 228.3%, 144.1%, 322.5%
1 bytes (source + 3), 100.0%, 91.7%, 55.9%, 100.8%
2 bytes (source + 3), 100.0%, 92.6%, 72.1%, 100.6%
4 bytes (source + 3), 100.0%, 83.8%, 75.9%, 100.4%
8 bytes (source + 3), 100.0%, 95.9%, 79.5%, 92.9%
16 bytes (source + 3), 100.0%, 129.4%, 84.1%, 113.4%
32 bytes (source + 3), 100.0%, 161.8%, 85.5%, 124.0%
64 bytes (source + 3), 100.0%, 221.0%, 88.0%, 135.7%
128 bytes (source + 3), 100.0%, 267.7%, 88.4%, 161.6%
256 bytes (source + 3), 100.0%, 291.8%, 85.7%, 225.1%
512 bytes (source + 3), 100.0%, 312.7%, 87.9%, 297.1%
1024 bytes (source + 3), 100.0%, 325.8%, 87.6%, 340.1%
2048 bytes (source + 3), 100.0%, 333.5%, 87.6%, 374.2%
4096 bytes (source + 3), 100.0%, 336.6%, 87.6%, 393.8%
8192 bytes (source + 3), 100.0%, 332.8%, 87.7%, 405.2%
16384 bytes (source + 3), 100.0%, 224.5%, 90.9%, 545.1%
32768 bytes (source + 3), 100.0%, 224.8%, 90.9%, 316.9%
65536 bytes (source + 3), 100.0%, 224.9%, 90.9%, 317.6%
1 bytes (dest + 1), 100.0%, 91.0%, 69.5%, 100.8%
2 bytes (dest + 1), 100.0%, 92.6%, 62.2%, 100.6%
4 bytes (dest + 1), 100.0%, 119.5%, 76.1%, 143.8%
8 bytes (dest + 1), 100.0%, 98.6%, 64.0%, 113.1%
16 bytes (dest + 1), 100.0%, 139.3%, 75.0%, 137.0%
32 bytes (dest + 1), 100.0%, 204.6%, 76.0%, 140.0%
64 bytes (dest + 1), 100.0%, 256.1%, 82.5%, 268.1%
128 bytes (dest + 1), 100.0%, 287.8%, 80.9%, 321.1%
256 bytes (dest + 1), 100.0%, 317.3%, 87.0%, 368.6%
512 bytes (dest + 1), 100.0%, 328.4%, 87.2%, 389.9%
1024 bytes (dest + 1), 100.0%, 334.7%, 87.2%, 402.5%
2048 bytes (dest + 1), 100.0%, 335.8%, 87.5%, 407.0%
4096 bytes (dest + 1), 100.0%, 336.9%, 87.5%, 412.2%
8192 bytes (dest + 1), 100.0%, 333.6%, 87.6%, 416.8%
16384 bytes (dest + 1), 100.0%, 215.6%, 90.0%, 529.1%
32768 bytes (dest + 1), 100.0%, 215.8%, 90.0%, 326.4%
65536 bytes (dest + 1), 100.0%, 216.0%, 90.0%, 326.7%
1 bytes (dest + 2), 100.0%, 94.4%, 48.4%, 104.6%
2 bytes (dest + 2), 100.0%, 94.9%, 78.4%, 103.1%
4 bytes (dest + 2), 100.0%, 85.6%, 81.4%, 103.1%
8 bytes (dest + 2), 100.0%, 104.7%, 90.9%, 114.4%
16 bytes (dest + 2), 100.0%, 138.0%, 109.5%, 130.2%
32 bytes (dest + 2), 100.0%, 212.2%, 127.5%, 140.3%
64 bytes (dest + 2), 100.0%, 257.6%, 146.3%, 263.7%
128 bytes (dest + 2), 100.0%, 291.9%, 150.9%, 313.2%
256 bytes (dest + 2), 100.0%, 319.8%, 168.4%, 368.7%
512 bytes (dest + 2), 100.0%, 329.8%, 170.6%, 389.9%
1024 bytes (dest + 2), 100.0%, 331.2%, 173.2%, 402.5%
2048 bytes (dest + 2), 100.0%, 336.2%, 174.3%, 407.1%
4096 bytes (dest + 2), 100.0%, 338.0%, 174.4%, 411.3%
8192 bytes (dest + 2), 100.0%, 335.9%, 174.4%, 415.5%
16384 bytes (dest + 2), 100.0%, 215.6%, 149.0%, 533.0%
32768 bytes (dest + 2), 100.0%, 215.9%, 149.2%, 326.3%
65536 bytes (dest + 2), 100.0%, 216.0%, 149.3%, 326.6%
1 bytes (dest + 3), 100.0%, 106.9%, 81.5%, 117.6%
2 bytes (dest + 3), 100.0%, 105.1%, 71.0%, 114.8%
4 bytes (dest + 3), 100.0%, 91.9%, 67.2%, 111.1%
8 bytes (dest + 3), 100.0%, 107.4%, 58.3%, 120.1%
16 bytes (dest + 3), 100.0%, 116.8%, 67.3%, 134.2%
32 bytes (dest + 3), 100.0%, 211.8%, 77.6%, 142.7%
64 bytes (dest + 3), 100.0%, 257.3%, 82.0%, 266.5%
128 bytes (dest + 3), 100.0%, 266.2%, 83.9%, 323.4%
256 bytes (dest + 3), 100.0%, 309.5%, 85.2%, 364.1%
512 bytes (dest + 3), 100.0%, 322.0%, 87.3%, 390.5%
1024 bytes (dest + 3), 100.0%, 335.1%, 87.3%, 401.9%
2048 bytes (dest + 3), 100.0%, 336.0%, 87.5%, 407.1%
4096 bytes (dest + 3), 100.0%, 337.3%, 87.5%, 411.4%
8192 bytes (dest + 3), 100.0%, 337.6%, 87.8%, 416.8%
16384 bytes (dest + 3), 100.0%, 215.8%, 90.2%, 532.6%
32768 bytes (dest + 3), 100.0%, 215.9%, 90.2%, 326.3%
65536 bytes (dest + 3), 100.0%, 216.0%, 90.2%, 326.6%
1 bytes (source + 1, dest + 3), 100.0%, 108.3%, 73.7%, 120.8%
2 bytes (source + 1, dest + 3), 100.0%, 105.6%, 82.7%, 114.7%
4 bytes (source + 1, dest + 3), 100.0%, 92.3%, 83.6%, 95.1%
8 bytes (source + 1, dest + 3), 100.0%, 93.3%, 84.9%, 90.6%
16 bytes (source + 1, dest + 3), 100.0%, 124.5%, 71.1%, 110.0%
32 bytes (source + 1, dest + 3), 100.0%, 149.9%, 87.4%, 126.5%
64 bytes (source + 1, dest + 3), 100.0%, 214.7%, 87.4%, 134.1%
128 bytes (source + 1, dest + 3), 100.0%, 258.5%, 86.8%, 157.5%
256 bytes (source + 1, dest + 3), 100.0%, 294.1%, 86.6%, 226.6%
512 bytes (source + 1, dest + 3), 100.0%, 314.5%, 87.5%, 296.8%
1024 bytes (source + 1, dest + 3), 100.0%, 329.3%, 87.7%, 343.0%
2048 bytes (source + 1, dest + 3), 100.0%, 333.2%, 87.6%, 373.9%
4096 bytes (source + 1, dest + 3), 100.0%, 335.8%, 87.6%, 393.4%
8192 bytes (source + 1, dest + 3), 100.0%, 329.8%, 87.7%, 407.2%
16384 bytes (source + 1, dest + 3), 100.0%, 215.1%, 90.0%, 520.0%
32768 bytes (source + 1, dest + 3), 100.0%, 215.5%, 90.0%, 300.6%
65536 bytes (source + 1, dest + 3), 100.0%, 215.8%, 89.9%, 301.4%
1 bytes (source + 2, dest + 2), 100.0%, 66.7%, 71.6%, 104.6%
2 bytes (source + 2, dest + 2), 100.0%, 95.5%, 61.3%, 103.7%
4 bytes (source + 2, dest + 2), 100.0%, 85.2%, 92.4%, 102.7%
8 bytes (source + 2, dest + 2), 100.0%, 103.2%, 110.8%, 92.7%
16 bytes (source + 2, dest + 2), 100.0%, 119.9%, 119.7%, 132.8%
32 bytes (source + 2, dest + 2), 100.0%, 164.2%, 139.6%, 204.0%
64 bytes (source + 2, dest + 2), 100.0%, 287.5%, 160.3%, 148.6%
128 bytes (source + 2, dest + 2), 100.0%, 402.3%, 172.6%, 226.2%
256 bytes (source + 2, dest + 2), 100.0%, 486.8%, 174.4%, 401.6%
512 bytes (source + 2, dest + 2), 100.0%, 546.2%, 173.1%, 679.1%
1024 bytes (source + 2, dest + 2), 100.0%, 583.0%, 174.6%, 964.1%
2048 bytes (source + 2, dest + 2), 100.0%, 596.7%, 174.9%, 1345.4%
4096 bytes (source + 2, dest + 2), 100.0%, 610.7%, 174.9%, 1655.4%
8192 bytes (source + 2, dest + 2), 100.0%, 585.8%, 173.4%, 1841.0%
16384 bytes (source + 2, dest + 2), 100.0%, 268.8%, 148.9%, 2306.2%
32768 bytes (source + 2, dest + 2), 100.0%, 269.3%, 149.1%, 472.2%
65536 bytes (source + 2, dest + 2), 100.0%, 269.7%, 149.3%, 473.6%
1 bytes (source + 3, dest + 1), 100.0%, 91.0%, 55.9%, 100.8%
2 bytes (source + 3, dest + 1), 100.0%, 92.6%, 72.1%, 100.6%
4 bytes (source + 3, dest + 1), 100.0%, 83.5%, 75.7%, 100.0%
8 bytes (source + 3, dest + 1), 100.0%, 100.9%, 79.5%, 92.7%
16 bytes (source + 3, dest + 1), 100.0%, 134.2%, 84.0%, 113.0%
32 bytes (source + 3, dest + 1), 100.0%, 165.9%, 75.8%, 129.1%
64 bytes (source + 3, dest + 1), 100.0%, 220.8%, 81.9%, 133.1%
128 bytes (source + 3, dest + 1), 100.0%, 259.7%, 84.5%, 159.0%
256 bytes (source + 3, dest + 1), 100.0%, 303.7%, 88.6%, 226.0%
512 bytes (source + 3, dest + 1), 100.0%, 320.8%, 87.9%, 296.9%
1024 bytes (source + 3, dest + 1), 100.0%, 326.8%, 87.8%, 343.9%
2048 bytes (source + 3, dest + 1), 100.0%, 333.8%, 87.7%, 373.0%
4096 bytes (source + 3, dest + 1), 100.0%, 336.7%, 87.6%, 394.0%
8192 bytes (source + 3, dest + 1), 100.0%, 330.3%, 87.6%, 407.1%
16384 bytes (source + 3, dest + 1), 100.0%, 227.6%, 91.4%, 549.4%
32768 bytes (source + 3, dest + 1), 100.0%, 228.1%, 91.3%, 321.6%
65536 bytes (source + 3, dest + 1), 100.0%, 228.3%, 91.2%, 322.4%
Notice the 4-byte aligned copy result of only 64%, even though the code for that size wasn't touched at all. Those are the sporadic inconsistencies I was talking about earlier (my best guess would be that PSPLink somehow interferes... maybe it would be a good idea to disable interrupts during the runs). I also updated the code in my previous post with the new code for vfpu_memcpy.
Raphael
« Reply #18 on: September 03, 2008, 07:19:19 AM »
I just redid the test with disabled interrupts and it seems that really helps to steady up the results:
1 bytes (4 bytes aligned), 100.0%, 91.7%, 69.8%, 100.8%
2 bytes (4 bytes aligned), 100.0%, 92.7%, 77.4%, 100.6%
4 bytes (4 bytes aligned), 100.0%, 135.9%, 111.8%, 100.4%
8 bytes (4 bytes aligned), 100.0%, 125.5%, 170.2%, 150.6%
16 bytes (4 bytes aligned), 100.0%, 63.5%, 102.2%, 85.2%
32 bytes (4 bytes aligned), 100.0%, 87.9%, 117.1%, 89.6%
64 bytes (4 bytes aligned), 100.0%, 96.5%, 116.7%, 149.0%
128 bytes (4 bytes aligned), 100.0%, 104.3%, 116.8%, 202.5%
256 bytes (4 bytes aligned), 100.0%, 109.6%, 116.8%, 264.1%
512 bytes (4 bytes aligned), 100.0%, 112.9%, 116.6%, 316.3%
1024 bytes (4 bytes aligned), 100.0%, 114.6%, 116.6%, 350.1%
2048 bytes (4 bytes aligned), 100.0%, 115.6%, 116.6%, 374.2%
4096 bytes (4 bytes aligned), 100.0%, 116.1%, 116.6%, 386.8%
8192 bytes (4 bytes aligned), 100.0%, 116.4%, 116.6%, 393.7%
16384 bytes (4 bytes aligned), 100.0%, 103.8%, 103.8%, 1061.8%
32768 bytes (4 bytes aligned), 100.0%, 103.8%, 103.8%, 182.1%
65536 bytes (4 bytes aligned), 100.0%, 103.8%, 103.8%, 182.3%
1 bytes (source + 1), 100.0%, 91.7%, 62.7%, 102.3%
2 bytes (source + 1), 100.0%, 93.2%, 72.6%, 101.2%
4 bytes (source + 1), 100.0%, 83.8%, 76.2%, 100.9%
8 bytes (source + 1), 100.0%, 86.2%, 80.4%, 85.4%
16 bytes (source + 1), 100.0%, 118.2%, 84.0%, 106.0%
32 bytes (source + 1), 100.0%, 152.0%, 85.5%, 123.8%
64 bytes (source + 1), 100.0%, 207.6%, 86.5%, 130.7%
128 bytes (source + 1), 100.0%, 255.9%, 87.0%, 156.8%
256 bytes (source + 1), 100.0%, 291.3%, 87.2%, 225.0%
512 bytes (source + 1), 100.0%, 313.1%, 87.4%, 293.3%
1024 bytes (source + 1), 100.0%, 325.6%, 87.4%, 339.9%
2048 bytes (source + 1), 100.0%, 332.2%, 87.5%, 373.2%
4096 bytes (source + 1), 100.0%, 335.6%, 87.5%, 392.8%
8192 bytes (source + 1), 100.0%, 332.6%, 87.6%, 403.7%
16384 bytes (source + 1), 100.0%, 229.0%, 90.7%, 558.9%
32768 bytes (source + 1), 100.0%, 229.3%, 90.7%, 323.4%
65536 bytes (source + 1), 100.0%, 229.5%, 90.7%, 324.1%
1 bytes (source + 2), 100.0%, 91.7%, 69.5%, 100.8%
2 bytes (source + 2), 100.0%, 93.2%, 69.8%, 101.2%
4 bytes (source + 2), 100.0%, 83.8%, 91.6%, 100.9%
8 bytes (source + 2), 100.0%, 89.2%, 110.3%, 88.7%
16 bytes (source + 2), 100.0%, 122.2%, 132.4%, 110.0%
32 bytes (source + 2), 100.0%, 155.2%, 148.7%, 126.6%
64 bytes (source + 2), 100.0%, 211.1%, 160.1%, 131.5%
128 bytes (source + 2), 100.0%, 258.7%, 167.0%, 157.3%
256 bytes (source + 2), 100.0%, 293.0%, 170.8%, 225.2%
512 bytes (source + 2), 100.0%, 314.1%, 172.8%, 293.5%
1024 bytes (source + 2), 100.0%, 326.1%, 173.9%, 340.7%
2048 bytes (source + 2), 100.0%, 332.5%, 174.4%, 373.5%
4096 bytes (source + 2), 100.0%, 335.8%, 174.7%, 392.8%
8192 bytes (source + 2), 100.0%, 332.8%, 174.0%, 403.7%
16384 bytes (source + 2), 100.0%, 228.0%, 144.1%, 556.5%
32768 bytes (source + 2), 100.0%, 228.5%, 144.1%, 322.2%
65536 bytes (source + 2), 100.0%, 228.7%, 144.1%, 322.9%
1 bytes (source + 3), 100.0%, 91.0%, 55.9%, 100.8%
2 bytes (source + 3), 100.0%, 93.2%, 72.6%, 101.2%
4 bytes (source + 3), 100.0%, 83.8%, 76.2%, 100.9%
8 bytes (source + 3), 100.0%, 95.9%, 79.9%, 92.9%
16 bytes (source + 3), 100.0%, 129.6%, 84.3%, 113.8%
32 bytes (source + 3), 100.0%, 161.8%, 85.5%, 129.2%
64 bytes (source + 3), 100.0%, 217.1%, 86.5%, 133.3%
128 bytes (source + 3), 100.0%, 262.9%, 87.0%, 158.9%
256 bytes (source + 3), 100.0%, 295.9%, 87.2%, 225.3%
512 bytes (source + 3), 100.0%, 315.9%, 87.4%, 293.6%
1024 bytes (source + 3), 100.0%, 327.0%, 87.4%, 341.7%
2048 bytes (source + 3), 100.0%, 333.0%, 87.5%, 373.8%
4096 bytes (source + 3), 100.0%, 336.0%, 87.5%, 393.2%
8192 bytes (source + 3), 100.0%, 335.0%, 87.5%, 402.3%
16384 bytes (source + 3), 100.0%, 225.0%, 90.8%, 552.3%
32768 bytes (source + 3), 100.0%, 225.2%, 90.9%, 317.4%
65536 bytes (source + 3), 100.0%, 225.3%, 90.9%, 318.0%
1 bytes (dest + 1), 100.0%, 92.4%, 70.9%, 103.1%
2 bytes (dest + 1), 100.0%, 93.2%, 62.1%, 100.6%
4 bytes (dest + 1), 100.0%, 83.8%, 61.2%, 100.4%
8 bytes (dest + 1), 100.0%, 98.3%, 63.9%, 113.1%
16 bytes (dest + 1), 100.0%, 132.3%, 71.0%, 129.8%
32 bytes (dest + 1), 100.0%, 204.6%, 76.1%, 140.0%
64 bytes (dest + 1), 100.0%, 252.1%, 81.1%, 264.3%
128 bytes (dest + 1), 100.0%, 288.1%, 84.1%, 321.2%
256 bytes (dest + 1), 100.0%, 311.1%, 85.7%, 361.5%
512 bytes (dest + 1), 100.0%, 324.4%, 86.6%, 388.4%
1024 bytes (dest + 1), 100.0%, 331.6%, 87.1%, 399.1%
2048 bytes (dest + 1), 100.0%, 335.3%, 87.3%, 406.8%
4096 bytes (dest + 1), 100.0%, 337.2%, 87.4%, 410.7%
8192 bytes (dest + 1), 100.0%, 335.8%, 87.5%, 413.7%
16384 bytes (dest + 1), 100.0%, 215.6%, 90.0%, 534.7%
32768 bytes (dest + 1), 100.0%, 215.8%, 90.0%, 326.3%
65536 bytes (dest + 1), 100.0%, 215.9%, 90.0%, 326.6%
1 bytes (dest + 2), 100.0%, 91.0%, 69.5%, 100.8%
2 bytes (dest + 2), 100.0%, 93.2%, 77.5%, 101.9%
4 bytes (dest + 2), 100.0%, 83.8%, 79.4%, 100.4%
8 bytes (dest + 2), 100.0%, 103.8%, 89.9%, 113.4%
16 bytes (dest + 2), 100.0%, 136.9%, 108.6%, 129.4%
32 bytes (dest + 2), 100.0%, 211.1%, 127.3%, 140.1%
64 bytes (dest + 2), 100.0%, 256.9%, 146.1%, 263.8%
128 bytes (dest + 2), 100.0%, 291.7%, 158.8%, 321.0%
256 bytes (dest + 2), 100.0%, 313.2%, 166.4%, 361.4%
512 bytes (dest + 2), 100.0%, 325.5%, 170.5%, 385.2%
1024 bytes (dest + 2), 100.0%, 332.2%, 172.7%, 400.3%
2048 bytes (dest + 2), 100.0%, 335.6%, 173.8%, 406.8%
4096 bytes (dest + 2), 100.0%, 337.4%, 174.4%, 410.7%
8192 bytes (dest + 2), 100.0%, 335.8%, 174.2%, 413.5%
16384 bytes (dest + 2), 100.0%, 215.6%, 149.1%, 534.6%
32768 bytes (dest + 2), 100.0%, 215.9%, 149.2%, 326.4%
65536 bytes (dest + 2), 100.0%, 216.0%, 149.3%, 326.7%
1 bytes (dest + 3), 100.0%, 94.4%, 71.6%, 103.0%
2 bytes (dest + 3), 100.0%, 95.5%, 64.3%, 103.7%
4 bytes (dest + 3), 100.0%, 84.9%, 62.1%, 101.8%
8 bytes (dest + 3), 100.0%, 102.9%, 64.9%, 115.1%
16 bytes (dest + 3), 100.0%, 135.3%, 71.4%, 130.5%
32 bytes (dest + 3), 100.0%, 208.9%, 76.4%, 140.6%
64 bytes (dest + 3), 100.0%, 255.0%, 81.3%, 265.1%
128 bytes (dest + 3), 100.0%, 290.1%, 84.2%, 322.1%
256 bytes (dest + 3), 100.0%, 312.3%, 85.8%, 365.2%
512 bytes (dest + 3), 100.0%, 325.1%, 86.6%, 385.3%
1024 bytes (dest + 3), 100.0%, 331.9%, 87.1%, 399.3%
2048 bytes (dest + 3), 100.0%, 335.5%, 87.3%, 406.8%
4096 bytes (dest + 3), 100.0%, 337.3%, 87.4%, 410.8%
8192 bytes (dest + 3), 100.0%, 335.8%, 87.5%, 413.7%
16384 bytes (dest + 3), 100.0%, 215.7%, 90.1%, 534.8%
32768 bytes (dest + 3), 100.0%, 215.9%, 90.2%, 326.4%
65536 bytes (dest + 3), 100.0%, 216.0%, 90.2%, 326.7%
1 bytes (source + 1, dest + 3), 100.0%, 93.2%, 63.8%, 103.8%
2 bytes (source + 1, dest + 3), 100.0%, 94.9%, 74.3%, 103.7%
4 bytes (source + 1, dest + 3), 100.0%, 85.3%, 77.6%, 102.7%
8 bytes (source + 1, dest + 3), 100.0%, 88.4%, 80.6%, 85.9%
16 bytes (source + 1, dest + 3), 100.0%, 120.8%, 84.7%, 107.1%
32 bytes (source + 1, dest + 3), 100.0%, 154.3%, 85.9%, 124.5%
64 bytes (source + 1, dest + 3), 100.0%, 209.6%, 86.6%, 131.2%
128 bytes (source + 1, dest + 3), 100.0%, 257.4%, 87.1%, 156.9%
256 bytes (source + 1, dest + 3), 100.0%, 292.2%, 87.3%, 225.3%
512 bytes (source + 1, dest + 3), 100.0%, 313.7%, 87.4%, 293.5%
1024 bytes (source + 1, dest + 3), 100.0%, 325.9%, 87.4%, 339.7%
2048 bytes (source + 1, dest + 3), 100.0%, 332.4%, 87.5%, 373.5%
4096 bytes (source + 1, dest + 3), 100.0%, 335.7%, 87.5%, 392.8%
8192 bytes (source + 1, dest + 3), 100.0%, 329.7%, 87.6%, 405.5%
16384 bytes (source + 1, dest + 3), 100.0%, 215.0%, 90.0%, 521.9%
32768 bytes (source + 1, dest + 3), 100.0%, 215.5%, 89.9%, 300.7%
65536 bytes (source + 1, dest + 3), 100.0%, 215.8%, 89.9%, 301.4%
1 bytes (source + 2, dest + 2), 100.0%, 91.7%, 69.5%, 101.5%
2 bytes (source + 2, dest + 2), 100.0%, 92.1%, 69.4%, 100.6%
4 bytes (source + 2, dest + 2), 100.0%, 83.5%, 90.8%, 100.9%
8 bytes (source + 2, dest + 2), 100.0%, 102.3%, 109.9%, 91.7%
16 bytes (source + 2, dest + 2), 100.0%, 143.8%, 131.8%, 132.1%
32 bytes (source + 2, dest + 2), 100.0%, 190.0%, 148.2%, 203.4%
64 bytes (source + 2, dest + 2), 100.0%, 287.0%, 160.0%, 149.7%
128 bytes (source + 2, dest + 2), 100.0%, 388.5%, 166.9%, 223.0%
256 bytes (source + 2, dest + 2), 100.0%, 477.0%, 170.8%, 402.5%
512 bytes (source + 2, dest + 2), 100.0%, 539.0%, 172.8%, 670.3%
1024 bytes (source + 2, dest + 2), 100.0%, 576.9%, 173.9%, 1000.7%
2048 bytes (source + 2, dest + 2), 100.0%, 598.2%, 174.4%, 1357.3%
4096 bytes (source + 2, dest + 2), 100.0%, 609.4%, 174.7%, 1661.9%
8192 bytes (source + 2, dest + 2), 100.0%, 583.9%, 173.5%, 1826.6%
16384 bytes (source + 2, dest + 2), 100.0%, 268.7%, 148.9%, 2347.5%
32768 bytes (source + 2, dest + 2), 100.0%, 269.3%, 149.2%, 472.0%
65536 bytes (source + 2, dest + 2), 100.0%, 269.7%, 149.3%, 473.4%
1 bytes (source + 3, dest + 1), 100.0%, 91.0%, 55.9%, 100.8%
2 bytes (source + 3, dest + 1), 100.0%, 92.1%, 71.8%, 100.6%
4 bytes (source + 3, dest + 1), 100.0%, 84.1%, 75.9%, 100.9%
8 bytes (source + 3, dest + 1), 100.0%, 100.9%, 79.9%, 92.9%
16 bytes (source + 3, dest + 1), 100.0%, 134.2%, 84.2%, 113.4%
32 bytes (source + 3, dest + 1), 100.0%, 166.1%, 85.5%, 129.2%
64 bytes (source + 3, dest + 1), 100.0%, 220.9%, 86.5%, 133.5%
128 bytes (source + 3, dest + 1), 100.0%, 266.0%, 86.9%, 159.0%
256 bytes (source + 3, dest + 1), 100.0%, 297.8%, 87.2%, 227.6%
512 bytes (source + 3, dest + 1), 100.0%, 316.8%, 87.4%, 293.4%
1024 bytes (source + 3, dest + 1), 100.0%, 327.6%, 87.4%, 341.5%
2048 bytes (source + 3, dest + 1), 100.0%, 333.3%, 87.5%, 373.9%
4096 bytes (source + 3, dest + 1), 100.0%, 336.2%, 87.5%, 393.2%
8192 bytes (source + 3, dest + 1), 100.0%, 332.0%, 87.6%, 404.2%
16384 bytes (source + 3, dest + 1), 100.0%, 228.1%, 91.3%, 556.8%
32768 bytes (source + 3, dest + 1), 100.0%, 228.5%, 91.3%, 322.1%
65536 bytes (source + 3, dest + 1), 100.0%, 228.7%, 91.2%, 322.9%
Flatmush
« Reply #19 on: September 03, 2008, 10:06:19 AM »
I think I realize why I got a speedup with the invented types actually, because I did do a benchmark and the types did make it faster for some reason. I'm pretty sure it's because, when optimized, it does more copies per loop iteration, so the overhead of the branch is much less.
I.e. for the u256 it would do 8 word copies inside the loop on every iteration, so the branch overhead is much lower. It probably only takes effect when the data is in the cache though.
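As a rough illustration of that explanation, here is a minimal sketch of a plain word loop next to a loop that copies a made-up 32-byte struct per iteration. The u256 layout, the function names and the demo helper are assumptions for illustration, not Flatmush's actual code. Whether the compiler turns the struct assignment into eight straight word moves depends on the compiler and optimisation level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "invented" type: eight 32-bit words copied as one struct. */
typedef struct { uint32_t w[8]; } u256;

/* Plain word loop: one loop branch per 4 bytes copied. */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t words)
{
    while (words--)
        *dst++ = *src++;
}

/* Struct-assignment loop: one loop branch per 32 bytes, tail word by word. */
static void copy_u256(uint32_t *dst, const uint32_t *src, size_t words)
{
    u256 *d = (u256 *)dst;
    const u256 *s = (const u256 *)src;
    assert(dst != NULL && src != NULL);
    while (words >= 8) {
        *d++ = *s++;            /* eight words per iteration */
        words -= 8;
    }
    copy_words((uint32_t *)d, (const uint32_t *)s, words);
}

/* Usage example: returns 1 if a 19-word copy (2 blocks + 3 words) matches. */
static int demo_copy_u256(void)
{
    uint32_t src[19], dst[19];
    for (size_t i = 0; i < 19; i++) {
        src[i] = (uint32_t)(i * 7u + 1u);
        dst[i] = 0;
    }
    copy_u256(dst, src, 19);
    for (size_t i = 0; i < 19; i++)
        if (dst[i] != src[i])
            return 0;
    return 1;
}
```

The same effect can be had by unrolling the word loop by hand, which is what the `while (size>=16)` path in the VFPU-free part of Raphael's code does.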
« Last Edit: September 03, 2008, 10:15:19 AM by Flatmush »
Noware
« Reply #20 on: September 03, 2008, 02:26:15 PM »
Hi Raphael, I added your changes to my code, and got the following results:
memcpy, daniel, flatmush, raphael, libc
1 bytes (64 bytes aligned), 100.0%, 19.0%, 14.1%, 20.7%, 20.8%
2 bytes (64 bytes aligned), 100.0%, 25.1%, 20.8%, 27.2%, 27.2%
4 bytes (64 bytes aligned), 100.0%, 16.1%, 20.2%, 16.1%, 16.1%
8 bytes (64 bytes aligned), 100.0%, 20.6%, 37.6%, 32.8%, 26.1%
16 bytes (64 bytes aligned), 100.0%, 31.2%, 62.6%, 44.4%, 55.1%
32 bytes (64 bytes aligned), 100.0%, 59.3%, 99.3%, 68.9%, 81.2%
64 bytes (64 bytes aligned), 100.0%, 111.0%, 159.6%, 193.1%, 131.9%
128 bytes (64 bytes aligned), 100.0%, 145.9%, 182.9%, 305.8%, 153.9%
256 bytes (64 bytes aligned), 100.0%, 174.1%, 198.3%, 431.5%, 168.1%
512 bytes (64 bytes aligned), 100.0%, 192.8%, 206.9%, 547.8%, 176.4%
1024 bytes (64 bytes aligned), 100.0%, 204.0%, 211.6%, 624.2%, 180.9%
2048 bytes (64 bytes aligned), 100.0%, 210.0%, 214.0%, 681.2%, 183.2%
4096 bytes (64 bytes aligned), 100.0%, 213.2%, 215.2%, 713.9%, 184.4%
8192 bytes (64 bytes aligned), 100.0%, 216.2%, 217.2%, 733.7%, 186.2%
16384 bytes (64 bytes aligned), 100.0%, 137.6%, 137.8%, 1408.2%, 132.2%
32768 bytes (64 bytes aligned), 100.0%, 127.8%, 128.0%, 224.5%, 122.7%
65536 bytes (64 bytes aligned), 100.0%, 127.8%, 128.0%, 224.7%, 122.7%
Notes:
1.) I added one column, libc (the libc code you added to the source)
2.) Changed the alignment to 64 bytes
3.) I didn't manage to link with sceKernelIcacheInvalidateAll(), so I skipped it
4.) My code is in a C++ file (but I tried putting Flatmush's code in a separate C file before and got the same results)
5.) I added checks for 8, 16 and 32 byte alignment (base + 8, 16, 32), but got almost the same results
6.) The first column uses the default memcpy(), and I see that the compiler is doing some sort of inlining here when size < 64 bytes. See below for the memcpy results using a (volatile) size variable instead of a constant value:
memcpy, daniel, flatmush, raphael, libc
1 bytes (64 bytes aligned), 100.0%, 108.0%, 80.7%, 118.1%, 118.1%
2 bytes (64 bytes aligned), 100.0%, 107.1%, 88.3%, 114.6%, 115.2%
4 bytes (64 bytes aligned), 100.0%, 75.3%, 94.0%, 74.7%, 75.0%
8 bytes (64 bytes aligned), 100.0%, 60.7%, 109.6%, 96.4%, 76.6%
16 bytes (64 bytes aligned), 100.0%, 52.0%, 103.3%, 73.8%, 91.2%
32 bytes (64 bytes aligned), 100.0%, 68.3%, 113.7%, 79.1%, 92.6%
64 bytes (64 bytes aligned), 100.0%, 78.7%, 112.8%, 136.5%, 93.4%
128 bytes (64 bytes aligned), 100.0%, 89.4%, 111.9%, 186.3%, 94.2%
256 bytes (64 bytes aligned), 100.0%, 98.0%, 111.5%, 242.2%, 94.6%
512 bytes (64 bytes aligned), 100.0%, 103.8%, 111.3%, 293.9%, 94.9%
1024 bytes (64 bytes aligned), 100.0%, 107.2%, 111.2%, 328.4%, 95.1%
2048 bytes (64 bytes aligned), 100.0%, 109.1%, 111.1%, 354.3%, 95.2%
4096 bytes (64 bytes aligned), 100.0%, 110.1%, 111.1%, 367.5%, 95.2%
8192 bytes (64 bytes aligned), 100.0%, 110.4%, 110.9%, 381.3%, 95.3%
16384 bytes (64 bytes aligned), 100.0%, 103.9%, 104.0%, 1017.5%, 99.7%
32768 bytes (64 bytes aligned), 100.0%, 103.9%, 104.1%, 182.6%, 99.8%
65536 bytes (64 bytes aligned), 100.0%, 103.9%, 104.1%, 182.7%, 99.8%
[EDIT] code change:
#define COPY_LOOP(d, s, a) \
{ \
    volatile int bytes = 1; \
    COPY(d, s, bytes, a) bytes*=2; \
    COPY(d, s, bytes, a) bytes*=2; \
    etc. \
}
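The volatile trick in the macro above can be sketched in portable C. This is not Noware's benchmark. The names memcpy_runtime_size and run_size_sweep are made up here, and the sweep only checks that each copy is correct rather than timing it. The point is that a size read through a volatile object cannot be folded into a compile-time constant, so the compiler has no known length to expand small copies inline with.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy with a size the compiler must read at run time. */
static void *memcpy_runtime_size(void *dst, const void *src, size_t n)
{
    volatile size_t bytes = n;   /* volatile read: not a known constant */
    assert(dst != NULL && src != NULL);
    return memcpy(dst, src, bytes);
}

/* Doubling sweep 1, 2, 4, ... up to max_bytes, like the COPY_LOOP macro.
   Returns the number of sizes exercised, or 0 if any copy mismatched. */
static int run_size_sweep(size_t max_bytes)
{
    static unsigned char src[65536], dst[65536];
    volatile size_t bytes = 1;
    int runs = 0;

    assert(max_bytes <= sizeof src);
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i * 31u + 7u);

    while (bytes <= max_bytes) {
        size_t n = bytes;
        memset(dst, 0, n);
        memcpy_runtime_size(dst, src, n);
        if (memcmp(dst, src, n) != 0)
            return 0;
        bytes *= 2;
        runs++;
    }
    return runs;
}
```

Sweeping up to 65536 bytes exercises 17 sizes, matching the 17 rows per alignment case in the tables.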
Noware
« Last Edit: September 03, 2008, 02:45:53 PM by Noware »
Reporter - What do you think of western civilization? Gandhi - I think it would be a good idea!
Raphael
« Reply #21 on: September 03, 2008, 02:55:17 PM »
Quote from Noware:
Hi Raphael, I added your changes to my code, and got the following results
Notes:
1.) I added one column, libc (the libc code you added to the source)
2.) Changed the alignment to 64 bytes
3.) I didn't manage to link with sceKernelIcacheInvalidateAll(), so I skipped it
4.) My code is in a C++ file (but I tried putting Flatmush's code in a separate C file before and got the same results)
5.) I added checks for 8, 16 and 32 byte alignment (base + 8, 16, 32), but got almost the same results
6.) The first column uses the default memcpy(), and I see that the compiler is doing some sort of inlining here when size < 64 bytes. See below for the memcpy results using a (volatile) size variable instead of a constant value
[...]
Noware

1. That's the code I c&p'd from newlib.
3. That is declared in psputilsforkernel.h and linked in with -lpspkernel.
4. Should still work.
5. That's to be expected, as most implementations work fine for any multiple of 4 byte alignment (the inner loop consists of 32-bit copies). The other cases where bigger alignment is of benefit (in my implementation) only kick in for bigger copies, where the overhead of checking alignment/realigning is outweighed by the memory instruction latencies.
6. Yep, that's why I chose to create a local copy of newlib's memcpy and declare it with the noinline attribute. Working with a (volatile) size variable seems to give more reliable results too (they look comparable to my last results). I still can't explain the significant 4/8 byte copy difference between the memcpy and libc columns.
Maybe disabling interrupts might give better results for you too:
s32 intc = pspSdkDisableInterrupts();\
u64 time = GetCurrentTick(); \
int j;\
for (j=0; j<1000; ++j) \
    memcpy_libc(d, s, n); \
gcc_elapsed = (int)(GetCurrentTick()-time); \
pspSdkEnableInterrupts(intc);\
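For readers without a PSP toolchain, the measurement idea in point 6 can be sketched on a host machine. Everything here is illustrative: memcpy_local is a stand-in byte copy rather than newlib's memcpy, clock() replaces the PSP tick counter, there is no interrupt masking, and relative_speed encodes the assumption that the percentages in the tables are baseline time divided by measured time (so values above 100% mean faster than the baseline). The noinline attribute is the GCC one.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Local byte copy kept out of line so every call is a real call. */
static NOINLINE void *memcpy_local(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Time `iterations` calls of fn on the same buffers. */
static clock_t time_copies(copy_fn fn, void *d, const void *s,
                           size_t n, int iterations)
{
    clock_t start;
    assert(iterations > 0);
    start = clock();
    for (int j = 0; j < iterations; ++j)
        fn(d, s, n);
    return clock() - start;
}

/* Assumed table metric: speed relative to the baseline, in percent. */
static double relative_speed(double baseline_ticks, double ticks)
{
    assert(ticks > 0.0);
    return 100.0 * baseline_ticks / ticks;
}
```

On real hardware the loop would sit between pspSdkDisableInterrupts() and pspSdkEnableInterrupts(), as in the snippet above, to keep other activity from skewing short runs.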
Noware
« Reply #22 on: September 03, 2008, 03:06:56 PM »
Hi Raphael,
I added en/disable interrupts, noinline attribute, etc. I will try -lpspkernel later
Noware
Noware
« Reply #23 on: September 04, 2008, 09:06:32 AM »
Hi Raphael,
I think we can say your implementation is the fastest overall, and even in the cases where it is slower than daniel's or flatmush's code it's mostly faster than the default memcpy, so I think I will use it to override the default memcpy.
thx, Noware
|
Raphael
« Reply #24 on: December 12, 2009, 02:10:42 PM » |
After stumbling upon this old thread, I noticed that I never actually posted the updated and final code of my memcpy function:

void* memcpy_vfpu( void* dst, void* src, unsigned int size )
{
    u8* src8 = (u8*)src;
    u8* dst8 = (u8*)dst;

    // < 8 isn't worth trying any optimisations...
    if (size<8) goto bytecopy;

    // < 64 means we don't gain anything from using vfpu...
    if (size<64)
    {
        // Align dst on 4 bytes or just resume if already done
        while (((((u32)dst8) & 0x3)!=0) && size) {
            *dst8++ = *src8++;
            size--;
        }
        if (size<4) goto bytecopy;

        // We are dst aligned now and >= 4 bytes to copy
        u32* src32 = (u32*)src8;
        u32* dst32 = (u32*)dst8;
        switch(((u32)src8)&0x3)
        {
            case 0:
                while (size&0xC) {
                    *dst32++ = *src32++;
                    size -= 4;
                }
                if (size==0) return (dst);      // fast out
                while (size>=16) {
                    *dst32++ = *src32++;
                    *dst32++ = *src32++;
                    *dst32++ = *src32++;
                    *dst32++ = *src32++;
                    size -= 16;
                }
                if (size==0) return (dst);      // fast out
                src8 = (u8*)src32;
                dst8 = (u8*)dst32;
                break;
            default:
                {
                    register u32 a, b, c, d;
                    while (size>=4) {
                        a = *src8++;
                        b = *src8++;
                        c = *src8++;
                        d = *src8++;
                        *dst32++ = (d << 24) | (c << 16) | (b << 8) | a;
                        size -= 4;
                    }
                    if (size==0) return (dst);  // fast out
                    dst8 = (u8*)dst32;
                }
                break;
        }
        goto bytecopy;
    }

    // Align dst on 16 bytes to gain from vfpu aligned stores
    while ((((u32)dst8) & 0xF)!=0 && size) {
        *dst8++ = *src8++;
        size--;
    }

    // We use uncached dst to use VFPU writeback and free cpu cache for src only
    u8* udst8 = (u8*)((u32)dst8 | 0x40000000);
    // We need the 64 byte aligned address to make sure the dcache is invalidated correctly
    u8* dst64a = (u8*)((u32)dst8&~0x3F);
    // Invalidate the first line that matches up to the dst start
    if (size>=64)
        asm(".set push\n"               // save assembler option
            ".set noreorder\n"          // suppress reordering
            "cache 0x1B, 0(%0)\n"
            "addiu %0, %0, 64\n"
            "sync\n"
            ".set pop\n"
            :"+r"(dst64a));

    switch(((u32)src8&0xF))
    {
        // src aligned on 16 bytes too? nice!
        case 0:
            while (size>=64) {
                asm(".set push\n"               // save assembler option
                    ".set noreorder\n"          // suppress reordering
                    "cache 0x1B, 0(%2)\n"       // Dcache writeback invalidate
                    "lv.q c000, 0(%1)\n"
                    "lv.q c010, 16(%1)\n"
                    "lv.q c020, 32(%1)\n"
                    "lv.q c030, 48(%1)\n"
                    "sync\n"                    // Wait for allegrex writeback
                    "sv.q c000, 0(%0), wb\n"
                    "sv.q c010, 16(%0), wb\n"
                    "sv.q c020, 32(%0), wb\n"
                    "sv.q c030, 48(%0), wb\n"
                    // Lots of variable updates... but get hidden in sv.q latency anyway
                    "addiu %3, %3, -64\n"
                    "addiu %2, %2, 64\n"
                    "addiu %1, %1, 64\n"
                    "addiu %0, %0, 64\n"
                    ".set pop\n"                // restore assembler option
                    :"+r"(udst8),"+r"(src8),"+r"(dst64a),"+r"(size)
                    :
                    :"memory"
                    );
            }
            if (size>16) {
                // Invalidate the last cache line where the max remaining 63 bytes are
                asm(".set push\n"               // save assembler option
                    ".set noreorder\n"          // suppress reordering
                    "cache 0x1B, 0(%0)\n"
                    "sync\n"
                    ".set pop\n"                // restore assembler option
                    ::"r"(dst64a));
                while (size>=16) {
                    asm(".set push\n"               // save assembler option
                        ".set noreorder\n"          // suppress reordering
                        "lv.q c000, 0(%1)\n"
                        "sv.q c000, 0(%0), wb\n"
                        // Lots of variable updates... but get hidden in sv.q latency anyway
                        "addiu %2, %2, -16\n"
                        "addiu %1, %1, 16\n"
                        "addiu %0, %0, 16\n"
                        ".set pop\n"                // restore assembler option
                        :"+r"(udst8),"+r"(src8),"+r"(size)
                        :
                        :"memory"
                        );
                }
            }
            asm(".set push\n"               // save assembler option
                ".set noreorder\n"          // suppress reordering
                "vflush\n"                  // Flush VFPU writeback cache
                ".set pop\n"                // restore assembler option
                );
            dst8 = (u8*)((u32)udst8 & ~0x40000000);
            break;

        // src is only qword unaligned but word aligned? We can at least use ulv.q
        case 4:
        case 8:
        case 12:
            while (size>=64) {
                asm(".set push\n"               // save assembler option
                    ".set noreorder\n"          // suppress reordering
                    "cache 0x1B, 0(%2)\n"       // Dcache writeback invalidate
                    "ulv.q c000, 0(%1)\n"
                    "ulv.q c010, 16(%1)\n"
                    "ulv.q c020, 32(%1)\n"
                    "ulv.q c030, 48(%1)\n"
                    "sync\n"                    // Wait for allegrex writeback
                    "sv.q c000, 0(%0), wb\n"
                    "sv.q c010, 16(%0), wb\n"
                    "sv.q c020, 32(%0), wb\n"
                    "sv.q c030, 48(%0), wb\n"
                    // Lots of variable updates... but get hidden in sv.q latency anyway
                    "addiu %3, %3, -64\n"
                    "addiu %2, %2, 64\n"
                    "addiu %1, %1, 64\n"
                    "addiu %0, %0, 64\n"
                    ".set pop\n"                // restore assembler option
                    :"+r"(udst8),"+r"(src8),"+r"(dst64a),"+r"(size)
                    :
                    :"memory"
                    );
            }
            if (size>16)
                // Invalidate the last cache line where the max remaining 63 bytes are
                asm(".set push\n"               // save assembler option
                    ".set noreorder\n"          // suppress reordering
                    "cache 0x1B, 0(%0)\n"
                    "sync\n"
                    ".set pop\n"                // restore assembler option
                    ::"r"(dst64a));
            while (size>=16) {
                asm(".set push\n"               // save assembler option
                    ".set noreorder\n"          // suppress reordering
                    "ulv.q c000, 0(%1)\n"
                    "sv.q c000, 0(%0), wb\n"
                    // Lots of variable updates... but get hidden in sv.q latency anyway
                    "addiu %2, %2, -16\n"
                    "addiu %1, %1, 16\n"
                    "addiu %0, %0, 16\n"
                    ".set pop\n"                // restore assembler option
                    :"+r"(udst8),"+r"(src8),"+r"(size)
                    :
                    :"memory"
                    );
            }
            asm(".set push\n"               // save assembler option
                ".set noreorder\n"          // suppress reordering
                "vflush\n"                  // Flush VFPU writeback cache
                ".set pop\n"                // restore assembler option
                );
            dst8 = (u8*)((u32)udst8 & ~0x40000000);
            break;

        // src not aligned? too bad... have to use unaligned reads
        default:
            while (size>=64) {
                asm(".set push\n"               // save assembler option
                    ".set noreorder\n"          // suppress reordering
                    "cache 0x1B, 0(%2)\n"

                    "lwr $8, 0(%1)\n"
                    "lwl $8, 3(%1)\n"           // $8 = *(s + 0)
                    "lwr $9, 4(%1)\n"
                    "lwl $9, 7(%1)\n"           // $9 = *(s + 4)
                    "lwr $10, 8(%1)\n"
                    "lwl $10, 11(%1)\n"         // $10 = *(s + 8)
                    "lwr $11, 12(%1)\n"
                    "lwl $11, 15(%1)\n"         // $11 = *(s + 12)
                    "mtv $8, s000\n"
                    "mtv $9, s001\n"
                    "mtv $10, s002\n"
                    "mtv $11, s003\n"

                    "lwr $8, 16(%1)\n"
                    "lwl $8, 19(%1)\n"
                    "lwr $9, 20(%1)\n"
                    "lwl $9, 23(%1)\n"
                    "lwr $10, 24(%1)\n"
                    "lwl $10, 27(%1)\n"
                    "lwr $11, 28(%1)\n"
                    "lwl $11, 31(%1)\n"
                    "mtv $8, s010\n"
                    "mtv $9, s011\n"
                    "mtv $10, s012\n"
                    "mtv $11, s013\n"

                    "lwr $8, 32(%1)\n"
                    "lwl $8, 35(%1)\n"
                    "lwr $9, 36(%1)\n"
                    "lwl $9, 39(%1)\n"
                    "lwr $10, 40(%1)\n"
                    "lwl $10, 43(%1)\n"
                    "lwr $11, 44(%1)\n"
                    "lwl $11, 47(%1)\n"
                    "mtv $8, s020\n"
                    "mtv $9, s021\n"
                    "mtv $10, s022\n"
                    "mtv $11, s023\n"

                    "lwr $8, 48(%1)\n"
                    "lwl $8, 51(%1)\n"
                    "lwr $9, 52(%1)\n"
                    "lwl $9, 55(%1)\n"
                    "lwr $10, 56(%1)\n"
                    "lwl $10, 59(%1)\n"
                    "lwr $11, 60(%1)\n"
                    "lwl $11, 63(%1)\n"
                    "mtv $8, s030\n"
                    "mtv $9, s031\n"
                    "mtv $10, s032\n"
                    "mtv $11, s033\n"

                    "sync\n"
                    "sv.q c000, 0(%0), wb\n"
                    "sv.q c010, 16(%0), wb\n"
                    "sv.q c020, 32(%0), wb\n"
                    "sv.q c030, 48(%0), wb\n"
                    // Lots of variable updates... but get hidden in sv.q latency anyway
                    "addiu %3, %3, -64\n"
                    "addiu %2, %2, 64\n"
                    "addiu %1, %1, 64\n"
                    "addiu %0, %0, 64\n"
                    ".set pop\n"                // restore assembler option
                    :"+r"(udst8),"+r"(src8),"+r"(dst64a),"+r"(size)
                    :
                    :"$8","$9","$10","$11","memory"
                    );
            }
            if (size>16)
                // Invalidate the last cache line where the max remaining 63 bytes are
                asm(".set push\n"               // save assembler option
                    ".set noreorder\n"          // suppress reordering
                    "cache 0x1B, 0(%0)\n"
                    "sync\n"
                    ".set pop\n"                // restore assembler option
                    ::"r"(dst64a));
            while (size>=16) {
                asm(".set push\n"               // save assembler option
                    ".set noreorder\n"          // suppress reordering
                    "lwr $8, 0(%1)\n"
                    "lwl $8, 3(%1)\n"           // $8 = *(s + 0)
                    "lwr $9, 4(%1)\n"
                    "lwl $9, 7(%1)\n"           // $9 = *(s + 4)
                    "lwr $10, 8(%1)\n"
                    "lwl $10, 11(%1)\n"         // $10 = *(s + 8)
                    "lwr $11, 12(%1)\n"
                    "lwl $11, 15(%1)\n"         // $11 = *(s + 12)
                    "mtv $8, s000\n"
                    "mtv $9, s001\n"
                    "mtv $10, s002\n"
                    "mtv $11, s003\n"

                    "sv.q c000, 0(%0), wb\n"
                    // Lots of variable updates... but get hidden in sv.q latency anyway
                    "addiu %2, %2, -16\n"
                    "addiu %1, %1, 16\n"
                    "addiu %0, %0, 16\n"
                    ".set pop\n"                // restore assembler option
                    :"+r"(udst8),"+r"(src8),"+r"(size)
                    :
                    :"$8","$9","$10","$11","memory"
                    );
            }
            asm(".set push\n"               // save assembler option
                ".set noreorder\n"          // suppress reordering
                "vflush\n"                  // Flush VFPU writeback cache
                ".set pop\n"                // restore assembler option
                );
            dst8 = (u8*)((u32)udst8 & ~0x40000000);
            break;
    }

bytecopy:
    // Copy the remains byte per byte...
    while (size--) {
        *dst8++ = *src8++;
    }

    return (dst);
}
Who said only short code is fast? 
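For readers without a PSP toolchain, the small-copy branch (size < 64) of the function above can be approximated in portable C. This is only a sketch under assumptions, not the posted code: `small_copy` and `store_u32` are made-up names, it always uses the byte-assembling path (the posted version has an extra fast path for a word-aligned source), and it assumes a little-endian host, as the PSP is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes one 32-bit word to a 4-byte-aligned destination. The forum code
   stores through a u32* directly; memcpy is used here only to stay within
   the C aliasing rules. */
static void store_u32(unsigned char *dst, uint32_t w)
{
    memcpy(dst, &w, sizeof w);
}

/* Hypothetical helper mirroring the size < 64 branch (little-endian only). */
void *small_copy(void *dst, const void *src, size_t size)
{
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* 1. Copy single bytes until the destination is 4-byte aligned. */
    while (((uintptr_t)d & 0x3) != 0 && size) {
        *d++ = *s++;
        size--;
    }

    /* 2. Build each destination word from four source bytes, so the source
          may have any alignment while every store is an aligned word. */
    while (size >= 4) {
        uint32_t b0 = s[0], b1 = s[1], b2 = s[2], b3 = s[3];
        store_u32(d, (b3 << 24) | (b2 << 16) | (b1 << 8) | b0);
        s += 4;
        d += 4;
        size -= 4;
    }

    /* 3. Copy the remaining 0..3 bytes one at a time. */
    while (size--)
        *d++ = *s++;

    return dst;
}
```

The design point is that only the destination gets aligned; the source is read bytewise, which trades a few extra loads for never issuing an unaligned word access.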
|
Noware
« Reply #25 on: December 13, 2009, 01:58:58 AM » |
Hi Raphael,
In the end I never used your memcpy since it gave me some errors, but thanks, I will try this version of your memcpy now.
[EDIT] Yes, now it works without crashing.
Noware
« Last Edit: December 13, 2009, 02:27:36 AM by Noware »
|
Raphael
« Reply #26 on: December 13, 2009, 05:48:44 AM » |
Why didn't you hit me hard on my head and tell me to post the final version then? Well, at least it's there now.
|
Bluddy
Newbie
Karma: +0/-0
Offline
Posts: 12
« Reply #27 on: June 17, 2010, 06:11:53 AM » |
Sorry to bring up this old thread.
Raphael (or anyone else), does this mean that the PSP has another cache just for the VFPU? That's what I'm getting from what Raphael said here.
|
Raphael
« Reply #28 on: November 13, 2010, 06:46:52 PM » |
If I had come by earlier again, I'd have answered earlier.. :/
The thing is, the VFPU at least has a writeback cache, which isn't used to its full potential unless the asm code is specially written for it. I'm not sure if the VFPU actually has a full cache of its own, but I doubt it, seeing how normal VFPU ops make use of the CPU cache unless that is explicitly bypassed with uncached memory addresses.
The difference is that the WB cache is much simpler in design (and hence transistor cost), because it only stores (caches) X write operations before sending them to the memory interface.
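To picture what "only stores X write operations before sending them on" means, here is a toy software model of such a write buffer. It is purely illustrative and makes no claim about the real hardware: the names `WriteBuffer`, `wb_store` and `wb_flush` are invented, and the depth `WB_SLOTS` is an arbitrary assumption since the thread does not give the actual number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WB_SLOTS 4   /* assumed depth; the real "X" is not stated in the thread */

/* Toy write buffer: remembers pending stores, holds no readable data. */
typedef struct {
    uint32_t *addr[WB_SLOTS];
    uint32_t  value[WB_SLOTS];
    size_t    used;
} WriteBuffer;

/* Drains all pending stores to memory, oldest first. */
void wb_flush(WriteBuffer *wb)
{
    assert(wb != NULL);
    for (size_t i = 0; i < wb->used; ++i)
        *wb->addr[i] = wb->value[i];
    wb->used = 0;
}

/* Queues one store; if the buffer is already full it is drained first. */
void wb_store(WriteBuffer *wb, uint32_t *addr, uint32_t value)
{
    assert(wb != NULL && addr != NULL);
    if (wb->used == WB_SLOTS)
        wb_flush(wb);
    wb->addr[wb->used] = addr;
    wb->value[wb->used] = value;
    wb->used++;
}
```

In this model a store only becomes visible in memory once the buffer drains, which is loosely why the posted function ends each VFPU path with an explicit flush before falling through to the byte copy.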