PSP-Programming.com Forums
Topic: fast memcpy
Flatmush
« Reply #15 on: September 03, 2008, 02:25:21 AM »

Wow, I can't even think how you'd be able to improve on yours, Raph. Nice job.

After reading up on MIPS assembly I kind of realised that half of my cases were absolute rubbish anyway, as it can't read or write more than a word at a time.

Firmware History: 2.60 -> 2.71 -> 1.50 -> 3.03oe-c

Hehe I'm a "Hero Member" because I bought posts back when they were in the shop.

Creator of FlatEditPSP, funcLib and flAstro


Raphael
« Reply #16 on: September 03, 2008, 04:18:43 AM »

Well, as I said, the weak spot is source-unaligned copies in the 8-128 byte range, so adding special cases for that would help get closer to Daniel's implementation in that range and make it pretty much the ultimate version ;)
There's still that while (size>=16) loop in the VFPU code that does a writeback invalidate on a whole cache line (64 bytes) on each iteration, which is more than necessary, as it would only need it every fourth iteration. I didn't make the effort to implement it differently and benchmark it, just as I didn't really benchmark the benefit of that loop over a straight C version.

And yeah, I already wondered why you'd create *fake* types like u128 and u256 as u32 arrays and try to copy those, seeing how the PSP doesn't even support u64 natively. With good compiler optimization it should come out pretty much the same as the normal *dst++ = *src++ loops though, so at least it makes the code a bit more straightforward.
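
The "every fourth iteration" point can be sketched in plain C. This is not the VFPU code from the thread: `cache_line_op_fn` is a hypothetical hook standing in for the real cache instruction, `count_line_op` is a made-up stand-in that only counts calls, and the 16-byte memcpy stands in for a quadword load/store. It also assumes dst starts on a 64-byte boundary. All it shows is that one cache-line operation per 64 bytes is enough for a loop that moves 16 bytes per iteration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QWORD_SIZE      16u  /* bytes moved per loop iteration */
#define CACHE_LINE_SIZE 64u  /* cache line size quoted in the post */

/* Hypothetical hook standing in for the real cache instruction. */
typedef void (*cache_line_op_fn)(const void *line, void *ctx);

/* Stand-in hook: just counts how often it was called (ctx is a size_t*). */
void count_line_op(const void *line, void *ctx)
{
    (void)line;
    ++*(size_t *)ctx;
}

/*
 * Copies as many whole 16-byte chunks as fit into size and returns the
 * number of bytes copied. The hook runs only when the destination enters
 * a new 64-byte line, i.e. once every fourth iteration.
 * Assumption: dst is 64-byte aligned on entry.
 */
size_t copy_qwords_line_op(void *dst, const void *src, size_t size,
                           cache_line_op_fn op, void *ctx)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    size_t copied = 0;

    while (size - copied >= QWORD_SIZE) {
        if (((uintptr_t)(d + copied) % CACHE_LINE_SIZE) == 0)
            op(d + copied, ctx);          /* new cache line reached */
        memcpy(d + copied, s + copied, QWORD_SIZE);
        copied += QWORD_SIZE;
    }
    return copied;
}
```

Whether skipping three out of four cache operations is actually measurable on the hardware is exactly what wasn't benchmarked above, so treat this as an illustration of the idea only.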
« Last Edit: September 03, 2008, 04:33:11 AM by Raphael »

Don't push the river, it flows.
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki
http://www.homebrew-illuminati.co.uk - serious homebrew development for all platforms
Alexander Berl
"A good mod is a combination playground monitor, priest, big brother/sister, psychiatrist, professor and more."
Raphael
« Reply #17 on: September 03, 2008, 07:10:42 AM »

I just found that my previous version of vfpu_memcpy was faulty (at least for copies where dst is not 64-byte aligned). The reason was that I was aligning dst to 16 bytes (for the sv.q stores) but at the same time used that address for the cache invalidate, which would then only invalidate the line at the next lower 64-byte aligned address, leaving out bytes at the end, which would then not get written correctly.
The solution is to do an additional line invalidate before the actual copy loop on the next lower 64-byte aligned address, and then keep invalidating at the next higher 64-byte aligned addresses inside the loop. This leads to one additional unneeded invalidate when dst is already 64-byte aligned, but the performance drop is not really noticeable. Also, I remembered why "ulv.q was bugged with sv.q, wb": it's not really bugged, it's just that ulv.q can only read qword-unaligned addresses that are still word aligned.
So I put in another special case for when src is only word aligned and not qword aligned and used ulv.q there, which works out pretty nicely.
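
To make the address arithmetic concrete, here is a small sketch in plain C. None of these names come from the actual vfpu_memcpy: `line_base`, `lines_touched`, `classify_src` and the `src_class` enum are made up for illustration. It only shows why a dst that is 16-byte but not 64-byte aligned spans one more cache line than the aligned case, and how src could be sorted into "qword aligned", "word aligned only" (the case described for ulv.q) and "everything else".

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* cache line size quoted in the thread */

/* Hypothetical classification of the source address. */
typedef enum {
    SRC_QWORD_ALIGNED,  /* multiple of 16 */
    SRC_WORD_ALIGNED,   /* multiple of 4 but not of 16 */
    SRC_BYTE_ALIGNED    /* anything else */
} src_class;

/* Next lower 64-byte aligned address (addr itself if already aligned). */
uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1u);
}

/* Number of 64-byte lines covered by the range [dst, dst + size). */
size_t lines_touched(uintptr_t dst, size_t size)
{
    if (size == 0)
        return 0;
    return (size_t)((line_base(dst + size - 1u) - line_base(dst))
                    / CACHE_LINE_SIZE) + 1u;
}

/* Sort a source address into one of the three alignment classes. */
src_class classify_src(uintptr_t src)
{
    if ((src & 15u) == 0)
        return SRC_QWORD_ALIGNED;
    if ((src & 3u) == 0)
        return SRC_WORD_ALIGNED;
    return SRC_BYTE_ALIGNED;
}
```

For example, a 64-byte copy to 0x1000 touches one line, while the same copy to 0x1010 touches two (0x1000 and 0x1040), which is why a single invalidate derived from the 16-byte aligned dst is not enough.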

Here's the updated test run results:
Code:
     1 bytes (     4 bytes aligned),    100.0%,     91.7%,     69.8%,    100.8%
     2 bytes (     4 bytes aligned),    100.0%,     92.6%,     76.5%,    100.6%
     4 bytes (     4 bytes aligned),    100.0%,    135.3%,    111.3%,     69.5%
     8 bytes (     4 bytes aligned),    100.0%,    125.5%,    170.2%,    150.6%
    16 bytes (     4 bytes aligned),    100.0%,     63.5%,    102.2%,     84.9%
    32 bytes (     4 bytes aligned),    100.0%,     87.6%,    116.0%,     88.8%
    64 bytes (     4 bytes aligned),    100.0%,     90.2%,    116.5%,    149.6%
   128 bytes (     4 bytes aligned),    100.0%,    108.6%,    121.7%,    210.3%
   256 bytes (     4 bytes aligned),    100.0%,    102.9%,    119.1%,    269.0%
   512 bytes (     4 bytes aligned),    100.0%,    109.2%,    118.0%,    319.2%
  1024 bytes (     4 bytes aligned),    100.0%,    112.3%,    117.4%,    354.1%
  2048 bytes (     4 bytes aligned),    100.0%,    117.4%,    116.4%,    380.9%
  4096 bytes (     4 bytes aligned),    100.0%,    116.0%,    117.2%,    391.5%
  8192 bytes (     4 bytes aligned),    100.0%,    117.0%,    116.8%,    399.4%
 16384 bytes (     4 bytes aligned),    100.0%,    103.9%,    103.9%,   1053.8%
 32768 bytes (     4 bytes aligned),    100.0%,    103.8%,    103.9%,    182.3%
 65536 bytes (     4 bytes aligned),    100.0%,    103.8%,    103.9%,    182.4%
     1 bytes (          source + 1),    100.0%,     72.9%,     62.0%,    100.8%
     2 bytes (          source + 1),    100.0%,     92.1%,     72.1%,    100.6%
     4 bytes (          source + 1),    100.0%,     83.5%,     67.4%,    100.9%
     8 bytes (          source + 1),    100.0%,     70.4%,     71.7%,     84.5%
    16 bytes (          source + 1),    100.0%,    118.2%,     84.2%,    106.4%
    32 bytes (          source + 1),    100.0%,    152.0%,     85.6%,    124.0%
    64 bytes (          source + 1),    100.0%,    207.8%,     80.6%,    130.5%
   128 bytes (          source + 1),    100.0%,    253.1%,     87.9%,    158.2%
   256 bytes (          source + 1),    100.0%,    292.3%,     85.8%,    226.1%
   512 bytes (          source + 1),    100.0%,    317.2%,     88.0%,    297.1%
  1024 bytes (          source + 1),    100.0%,    328.8%,     87.6%,    342.4%
  2048 bytes (          source + 1),    100.0%,    332.9%,     87.6%,    373.6%
  4096 bytes (          source + 1),    100.0%,    336.2%,     87.6%,    393.3%
  8192 bytes (          source + 1),    100.0%,    332.5%,     87.6%,    405.3%
 16384 bytes (          source + 1),    100.0%,    228.7%,     90.8%,    556.3%
 32768 bytes (          source + 1),    100.0%,    229.0%,     90.7%,    323.0%
 65536 bytes (          source + 1),    100.0%,    229.2%,     90.7%,    323.7%
     1 bytes (          source + 2),    100.0%,     91.0%,     69.5%,    100.0%
     2 bytes (          source + 2),    100.0%,     93.2%,     69.5%,    100.6%
     4 bytes (          source + 2),    100.0%,     83.8%,     91.2%,    100.4%
     8 bytes (          source + 2),    100.0%,     89.6%,    110.2%,     89.0%
    16 bytes (          source + 2),    100.0%,    122.0%,    132.2%,    109.6%
    32 bytes (          source + 2),    100.0%,    156.7%,    148.2%,    133.2%
    64 bytes (          source + 2),    100.0%,    214.7%,    148.3%,    133.7%
   128 bytes (          source + 2),    100.0%,    258.6%,    156.6%,    157.0%
   256 bytes (          source + 2),    100.0%,    298.8%,    174.3%,    227.0%
   512 bytes (          source + 2),    100.0%,    318.1%,    172.7%,    297.1%
  1024 bytes (          source + 2),    100.0%,    325.6%,    174.5%,    339.9%
  2048 bytes (          source + 2),    100.0%,    333.1%,    174.8%,    372.6%
  4096 bytes (          source + 2),    100.0%,    335.4%,    175.1%,    393.6%
  8192 bytes (          source + 2),    100.0%,    333.3%,    174.1%,    405.2%
 16384 bytes (          source + 2),    100.0%,    227.7%,    144.1%,    552.9%
 32768 bytes (          source + 2),    100.0%,    228.1%,    144.1%,    321.6%
 65536 bytes (          source + 2),    100.0%,    228.3%,    144.1%,    322.5%
     1 bytes (          source + 3),    100.0%,     91.7%,     55.9%,    100.8%
     2 bytes (          source + 3),    100.0%,     92.6%,     72.1%,    100.6%
     4 bytes (          source + 3),    100.0%,     83.8%,     75.9%,    100.4%
     8 bytes (          source + 3),    100.0%,     95.9%,     79.5%,     92.9%
    16 bytes (          source + 3),    100.0%,    129.4%,     84.1%,    113.4%
    32 bytes (          source + 3),    100.0%,    161.8%,     85.5%,    124.0%
    64 bytes (          source + 3),    100.0%,    221.0%,     88.0%,    135.7%
   128 bytes (          source + 3),    100.0%,    267.7%,     88.4%,    161.6%
   256 bytes (          source + 3),    100.0%,    291.8%,     85.7%,    225.1%
   512 bytes (          source + 3),    100.0%,    312.7%,     87.9%,    297.1%
  1024 bytes (          source + 3),    100.0%,    325.8%,     87.6%,    340.1%
  2048 bytes (          source + 3),    100.0%,    333.5%,     87.6%,    374.2%
  4096 bytes (          source + 3),    100.0%,    336.6%,     87.6%,    393.8%
  8192 bytes (          source + 3),    100.0%,    332.8%,     87.7%,    405.2%
 16384 bytes (          source + 3),    100.0%,    224.5%,     90.9%,    545.1%
 32768 bytes (          source + 3),    100.0%,    224.8%,     90.9%,    316.9%
 65536 bytes (          source + 3),    100.0%,    224.9%,     90.9%,    317.6%
     1 bytes (            dest + 1),    100.0%,     91.0%,     69.5%,    100.8%
     2 bytes (            dest + 1),    100.0%,     92.6%,     62.2%,    100.6%
     4 bytes (            dest + 1),    100.0%,    119.5%,     76.1%,    143.8%
     8 bytes (            dest + 1),    100.0%,     98.6%,     64.0%,    113.1%
    16 bytes (            dest + 1),    100.0%,    139.3%,     75.0%,    137.0%
    32 bytes (            dest + 1),    100.0%,    204.6%,     76.0%,    140.0%
    64 bytes (            dest + 1),    100.0%,    256.1%,     82.5%,    268.1%
   128 bytes (            dest + 1),    100.0%,    287.8%,     80.9%,    321.1%
   256 bytes (            dest + 1),    100.0%,    317.3%,     87.0%,    368.6%
   512 bytes (            dest + 1),    100.0%,    328.4%,     87.2%,    389.9%
  1024 bytes (            dest + 1),    100.0%,    334.7%,     87.2%,    402.5%
  2048 bytes (            dest + 1),    100.0%,    335.8%,     87.5%,    407.0%
  4096 bytes (            dest + 1),    100.0%,    336.9%,     87.5%,    412.2%
  8192 bytes (            dest + 1),    100.0%,    333.6%,     87.6%,    416.8%
 16384 bytes (            dest + 1),    100.0%,    215.6%,     90.0%,    529.1%
 32768 bytes (            dest + 1),    100.0%,    215.8%,     90.0%,    326.4%
 65536 bytes (            dest + 1),    100.0%,    216.0%,     90.0%,    326.7%
     1 bytes (            dest + 2),    100.0%,     94.4%,     48.4%,    104.6%
     2 bytes (            dest + 2),    100.0%,     94.9%,     78.4%,    103.1%
     4 bytes (            dest + 2),    100.0%,     85.6%,     81.4%,    103.1%
     8 bytes (            dest + 2),    100.0%,    104.7%,     90.9%,    114.4%
    16 bytes (            dest + 2),    100.0%,    138.0%,    109.5%,    130.2%
    32 bytes (            dest + 2),    100.0%,    212.2%,    127.5%,    140.3%
    64 bytes (            dest + 2),    100.0%,    257.6%,    146.3%,    263.7%
   128 bytes (            dest + 2),    100.0%,    291.9%,    150.9%,    313.2%
   256 bytes (            dest + 2),    100.0%,    319.8%,    168.4%,    368.7%
   512 bytes (            dest + 2),    100.0%,    329.8%,    170.6%,    389.9%
  1024 bytes (            dest + 2),    100.0%,    331.2%,    173.2%,    402.5%
  2048 bytes (            dest + 2),    100.0%,    336.2%,    174.3%,    407.1%
  4096 bytes (            dest + 2),    100.0%,    338.0%,    174.4%,    411.3%
  8192 bytes (            dest + 2),    100.0%,    335.9%,    174.4%,    415.5%
 16384 bytes (            dest + 2),    100.0%,    215.6%,    149.0%,    533.0%
 32768 bytes (            dest + 2),    100.0%,    215.9%,    149.2%,    326.3%
 65536 bytes (            dest + 2),    100.0%,    216.0%,    149.3%,    326.6%
     1 bytes (            dest + 3),    100.0%,    106.9%,     81.5%,    117.6%
     2 bytes (            dest + 3),    100.0%,    105.1%,     71.0%,    114.8%
     4 bytes (            dest + 3),    100.0%,     91.9%,     67.2%,    111.1%
     8 bytes (            dest + 3),    100.0%,    107.4%,     58.3%,    120.1%
    16 bytes (            dest + 3),    100.0%,    116.8%,     67.3%,    134.2%
    32 bytes (            dest + 3),    100.0%,    211.8%,     77.6%,    142.7%
    64 bytes (            dest + 3),    100.0%,    257.3%,     82.0%,    266.5%
   128 bytes (            dest + 3),    100.0%,    266.2%,     83.9%,    323.4%
   256 bytes (            dest + 3),    100.0%,    309.5%,     85.2%,    364.1%
   512 bytes (            dest + 3),    100.0%,    322.0%,     87.3%,    390.5%
  1024 bytes (            dest + 3),    100.0%,    335.1%,     87.3%,    401.9%
  2048 bytes (            dest + 3),    100.0%,    336.0%,     87.5%,    407.1%
  4096 bytes (            dest + 3),    100.0%,    337.3%,     87.5%,    411.4%
  8192 bytes (            dest + 3),    100.0%,    337.6%,     87.8%,    416.8%
 16384 bytes (            dest + 3),    100.0%,    215.8%,     90.2%,    532.6%
 32768 bytes (            dest + 3),    100.0%,    215.9%,     90.2%,    326.3%
 65536 bytes (            dest + 3),    100.0%,    216.0%,     90.2%,    326.6%
     1 bytes (source + 1, dest + 3),    100.0%,    108.3%,     73.7%,    120.8%
     2 bytes (source + 1, dest + 3),    100.0%,    105.6%,     82.7%,    114.7%
     4 bytes (source + 1, dest + 3),    100.0%,     92.3%,     83.6%,     95.1%
     8 bytes (source + 1, dest + 3),    100.0%,     93.3%,     84.9%,     90.6%
    16 bytes (source + 1, dest + 3),    100.0%,    124.5%,     71.1%,    110.0%
    32 bytes (source + 1, dest + 3),    100.0%,    149.9%,     87.4%,    126.5%
    64 bytes (source + 1, dest + 3),    100.0%,    214.7%,     87.4%,    134.1%
   128 bytes (source + 1, dest + 3),    100.0%,    258.5%,     86.8%,    157.5%
   256 bytes (source + 1, dest + 3),    100.0%,    294.1%,     86.6%,    226.6%
   512 bytes (source + 1, dest + 3),    100.0%,    314.5%,     87.5%,    296.8%
  1024 bytes (source + 1, dest + 3),    100.0%,    329.3%,     87.7%,    343.0%
  2048 bytes (source + 1, dest + 3),    100.0%,    333.2%,     87.6%,    373.9%
  4096 bytes (source + 1, dest + 3),    100.0%,    335.8%,     87.6%,    393.4%
  8192 bytes (source + 1, dest + 3),    100.0%,    329.8%,     87.7%,    407.2%
 16384 bytes (source + 1, dest + 3),    100.0%,    215.1%,     90.0%,    520.0%
 32768 bytes (source + 1, dest + 3),    100.0%,    215.5%,     90.0%,    300.6%
 65536 bytes (source + 1, dest + 3),    100.0%,    215.8%,     89.9%,    301.4%
     1 bytes (source + 2, dest + 2),    100.0%,     66.7%,     71.6%,    104.6%
     2 bytes (source + 2, dest + 2),    100.0%,     95.5%,     61.3%,    103.7%
     4 bytes (source + 2, dest + 2),    100.0%,     85.2%,     92.4%,    102.7%
     8 bytes (source + 2, dest + 2),    100.0%,    103.2%,    110.8%,     92.7%
    16 bytes (source + 2, dest + 2),    100.0%,    119.9%,    119.7%,    132.8%
    32 bytes (source + 2, dest + 2),    100.0%,    164.2%,    139.6%,    204.0%
    64 bytes (source + 2, dest + 2),    100.0%,    287.5%,    160.3%,    148.6%
   128 bytes (source + 2, dest + 2),    100.0%,    402.3%,    172.6%,    226.2%
   256 bytes (source + 2, dest + 2),    100.0%,    486.8%,    174.4%,    401.6%
   512 bytes (source + 2, dest + 2),    100.0%,    546.2%,    173.1%,    679.1%
  1024 bytes (source + 2, dest + 2),    100.0%,    583.0%,    174.6%,    964.1%
  2048 bytes (source + 2, dest + 2),    100.0%,    596.7%,    174.9%,   1345.4%
  4096 bytes (source + 2, dest + 2),    100.0%,    610.7%,    174.9%,   1655.4%
  8192 bytes (source + 2, dest + 2),    100.0%,    585.8%,    173.4%,   1841.0%
 16384 bytes (source + 2, dest + 2),    100.0%,    268.8%,    148.9%,   2306.2%
 32768 bytes (source + 2, dest + 2),    100.0%,    269.3%,    149.1%,    472.2%
 65536 bytes (source + 2, dest + 2),    100.0%,    269.7%,    149.3%,    473.6%
     1 bytes (source + 3, dest + 1),    100.0%,     91.0%,     55.9%,    100.8%
     2 bytes (source + 3, dest + 1),    100.0%,     92.6%,     72.1%,    100.6%
     4 bytes (source + 3, dest + 1),    100.0%,     83.5%,     75.7%,    100.0%
     8 bytes (source + 3, dest + 1),    100.0%,    100.9%,     79.5%,     92.7%
    16 bytes (source + 3, dest + 1),    100.0%,    134.2%,     84.0%,    113.0%
    32 bytes (source + 3, dest + 1),    100.0%,    165.9%,     75.8%,    129.1%
    64 bytes (source + 3, dest + 1),    100.0%,    220.8%,     81.9%,    133.1%
   128 bytes (source + 3, dest + 1),    100.0%,    259.7%,     84.5%,    159.0%
   256 bytes (source + 3, dest + 1),    100.0%,    303.7%,     88.6%,    226.0%
   512 bytes (source + 3, dest + 1),    100.0%,    320.8%,     87.9%,    296.9%
  1024 bytes (source + 3, dest + 1),    100.0%,    326.8%,     87.8%,    343.9%
  2048 bytes (source + 3, dest + 1),    100.0%,    333.8%,     87.7%,    373.0%
  4096 bytes (source + 3, dest + 1),    100.0%,    336.7%,     87.6%,    394.0%
  8192 bytes (source + 3, dest + 1),    100.0%,    330.3%,     87.6%,    407.1%
 16384 bytes (source + 3, dest + 1),    100.0%,    227.6%,     91.4%,    549.4%
 32768 bytes (source + 3, dest + 1),    100.0%,    228.1%,     91.3%,    321.6%
 65536 bytes (source + 3, dest + 1),    100.0%,    228.3%,     91.2%,    322.4%
Notice the 4-byte aligned copy result of only 64%, even though the code for that size wasn't touched at all. Those are the sporadic inconsistencies I was talking about earlier (my best guess would be that PSPLink somehow interferes... maybe it would be a good idea to disable interrupts during the runs). I also updated the code in my previous post with the new code for vfpu_memcpy.
Raphael
« Reply #18 on: September 03, 2008, 07:19:19 AM »

I just redid the test with interrupts disabled, and that really seems to help steady the results:
Code:
     1 bytes (     4 bytes aligned),    100.0%,     91.7%,     69.8%,    100.8%
     2 bytes (     4 bytes aligned),    100.0%,     92.7%,     77.4%,    100.6%
     4 bytes (     4 bytes aligned),    100.0%,    135.9%,    111.8%,    100.4%
     8 bytes (     4 bytes aligned),    100.0%,    125.5%,    170.2%,    150.6%
    16 bytes (     4 bytes aligned),    100.0%,     63.5%,    102.2%,     85.2%
    32 bytes (     4 bytes aligned),    100.0%,     87.9%,    117.1%,     89.6%
    64 bytes (     4 bytes aligned),    100.0%,     96.5%,    116.7%,    149.0%
   128 bytes (     4 bytes aligned),    100.0%,    104.3%,    116.8%,    202.5%
   256 bytes (     4 bytes aligned),    100.0%,    109.6%,    116.8%,    264.1%
   512 bytes (     4 bytes aligned),    100.0%,    112.9%,    116.6%,    316.3%
  1024 bytes (     4 bytes aligned),    100.0%,    114.6%,    116.6%,    350.1%
  2048 bytes (     4 bytes aligned),    100.0%,    115.6%,    116.6%,    374.2%
  4096 bytes (     4 bytes aligned),    100.0%,    116.1%,    116.6%,    386.8%
  8192 bytes (     4 bytes aligned),    100.0%,    116.4%,    116.6%,    393.7%
 16384 bytes (     4 bytes aligned),    100.0%,    103.8%,    103.8%,   1061.8%
 32768 bytes (     4 bytes aligned),    100.0%,    103.8%,    103.8%,    182.1%
 65536 bytes (     4 bytes aligned),    100.0%,    103.8%,    103.8%,    182.3%
     1 bytes (          source + 1),    100.0%,     91.7%,     62.7%,    102.3%
     2 bytes (          source + 1),    100.0%,     93.2%,     72.6%,    101.2%
     4 bytes (          source + 1),    100.0%,     83.8%,     76.2%,    100.9%
     8 bytes (          source + 1),    100.0%,     86.2%,     80.4%,     85.4%
    16 bytes (          source + 1),    100.0%,    118.2%,     84.0%,    106.0%
    32 bytes (          source + 1),    100.0%,    152.0%,     85.5%,    123.8%
    64 bytes (          source + 1),    100.0%,    207.6%,     86.5%,    130.7%
   128 bytes (          source + 1),    100.0%,    255.9%,     87.0%,    156.8%
   256 bytes (          source + 1),    100.0%,    291.3%,     87.2%,    225.0%
   512 bytes (          source + 1),    100.0%,    313.1%,     87.4%,    293.3%
  1024 bytes (          source + 1),    100.0%,    325.6%,     87.4%,    339.9%
  2048 bytes (          source + 1),    100.0%,    332.2%,     87.5%,    373.2%
  4096 bytes (          source + 1),    100.0%,    335.6%,     87.5%,    392.8%
  8192 bytes (          source + 1),    100.0%,    332.6%,     87.6%,    403.7%
 16384 bytes (          source + 1),    100.0%,    229.0%,     90.7%,    558.9%
 32768 bytes (          source + 1),    100.0%,    229.3%,     90.7%,    323.4%
 65536 bytes (          source + 1),    100.0%,    229.5%,     90.7%,    324.1%
     1 bytes (          source + 2),    100.0%,     91.7%,     69.5%,    100.8%
     2 bytes (          source + 2),    100.0%,     93.2%,     69.8%,    101.2%
     4 bytes (          source + 2),    100.0%,     83.8%,     91.6%,    100.9%
     8 bytes (          source + 2),    100.0%,     89.2%,    110.3%,     88.7%
    16 bytes (          source + 2),    100.0%,    122.2%,    132.4%,    110.0%
    32 bytes (          source + 2),    100.0%,    155.2%,    148.7%,    126.6%
    64 bytes (          source + 2),    100.0%,    211.1%,    160.1%,    131.5%
   128 bytes (          source + 2),    100.0%,    258.7%,    167.0%,    157.3%
   256 bytes (          source + 2),    100.0%,    293.0%,    170.8%,    225.2%
   512 bytes (          source + 2),    100.0%,    314.1%,    172.8%,    293.5%
  1024 bytes (          source + 2),    100.0%,    326.1%,    173.9%,    340.7%
  2048 bytes (          source + 2),    100.0%,    332.5%,    174.4%,    373.5%
  4096 bytes (          source + 2),    100.0%,    335.8%,    174.7%,    392.8%
  8192 bytes (          source + 2),    100.0%,    332.8%,    174.0%,    403.7%
 16384 bytes (          source + 2),    100.0%,    228.0%,    144.1%,    556.5%
 32768 bytes (          source + 2),    100.0%,    228.5%,    144.1%,    322.2%
 65536 bytes (          source + 2),    100.0%,    228.7%,    144.1%,    322.9%
     1 bytes (          source + 3),    100.0%,     91.0%,     55.9%,    100.8%
     2 bytes (          source + 3),    100.0%,     93.2%,     72.6%,    101.2%
     4 bytes (          source + 3),    100.0%,     83.8%,     76.2%,    100.9%
     8 bytes (          source + 3),    100.0%,     95.9%,     79.9%,     92.9%
    16 bytes (          source + 3),    100.0%,    129.6%,     84.3%,    113.8%
    32 bytes (          source + 3),    100.0%,    161.8%,     85.5%,    129.2%
    64 bytes (          source + 3),    100.0%,    217.1%,     86.5%,    133.3%
   128 bytes (          source + 3),    100.0%,    262.9%,     87.0%,    158.9%
   256 bytes (          source + 3),    100.0%,    295.9%,     87.2%,    225.3%
   512 bytes (          source + 3),    100.0%,    315.9%,     87.4%,    293.6%
  1024 bytes (          source + 3),    100.0%,    327.0%,     87.4%,    341.7%
  2048 bytes (          source + 3),    100.0%,    333.0%,     87.5%,    373.8%
  4096 bytes (          source + 3),    100.0%,    336.0%,     87.5%,    393.2%
  8192 bytes (          source + 3),    100.0%,    335.0%,     87.5%,    402.3%
 16384 bytes (          source + 3),    100.0%,    225.0%,     90.8%,    552.3%
 32768 bytes (          source + 3),    100.0%,    225.2%,     90.9%,    317.4%
 65536 bytes (          source + 3),    100.0%,    225.3%,     90.9%,    318.0%
     1 bytes (            dest + 1),    100.0%,     92.4%,     70.9%,    103.1%
     2 bytes (            dest + 1),    100.0%,     93.2%,     62.1%,    100.6%
     4 bytes (            dest + 1),    100.0%,     83.8%,     61.2%,    100.4%
     8 bytes (            dest + 1),    100.0%,     98.3%,     63.9%,    113.1%
    16 bytes (            dest + 1),    100.0%,    132.3%,     71.0%,    129.8%
    32 bytes (            dest + 1),    100.0%,    204.6%,     76.1%,    140.0%
    64 bytes (            dest + 1),    100.0%,    252.1%,     81.1%,    264.3%
   128 bytes (            dest + 1),    100.0%,    288.1%,     84.1%,    321.2%
   256 bytes (            dest + 1),    100.0%,    311.1%,     85.7%,    361.5%
   512 bytes (            dest + 1),    100.0%,    324.4%,     86.6%,    388.4%
  1024 bytes (            dest + 1),    100.0%,    331.6%,     87.1%,    399.1%
  2048 bytes (            dest + 1),    100.0%,    335.3%,     87.3%,    406.8%
  4096 bytes (            dest + 1),    100.0%,    337.2%,     87.4%,    410.7%
  8192 bytes (            dest + 1),    100.0%,    335.8%,     87.5%,    413.7%
 16384 bytes (            dest + 1),    100.0%,    215.6%,     90.0%,    534.7%
 32768 bytes (            dest + 1),    100.0%,    215.8%,     90.0%,    326.3%
 65536 bytes (            dest + 1),    100.0%,    215.9%,     90.0%,    326.6%
     1 bytes (            dest + 2),    100.0%,     91.0%,     69.5%,    100.8%
     2 bytes (            dest + 2),    100.0%,     93.2%,     77.5%,    101.9%
     4 bytes (            dest + 2),    100.0%,     83.8%,     79.4%,    100.4%
     8 bytes (            dest + 2),    100.0%,    103.8%,     89.9%,    113.4%
    16 bytes (            dest + 2),    100.0%,    136.9%,    108.6%,    129.4%
    32 bytes (            dest + 2),    100.0%,    211.1%,    127.3%,    140.1%
    64 bytes (            dest + 2),    100.0%,    256.9%,    146.1%,    263.8%
   128 bytes (            dest + 2),    100.0%,    291.7%,    158.8%,    321.0%
   256 bytes (            dest + 2),    100.0%,    313.2%,    166.4%,    361.4%
   512 bytes (            dest + 2),    100.0%,    325.5%,    170.5%,    385.2%
  1024 bytes (            dest + 2),    100.0%,    332.2%,    172.7%,    400.3%
  2048 bytes (            dest + 2),    100.0%,    335.6%,    173.8%,    406.8%
  4096 bytes (            dest + 2),    100.0%,    337.4%,    174.4%,    410.7%
  8192 bytes (            dest + 2),    100.0%,    335.8%,    174.2%,    413.5%
 16384 bytes (            dest + 2),    100.0%,    215.6%,    149.1%,    534.6%
 32768 bytes (            dest + 2),    100.0%,    215.9%,    149.2%,    326.4%
 65536 bytes (            dest + 2),    100.0%,    216.0%,    149.3%,    326.7%
     1 bytes (            dest + 3),    100.0%,     94.4%,     71.6%,    103.0%
     2 bytes (            dest + 3),    100.0%,     95.5%,     64.3%,    103.7%
     4 bytes (            dest + 3),    100.0%,     84.9%,     62.1%,    101.8%
     8 bytes (            dest + 3),    100.0%,    102.9%,     64.9%,    115.1%
    16 bytes (            dest + 3),    100.0%,    135.3%,     71.4%,    130.5%
    32 bytes (            dest + 3),    100.0%,    208.9%,     76.4%,    140.6%
    64 bytes (            dest + 3),    100.0%,    255.0%,     81.3%,    265.1%
   128 bytes (            dest + 3),    100.0%,    290.1%,     84.2%,    322.1%
   256 bytes (            dest + 3),    100.0%,    312.3%,     85.8%,    365.2%
   512 bytes (            dest + 3),    100.0%,    325.1%,     86.6%,    385.3%
  1024 bytes (            dest + 3),    100.0%,    331.9%,     87.1%,    399.3%
  2048 bytes (            dest + 3),    100.0%,    335.5%,     87.3%,    406.8%
  4096 bytes (            dest + 3),    100.0%,    337.3%,     87.4%,    410.8%
  8192 bytes (            dest + 3),    100.0%,    335.8%,     87.5%,    413.7%
 16384 bytes (            dest + 3),    100.0%,    215.7%,     90.1%,    534.8%
 32768 bytes (            dest + 3),    100.0%,    215.9%,     90.2%,    326.4%
 65536 bytes (            dest + 3),    100.0%,    216.0%,     90.2%,    326.7%
     1 bytes (source + 1, dest + 3),    100.0%,     93.2%,     63.8%,    103.8%
     2 bytes (source + 1, dest + 3),    100.0%,     94.9%,     74.3%,    103.7%
     4 bytes (source + 1, dest + 3),    100.0%,     85.3%,     77.6%,    102.7%
     8 bytes (source + 1, dest + 3),    100.0%,     88.4%,     80.6%,     85.9%
    16 bytes (source + 1, dest + 3),    100.0%,    120.8%,     84.7%,    107.1%
    32 bytes (source + 1, dest + 3),    100.0%,    154.3%,     85.9%,    124.5%
    64 bytes (source + 1, dest + 3),    100.0%,    209.6%,     86.6%,    131.2%
   128 bytes (source + 1, dest + 3),    100.0%,    257.4%,     87.1%,    156.9%
   256 bytes (source + 1, dest + 3),    100.0%,    292.2%,     87.3%,    225.3%
   512 bytes (source + 1, dest + 3),    100.0%,    313.7%,     87.4%,    293.5%
  1024 bytes (source + 1, dest + 3),    100.0%,    325.9%,     87.4%,    339.7%
  2048 bytes (source + 1, dest + 3),    100.0%,    332.4%,     87.5%,    373.5%
  4096 bytes (source + 1, dest + 3),    100.0%,    335.7%,     87.5%,    392.8%
  8192 bytes (source + 1, dest + 3),    100.0%,    329.7%,     87.6%,    405.5%
 16384 bytes (source + 1, dest + 3),    100.0%,    215.0%,     90.0%,    521.9%
 32768 bytes (source + 1, dest + 3),    100.0%,    215.5%,     89.9%,    300.7%
 65536 bytes (source + 1, dest + 3),    100.0%,    215.8%,     89.9%,    301.4%
     1 bytes (source + 2, dest + 2),    100.0%,     91.7%,     69.5%,    101.5%
     2 bytes (source + 2, dest + 2),    100.0%,     92.1%,     69.4%,    100.6%
     4 bytes (source + 2, dest + 2),    100.0%,     83.5%,     90.8%,    100.9%
     8 bytes (source + 2, dest + 2),    100.0%,    102.3%,    109.9%,     91.7%
    16 bytes (source + 2, dest + 2),    100.0%,    143.8%,    131.8%,    132.1%
    32 bytes (source + 2, dest + 2),    100.0%,    190.0%,    148.2%,    203.4%
    64 bytes (source + 2, dest + 2),    100.0%,    287.0%,    160.0%,    149.7%
   128 bytes (source + 2, dest + 2),    100.0%,    388.5%,    166.9%,    223.0%
   256 bytes (source + 2, dest + 2),    100.0%,    477.0%,    170.8%,    402.5%
   512 bytes (source + 2, dest + 2),    100.0%,    539.0%,    172.8%,    670.3%
  1024 bytes (source + 2, dest + 2),    100.0%,    576.9%,    173.9%,   1000.7%
  2048 bytes (source + 2, dest + 2),    100.0%,    598.2%,    174.4%,   1357.3%
  4096 bytes (source + 2, dest + 2),    100.0%,    609.4%,    174.7%,   1661.9%
  8192 bytes (source + 2, dest + 2),    100.0%,    583.9%,    173.5%,   1826.6%
 16384 bytes (source + 2, dest + 2),    100.0%,    268.7%,    148.9%,   2347.5%
 32768 bytes (source + 2, dest + 2),    100.0%,    269.3%,    149.2%,    472.0%
 65536 bytes (source + 2, dest + 2),    100.0%,    269.7%,    149.3%,    473.4%
     1 bytes (source + 3, dest + 1),    100.0%,     91.0%,     55.9%,    100.8%
     2 bytes (source + 3, dest + 1),    100.0%,     92.1%,     71.8%,    100.6%
     4 bytes (source + 3, dest + 1),    100.0%,     84.1%,     75.9%,    100.9%
     8 bytes (source + 3, dest + 1),    100.0%,    100.9%,     79.9%,     92.9%
    16 bytes (source + 3, dest + 1),    100.0%,    134.2%,     84.2%,    113.4%
    32 bytes (source + 3, dest + 1),    100.0%,    166.1%,     85.5%,    129.2%
    64 bytes (source + 3, dest + 1),    100.0%,    220.9%,     86.5%,    133.5%
   128 bytes (source + 3, dest + 1),    100.0%,    266.0%,     86.9%,    159.0%
   256 bytes (source + 3, dest + 1),    100.0%,    297.8%,     87.2%,    227.6%
   512 bytes (source + 3, dest + 1),    100.0%,    316.8%,     87.4%,    293.4%
  1024 bytes (source + 3, dest + 1),    100.0%,    327.6%,     87.4%,    341.5%
  2048 bytes (source + 3, dest + 1),    100.0%,    333.3%,     87.5%,    373.9%
  4096 bytes (source + 3, dest + 1),    100.0%,    336.2%,     87.5%,    393.2%
  8192 bytes (source + 3, dest + 1),    100.0%,    332.0%,     87.6%,    404.2%
 16384 bytes (source + 3, dest + 1),    100.0%,    228.1%,     91.3%,    556.8%
 32768 bytes (source + 3, dest + 1),    100.0%,    228.5%,     91.3%,    322.1%
 65536 bytes (source + 3, dest + 1),    100.0%,    228.7%,     91.2%,    322.9%
Logged

Don't push the river, it flows.
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki
http://www.homebrew-illuminati.co.uk - serious homebrew development for all platforms
Alexander Berl
"A good mod is a combination playground monitor, priest, big brother/sister, psychiatrist, professor and more."
Flatmush
« Reply #19 on: September 03, 2008, 10:06:19 AM »

I think I realize why I got a speedup with the invented types: I did run a benchmark, and the types really did make it faster.
I'm pretty sure it's because, when optimized, the compiler emits more copies per loop iteration, so the branch overhead is spread over more work.

E.g. for the u256 type, it would do 8 word copies per iteration, so the branch overhead per word is much lower. This probably only makes a difference when the data is already in the cache, though.
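
A rough illustration of the unrolling effect described above, written as plain portable C rather than the struct-copy trick from the earlier posts. The function name `copy_words_unrolled8` is made up for this sketch, and it assumes both pointers are already word aligned:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies `words` 32-bit words, eight per loop
   iteration, so the loop branch is taken once per 32 bytes instead of
   once per 4 bytes. Assumes dst and src are word aligned. */
static void copy_words_unrolled8(uint32_t *dst, const uint32_t *src, size_t words)
{
    while (words >= 8) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst[4] = src[4];
        dst[5] = src[5];
        dst[6] = src[6];
        dst[7] = src[7];
        dst += 8;
        src += 8;
        words -= 8;
    }
    /* Leftover words (fewer than eight) are copied one at a time. */
    while (words > 0) {
        *dst++ = *src++;
        words--;
    }
}
```

Whether this actually beats a simple loop depends on the compiler and on where the data sits, so it would need the same kind of benchmark as the rest of the thread.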
« Last Edit: September 03, 2008, 10:15:19 AM by Flatmush » Logged

Noware
« Reply #20 on: September 03, 2008, 02:26:15 PM »

Hi Raphael,

I added your changes to my code, and got the following results
Code:
                                        memcpy     daniel   flatmush    raphael       libc

     1 bytes (    64 bytes aligned),    100.0%,     19.0%,     14.1%,     20.7%,     20.8%
     2 bytes (    64 bytes aligned),    100.0%,     25.1%,     20.8%,     27.2%,     27.2%
     4 bytes (    64 bytes aligned),    100.0%,     16.1%,     20.2%,     16.1%,     16.1%
     8 bytes (    64 bytes aligned),    100.0%,     20.6%,     37.6%,     32.8%,     26.1%
    16 bytes (    64 bytes aligned),    100.0%,     31.2%,     62.6%,     44.4%,     55.1%
    32 bytes (    64 bytes aligned),    100.0%,     59.3%,     99.3%,     68.9%,     81.2%
    64 bytes (    64 bytes aligned),    100.0%,    111.0%,    159.6%,    193.1%,    131.9%
   128 bytes (    64 bytes aligned),    100.0%,    145.9%,    182.9%,    305.8%,    153.9%
   256 bytes (    64 bytes aligned),    100.0%,    174.1%,    198.3%,    431.5%,    168.1%
   512 bytes (    64 bytes aligned),    100.0%,    192.8%,    206.9%,    547.8%,    176.4%
  1024 bytes (    64 bytes aligned),    100.0%,    204.0%,    211.6%,    624.2%,    180.9%
  2048 bytes (    64 bytes aligned),    100.0%,    210.0%,    214.0%,    681.2%,    183.2%
  4096 bytes (    64 bytes aligned),    100.0%,    213.2%,    215.2%,    713.9%,    184.4%
  8192 bytes (    64 bytes aligned),    100.0%,    216.2%,    217.2%,    733.7%,    186.2%
 16384 bytes (    64 bytes aligned),    100.0%,    137.6%,    137.8%,   1408.2%,    132.2%
 32768 bytes (    64 bytes aligned),    100.0%,    127.8%,    128.0%,    224.5%,    122.7%
 65536 bytes (    64 bytes aligned),    100.0%,    127.8%,    128.0%,    224.7%,    122.7%

Notes:
1.) I added a libc column (the libc code you added to the source)
2.) Changed the alignment to 64 bytes
3.) I didn't manage to link with sceKernelIcacheInvalidateAll(), so I skipped it
4.) My code is in a C++ file (but I tried putting Flatmush's code in a separate C file before and got the same results)
5.) I added checks for 8, 16 and 32 byte alignment (base + 8, 16, 32), but got almost the same results
6.) The first column uses the default memcpy(), and I see that the compiler is doing some sort of inlining there when size < 64 bytes, see below

memcpy calls using a (volatile) size variable instead of a constant value:
Code:
                                        memcpy     daniel   flatmush    raphael       libc

     1 bytes (    64 bytes aligned),    100.0%,    108.0%,     80.7%,    118.1%,    118.1%
     2 bytes (    64 bytes aligned),    100.0%,    107.1%,     88.3%,    114.6%,    115.2%
     4 bytes (    64 bytes aligned),    100.0%,     75.3%,     94.0%,     74.7%,     75.0%
     8 bytes (    64 bytes aligned),    100.0%,     60.7%,    109.6%,     96.4%,     76.6%
    16 bytes (    64 bytes aligned),    100.0%,     52.0%,    103.3%,     73.8%,     91.2%
    32 bytes (    64 bytes aligned),    100.0%,     68.3%,    113.7%,     79.1%,     92.6%
    64 bytes (    64 bytes aligned),    100.0%,     78.7%,    112.8%,    136.5%,     93.4%
   128 bytes (    64 bytes aligned),    100.0%,     89.4%,    111.9%,    186.3%,     94.2%
   256 bytes (    64 bytes aligned),    100.0%,     98.0%,    111.5%,    242.2%,     94.6%
   512 bytes (    64 bytes aligned),    100.0%,    103.8%,    111.3%,    293.9%,     94.9%
  1024 bytes (    64 bytes aligned),    100.0%,    107.2%,    111.2%,    328.4%,     95.1%
  2048 bytes (    64 bytes aligned),    100.0%,    109.1%,    111.1%,    354.3%,     95.2%
  4096 bytes (    64 bytes aligned),    100.0%,    110.1%,    111.1%,    367.5%,     95.2%
  8192 bytes (    64 bytes aligned),    100.0%,    110.4%,    110.9%,    381.3%,     95.3%
 16384 bytes (    64 bytes aligned),    100.0%,    103.9%,    104.0%,   1017.5%,     99.7%
 32768 bytes (    64 bytes aligned),    100.0%,    103.9%,    104.1%,    182.6%,     99.8%
 65536 bytes (    64 bytes aligned),    100.0%,    103.9%,    104.1%,    182.7%,     99.8%

[EDIT]
code change
Code:
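#define COPY_LOOP(d, s, a)                                                      \
    {                                                                           \
        /* volatile: the size is no longer a compile-time constant */           \
        volatile int bytes = 1;                                                 \
        COPY(d, s, bytes, a) bytes*=2;                                          \
        COPY(d, s, bytes, a) bytes*=2;                                          \
        /* ... one COPY line per size, up to 65536 bytes ... */                 \
    }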
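
For anyone who wants to try the same trick outside the benchmark macros, here is a minimal sketch in portable C. The name `sweep_copy_sizes` is invented for this example. It assumes, as the tables above suggest, that reading the size from a volatile variable is enough to stop the compiler from replacing small constant-size memcpy calls with inline code. Whether that happens at all depends on the compiler version and flags.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical size sweep: copies 1, 2, 4, ... bytes up to max_bytes.
   The size lives in a volatile object, so each memcpy call sees a value
   the compiler cannot treat as a constant. Returns the total number of
   bytes copied over all cases. */
static size_t sweep_copy_sizes(unsigned char *dst, const unsigned char *src,
                               size_t max_bytes)
{
    volatile size_t bytes = 1;
    size_t total = 0;

    while (bytes <= max_bytes) {
        size_t n = bytes;      /* one volatile read per test case */
        memcpy(dst, src, n);   /* dst and src must not overlap */
        total += n;
        bytes = n * 2;         /* next power of two */
    }
    return total;
}
```

In a real benchmark the memcpy line would be wrapped in the timing code, and each implementation would get its own sweep.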

Noware
« Last Edit: September 03, 2008, 02:45:53 PM by Noware » Logged

Reporter - What do you think of western civilization?
Gandhi - I think it would be a good idea!
Raphael
« Reply #21 on: September 03, 2008, 02:55:17 PM »


1. That's the code I copy-pasted from newlib :)
3. That is declared in psputilsforkernel.h and linked in with -lpspkernel
4. Should still work ;)
5. That's to be expected, as most implementations work fine for any alignment that is a multiple of 4 bytes (the inner loop consists of 32-bit copies). The cases where a bigger alignment helps (in my implementation) only kick in for bigger copies, where the overhead of checking alignment and realigning is small compared to the latency of the memory instructions.
6. Yep, that's why I chose to create a local copy of newlib's memcpy and declare it with the noinline attribute. Working with a (volatile) size variable seems to give more reliable results too (they look comparable to my last results). I still can't explain the significant difference between the memcpy and libc columns for 4 and 8 byte copies. Maybe disabling interrupts will give better results for you too:
Code:
        s32 intc = pspSdkDisableInterrupts();                                   \
        u64 time = GetCurrentTick();                                            \
        int j;                                                                  \
        for (j=0; j<1000; ++j)                                                  \
            memcpy_libc(d, s, n);                                               \
        gcc_elapsed = (int)(GetCurrentTick()-time);                             \
        pspSdkEnableInterrupts(intc);                                           \
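
To make point 5 a bit more concrete: the "(source + 2, dest + 2)" rows can reach word copies after a few leading byte copies, while the "(source + 3, dest + 1)" rows never can, because source and destination disagree in their low two address bits. A small sketch of that check follows. The helper names are invented here and this is not code from any of the benchmarked implementations.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero when dst and src have the same offset
   within a 4-byte word, i.e. aligning one of them also aligns the other. */
static int shares_word_alignment(const void *dst, const void *src)
{
    return (((uintptr_t)dst ^ (uintptr_t)src) & 3u) == 0;
}

/* Hypothetical helper: number of leading byte copies needed before
   `p` sits on a 4-byte boundary (0 to 3). */
static unsigned head_bytes_to_word(const void *p)
{
    return (unsigned)((4u - ((uintptr_t)p & 3u)) & 3u);
}
```

When `shares_word_alignment` is false, an implementation has to fall back to byte loads or to unaligned load instructions for the source, which fits the lower numbers in the mismatched rows.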
Noware
« Reply #22 on: September 03, 2008, 03:06:56 PM »

Hi Raphael,

I added the interrupt disable/enable calls, the noinline attribute, etc.
I will try linking with -lpspkernel later.

Noware
Noware
« Reply #23 on: September 04, 2008, 09:06:32 AM »

Hi Raphael,

I think we can say your implementation is the fastest in most cases, and even in the cases where it is slower than Daniel's or Flatmush's code, it's mostly faster than the default memcpy. So I think I will use it to override the default memcpy.

Thanks,
Noware
Raphael
« Reply #24 on: December 12, 2009, 02:10:42 PM »

After stumbling upon this old thread, I noticed that I never actually posted the updated and final code of my memcpy function :D

Code:
void* memcpy_vfpu( void* dst, void* src, unsigned int size )
{
    u8* src8 = (u8*)src;
    u8* dst8 = (u8*)dst;

    // < 8 isn't worth trying any optimisations...
    if (size<8) goto bytecopy;

    // < 64 means we don't gain anything from using vfpu...
    if (size<64)
    {
        // Align dst on 4 bytes or just resume if already done
        while (((((u32)dst8) & 0x3)!=0) && size) {
            *dst8++ = *src8++;
            size--;
        }
        if (size<4) goto bytecopy;

        // We are dst aligned now and >= 4 bytes to copy
        u32* src32 = (u32*)src8;
        u32* dst32 = (u32*)dst8;
        switch(((u32)src8)&0x3)
        {
            case 0:
                while (size&0xC)
                {
                    *dst32++ = *src32++;
                    size -= 4;
                }
                if (size==0) return (dst); // fast out
                while (size>=16)
                {
                    *dst32++ = *src32++;
                    *dst32++ = *src32++;
                    *dst32++ = *src32++;
                    *dst32++ = *src32++;
                    size -= 16;
                }
                if (size==0) return (dst); // fast out
                src8 = (u8*)src32;
                dst8 = (u8*)dst32;
                break;
            default:
                {
                    register u32 a, b, c, d;
                    while (size>=4)
                    {
                        a = *src8++;
                        b = *src8++;
                        c = *src8++;
                        d = *src8++;
                        *dst32++ = (d << 24) | (c << 16) | (b << 8) | a;
                        size -= 4;
                    }
                    if (size==0) return (dst); // fast out
                    dst8 = (u8*)dst32;
                }
                break;
        }
        goto bytecopy;
    }

    // Align dst on 16 bytes to gain from vfpu aligned stores
    while ((((u32)dst8) & 0xF)!=0 && size) {
        *dst8++ = *src8++;
        size--;
    }

    // We use uncached dst to use VFPU writeback and free cpu cache for src only
    u8* udst8 = (u8*)((u32)dst8 | 0x40000000);
    // We need the 64 byte aligned address to make sure the dcache is invalidated correctly
    u8* dst64a = (u8*)((u32)dst8&~0x3F); // cast back to a pointer (was missing)
    // Invalidate the first line that matches up to the dst start
    if (size>=64)
        asm(".set push\n" // save assembler option
            ".set noreorder\n" // suppress reordering
            "cache 0x1B, 0(%0)\n"
            "addiu %0, %0, 64\n"
            "sync\n"
            ".set pop\n"
            :"+r"(dst64a));
    switch(((u32)src8&0xF))
    {
        // src aligned on 16 bytes too? nice!
        case 0:
            while (size>=64)
            {
                asm(".set push\n" // save assembler option
                    ".set noreorder\n" // suppress reordering
                    "cache 0x1B,  0(%2)\n" // Dcache writeback invalidate
                    "lv.q c000,  0(%1)\n"
                    "lv.q c010, 16(%1)\n"
                    "lv.q c020, 32(%1)\n"
                    "lv.q c030, 48(%1)\n"
                    "sync\n" // Wait for allegrex writeback
                    "sv.q c000,  0(%0), wb\n"
                    "sv.q c010, 16(%0), wb\n"
                    "sv.q c020, 32(%0), wb\n"
                    "sv.q c030, 48(%0), wb\n"
                    // Lots of variable updates... but get hidden in sv.q latency anyway
                    "addiu  %3, %3, -64\n"
                    "addiu %2, %2, 64\n"
                    "addiu %1, %1, 64\n"
                    "addiu %0, %0, 64\n"
                    ".set pop\n" // restore assembler option
                    :"+r"(udst8),"+r"(src8),"+r"(dst64a),"+r"(size)
                    :
                    :"memory"
                    );
            }
            if (size>16)
            {
                // Invalidate the last cache line where the max remaining 63 bytes are
                asm(".set push\n" // save assembler option
                    ".set noreorder\n" // suppress reordering
                    "cache 0x1B, 0(%0)\n"
                    "sync\n"
                    ".set pop\n" // restore assembler option
                    ::"r"(dst64a));
                while (size>=16)
                {
                    asm(".set push\n" // save assembler option
                        ".set noreorder\n" // suppress reordering
                        "lv.q c000, 0(%1)\n"
                        "sv.q c000, 0(%0), wb\n"
                        // Lots of variable updates... but get hidden in sv.q latency anyway
                        "addiu %2, %2, -16\n"
                        "addiu %1, %1, 16\n"
                        "addiu %0, %0, 16\n"
                        ".set pop\n" // restore assembler option
                        :"+r"(udst8),"+r"(src8),"+r"(size)
                        :
                        :"memory"
                        );
                }
            }
            asm(".set push\n" // save assembler option
                ".set noreorder\n" // suppress reordering
                "vflush\n" // Flush VFPU writeback cache
                ".set pop\n" // restore assembler option
                );
            dst8 = (u8*)((u32)udst8 & ~0x40000000);
            break;
        // src is only qword unaligned but word aligned? We can at least use ulv.q
        case 4:
        case 8:
        case 12:
            while (size>=64)
            {
                asm(".set push\n" // save assembler option
                    ".set noreorder\n" // suppress reordering
                    "cache 0x1B,  0(%2)\n" // Dcache writeback invalidate
                    "ulv.q c000,  0(%1)\n"
                    "ulv.q c010, 16(%1)\n"
                    "ulv.q c020, 32(%1)\n"
                    "ulv.q c030, 48(%1)\n"
                    "sync\n" // Wait for allegrex writeback
                    "sv.q c000,  0(%0), wb\n"
                    "sv.q c010, 16(%0), wb\n"
                    "sv.q c020, 32(%0), wb\n"
                    "sv.q c030, 48(%0), wb\n"
                    // Lots of variable updates... but get hidden in sv.q latency anyway
                    "addiu  %3, %3, -64\n"
                    "addiu %2, %2, 64\n"
                    "addiu %1, %1, 64\n"
                    "addiu %0, %0, 64\n"
                    ".set pop\n" // restore assembler option
                    :"+r"(udst8),"+r"(src8),"+r"(dst64a),"+r"(size)
                    :
                    :"memory"
                    );
            }
            if (size>16)
                // Invalidate the last cache line where the max remaining 63 bytes are
                asm(".set push\n" // save assembler option
                    ".set noreorder\n" // suppress reordering
                    "cache 0x1B, 0(%0)\n"
                    "sync\n"
                    ".set pop\n" // restore assembler option
                    ::"r"(dst64a));
            while (size>=16)
            {
                asm(".set push\n" // save assembler option
                    ".set noreorder\n" // suppress reordering
                    "ulv.q c000, 0(%1)\n"
                    "sv.q c000, 0(%0), wb\n"
                    // Lots of variable updates... but get hidden in sv.q latency anyway
                    "addiu %2, %2, -16\n"
                    "addiu %1, %1, 16\n"
                    "addiu %0, %0, 16\n"
                    ".set pop\n" // restore assembler option
                    :"+r"(udst8),"+r"(src8),"+r"(size)
                    :
                    :"memory"
                    );
            }
            asm(".set push\n" // save assembler option
                ".set noreorder\n" // suppress reordering
                "vflush\n" // Flush VFPU writeback cache
                ".set pop\n" // restore assembler option
                );
            dst8 = (u8*)((u32)udst8 & ~0x40000000);
            break;
        // src not aligned? too bad... have to use unaligned reads
        default:
            while (size>=64)
            {
                asm(".set push\n" // save assembler option
                    ".set noreorder\n" // suppress reordering
                    "cache 0x1B,  0(%2)\n"

                    "lwr $8,  0(%1)\n" //
                    "lwl $8,  3(%1)\n" // $8  = *(s + 0)
                    "lwr $9,  4(%1)\n" //
                    "lwl $9,  7(%1)\n" // $9  = *(s + 4)
                    "lwr $10,  8(%1)\n" //
                    "lwl $10, 11(%1)\n" // $10 = *(s + 8)
                    "lwr $11, 12(%1)\n" //
                    "lwl $11, 15(%1)\n" // $11 = *(s + 12)
                    "mtv $8, s000\n"
                    "mtv $9, s001\n"
                    "mtv $10, s002\n"
                    "mtv $11, s003\n"

                    "lwr $8, 16(%1)\n"
                    "lwl $8, 19(%1)\n"
                    "lwr $9, 20(%1)\n"
                    "lwl $9, 23(%1)\n"
                    "lwr $10, 24(%1)\n"
                    "lwl $10, 27(%1)\n"
                    "lwr $11, 28(%1)\n"
                    "lwl $11, 31(%1)\n"
                    "mtv $8, s010\n"
                    "mtv $9, s011\n"
                    "mtv $10, s012\n"
                    "mtv $11, s013\n"

                    "lwr $8, 32(%1)\n"
                    "lwl $8, 35(%1)\n"
                    "lwr $9, 36(%1)\n"
                    "lwl $9, 39(%1)\n"
                    "lwr $10, 40(%1)\n"
                    "lwl $10, 43(%1)\n"
                    "lwr $11, 44(%1)\n"
                    "lwl $11, 47(%1)\n"
                    "mtv $8, s020\n"
                    "mtv $9, s021\n"
                    "mtv $10, s022\n"
                    "mtv $11, s023\n"

                    "lwr $8, 48(%1)\n"
                    "lwl $8, 51(%1)\n"
                    "lwr $9, 52(%1)\n"
                    "lwl $9, 55(%1)\n"
                    "lwr $10, 56(%1)\n"
                    "lwl $10, 59(%1)\n"
                    "lwr $11, 60(%1)\n"
                    "lwl $11, 63(%1)\n"
                    "mtv $8, s030\n"
                    "mtv $9, s031\n"
                    "mtv $10, s032\n"
                    "mtv $11, s033\n"

                    "sync\n"
                    "sv.q c000,  0(%0), wb\n"
                    "sv.q c010, 16(%0), wb\n"
                    "sv.q c020, 32(%0), wb\n"
                    "sv.q c030, 48(%0), wb\n"
                    // Lots of variable updates... but get hidden in sv.q latency anyway
                    "addiu %3, %3, -64\n"
                    "addiu %2, %2, 64\n"
                    "addiu %1, %1, 64\n"
                    "addiu %0, %0, 64\n"
                    ".set pop\n" // restore assembler option
                    :"+r"(udst8),"+r"(src8),"+r"(dst64a),"+r"(size)
                    :
                    :"$8","$9","$10","$11","memory"
                    );
            }
            if (size>16)
                // Invalidate the last cache line where the max remaining 63 bytes are
                asm(".set push\n" // save assembler option
                    ".set noreorder\n" // suppress reordering
                    "cache 0x1B, 0(%0)\n"
                    "sync\n"
                    ".set pop\n" // restore assembler option
                    ::"r"(dst64a));
            while (size>=16)
            {
                asm(".set push\n" // save assembler option
                    ".set noreorder\n" // suppress reordering
                    "lwr $8,  0(%1)\n" //
                    "lwl $8,  3(%1)\n" // $8  = *(s + 0)
                    "lwr $9,  4(%1)\n" //
                    "lwl $9,  7(%1)\n" // $9  = *(s + 4)
                    "lwr $10,  8(%1)\n" //
                    "lwl $10, 11(%1)\n" // $10 = *(s + 8)
                    "lwr $11, 12(%1)\n" //
                    "lwl $11, 15(%1)\n" // $11 = *(s + 12)
                    "mtv $8, s000\n"
                    "mtv $9, s001\n"
                    "mtv $10, s002\n"
                    "mtv $11, s003\n"

                    "sv.q c000, 0(%0), wb\n"
                    // Lots of variable updates... but get hidden in sv.q latency anyway
                    "addiu %2, %2, -16\n"
                    "addiu %1, %1, 16\n"
                    "addiu %0, %0, 16\n"
                    ".set pop\n" // restore assembler option
                    :"+r"(udst8),"+r"(src8),"+r"(size)
                    :
                    :"$8","$9","$10","$11","memory"
                    );
            }
            asm(".set push\n" // save assembler option
                ".set noreorder\n" // suppress reordering
                "vflush\n" // Flush VFPU writeback cache
                ".set pop\n" // restore assembler option
                );
            dst8 = (u8*)((u32)udst8 & ~0x40000000);
            break;
    }

bytecopy:
    // Copy the remains byte per byte...
    while (size--)
    {
        *dst8++ = *src8++;
    }

    return (dst);
}
Who said only short code is fast? :)
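
With this many special cases it is easy to break one alignment or size combination, so a brute-force check against the standard memcpy is worth having. The following is only a sketch of such a check, not something that was posted in this thread. `check_copy`, `copy_fn` and `byte_copy` are names made up for the example, and `byte_copy` is just a stand-in for the function under test.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Stand-in implementation to test; replace with the real candidate. */
static void *byte_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

/* Hypothetical checker: tries every size from 0 to MAX with every
   source and destination offset from 0 to PAD-1, and compares the whole
   destination buffer (including the guard bytes around the copy) with
   what memcpy produces. Returns the number of failing combinations. */
static int check_copy(copy_fn fn)
{
    enum { MAX = 300, PAD = 16 };
    static unsigned char src[MAX + 2 * PAD];
    static unsigned char dst[MAX + 2 * PAD];
    static unsigned char ref[MAX + 2 * PAD];
    int failures = 0;

    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i * 7 + 1);

    for (size_t n = 0; n <= MAX; n++) {
        for (size_t soff = 0; soff < PAD; soff++) {
            for (size_t doff = 0; doff < PAD; doff++) {
                memset(dst, 0xAA, sizeof dst);
                memset(ref, 0xAA, sizeof ref);
                memcpy(ref + doff, src + soff, n);
                void *ret = fn(dst + doff, src + soff, n);
                if (ret != (void *)(dst + doff) ||
                    memcmp(dst, ref, sizeof dst) != 0)
                    failures++;
            }
        }
    }
    return failures;
}
```

Since memcpy_vfpu takes a non-const src and an unsigned int size, it would need a thin wrapper with the `copy_fn` signature before it can be passed to `check_copy`.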
Noware
« Reply #25 on: December 13, 2009, 01:58:58 AM »

Hi Raphael,

In the end I never used your memcpy because it gave me some errors, but thanks, I will try this version now.

[EDIT]
Yes, now it works without crashing ;)

Noware
« Last Edit: December 13, 2009, 02:27:36 AM by Noware » Logged

Raphael
« Reply #26 on: December 13, 2009, 05:48:44 AM »

Why didn't you hit me hard on the head and tell me to post the final version then? :)
Well, at least it's there now.
Bluddy
« Reply #27 on: June 17, 2010, 06:11:53 AM »

Sorry to bring up this old thread.

Raphael (or anyone else) does this mean that the PSP has another cache just for the VFPU? That's what I'm getting from what Raphael said here.
Logged
Raphael
« Reply #28 on: November 13, 2010, 06:46:52 PM »

If I had come by again earlier, I'd have answered earlier. :/

The thing is, the VFPU at least has a writeback cache, which isn't used to its full potential unless the asm code is written specifically for it. I'm not sure whether the VFPU actually has a full cache of its own, but I doubt it, seeing how normal VFPU ops make use of the CPU cache unless that is explicitly bypassed with uncached memory addresses.

The difference is that the WB cache is much simpler in design (and hence in transistor cost), because it only stores (caches) X write operations before sending them to the memory interface.
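
To picture the "stores X write operations before sending them" idea, here is a toy software model in C. It is purely conceptual: the depth of 4, the byte-sized entries and all names (`wb_model`, `wb_store`, `wb_flush`) are assumptions made for this sketch and say nothing about the real hardware.

```c
#include <stddef.h>
#include <stdint.h>

enum { WB_DEPTH = 4 }; /* assumed depth, the real value of X is unknown here */

typedef struct {
    uint8_t *mem;            /* backing "memory" */
    size_t   addr[WB_DEPTH]; /* pending write addresses */
    uint8_t  val[WB_DEPTH];  /* pending write values */
    size_t   pending;        /* number of buffered writes */
    unsigned bursts;         /* how often the buffer was drained */
} wb_model;

/* Drain all buffered writes to memory in one go (compare vflush). */
static void wb_flush(wb_model *wb)
{
    if (wb->pending == 0)
        return;
    for (size_t i = 0; i < wb->pending; i++)
        wb->mem[wb->addr[i]] = wb->val[i];
    wb->pending = 0;
    wb->bursts++;
}

/* Buffer one write; drain first if the buffer is already full. */
static void wb_store(wb_model *wb, size_t addr, uint8_t v)
{
    if (wb->pending == WB_DEPTH)
        wb_flush(wb);
    wb->addr[wb->pending] = addr;
    wb->val[wb->pending] = v;
    wb->pending++;
}
```

The point of the model is only that buffered writes are not visible in memory until a drain happens, which is why the code above ends each VFPU path with a vflush.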