PSP-Programming.com Forums
February 08, 2012, 06:14:28 PM *
Author Topic: fast memcpy  (Read 12398 times)
Noware
C/C++ Developer
Hero Member
Karma: +41/-2
Posts: 685
« on: April 12, 2008, 03:35:59 PM »

Here is the source code of a free and fast memcpy. I forgot the exact stats, but on a PSP, if your setup is right, it's much faster than the default memcpy, which just copies N bytes.

http://www.vik.cc/daniel/portfolio/

To override the default memcpy, add --allow-multiple-definition to your linker flags and set up the link order so that the custom memcpy is linked before the default memcpy:
LDFLAGS += -Wl,--allow-multiple-definition

Cash

Reporter - What do you think of western civilization?
Gandhi - I think it would be a good idea!


Flatmush
Has a normal user title
Administrator
Hero Member
Karma: +84/-26
Posts: 1046
The Omniscient One
« Reply #1 on: April 12, 2008, 04:55:58 PM »

I made a PSP-specific version of this long ago; look here: http://www.psp-programming.com/forums/index.php?topic=2731.0

Also, the PSP bus width is 256 bits, so the fastest method is to copy memory 256 bits at a time, provided it is correctly aligned (just expand the memcpy function there; it's not that hard).
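
As a rough illustration of the idea in this post (not Flatmush's actual memCopy), the sketch below copies 32 bytes (256 bits) per iteration when both pointers are 32-byte aligned and falls back to a byte loop otherwise. The names `chunk256` and `copy256_sketch` are made up for this example, and whether the compiler really emits wide loads and stores for the struct assignment on the PSP is an assumption that would have to be checked in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-byte (256-bit) chunk type; the name is illustrative only. */
typedef struct { uint32_t w[8]; } chunk256;

/* Sketch: copy 256 bits at a time when both pointers are 32-byte aligned. */
void *copy256_sketch(void *dest, const void *src, size_t count)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;

    if ((((uintptr_t)d | (uintptr_t)s) & 31u) == 0) {
        /* Assumption: the caller's buffers may be accessed through
           chunk256 (for example, storage obtained from malloc). */
        while (count >= sizeof(chunk256)) {
            *(chunk256 *)d = *(const chunk256 *)s;
            d += sizeof(chunk256);
            s += sizeof(chunk256);
            count -= sizeof(chunk256);
        }
    }

    /* Tail bytes, or the whole copy when the pointers are not aligned. */
    while (count--) {
        *d++ = *s++;
    }
    return dest;
}
```

Called as `copy256_sketch(dst, src, n)`, it behaves like memcpy for non-overlapping buffers; only the aligned case takes the 32-byte path.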

Firmware History: 2.60 -> 2.71 -> 1.50 -> 3.03oe-c

Hehe I'm a "Hero Member" because I bought posts back when they were in the shop.

Creator of FlatEditPSP, funcLib and flAstro
Noware
« Reply #2 on: April 13, 2008, 02:34:13 AM »

Hi Flatmush,

I'm not saying this is the fastest memcpy, especially when you know your data is aligned, etc., but for all other cases (like operator=, etc.) it's a good replacement for the default memcpy.

I'd also like to mention that copies of less than 8 bytes get more expensive because of all the if checks, so I advise people to check count/size first.
A change that I made to Daniel Vik's code is:

Code:
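    if (count < 8) {
        /* Extra fast path: copy one 32-bit word when both pointers are 4-byte aligned. */
        if (count >= 4 && ((((u32)src8 | (u32)dst8)) & 3) == 0) {
            *((u32 *)dst8) = *((u32 *)src8);
            dst8  += 4;
            src8  += 4;
            count -= 4;
        }

        START_VAL(dst8);
        START_VAL(src8);

        /* Copy the remaining bytes one at a time. */
        while (count--) {
            INC_VAL(dst8) = INC_VAL(src8);
        }

        return dest;
    }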

to

Code:
    if (count < 8) {
        START_VAL(dst8);
        START_VAL(src8);

        /* No alignment test here: every small copy goes byte by byte. */
        while (count--) {
            INC_VAL(dst8) = INC_VAL(src8);
        }

        return dest;
    }

I think the same is true for Flatmush's memCopy.
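
To make the comparison concrete without Daniel Vik's START_VAL/INC_VAL macros, here is a self-contained sketch of the two small-copy paths. The function names are hypothetical and the code is a simplified stand-in, not the code from his memcpy. Both variants produce the same bytes; the second one just drops the alignment test and the 4-byte fast path.

```c
#include <stddef.h>
#include <stdint.h>

/* Variant 1: small copy with the extra 4-byte aligned fast path. */
void small_copy_checked(unsigned char *dst8, const unsigned char *src8, size_t count)
{
    if (count >= 4 && (((uintptr_t)src8 | (uintptr_t)dst8) & 3u) == 0) {
        /* Assumes the buffers may be accessed as 32-bit words. */
        *(uint32_t *)dst8 = *(const uint32_t *)src8;
        dst8  += 4;
        src8  += 4;
        count -= 4;
    }
    while (count--) {
        *dst8++ = *src8++;
    }
}

/* Variant 2: small copy with no extra checks, one byte at a time. */
void small_copy_bytes(unsigned char *dst8, const unsigned char *src8, size_t count)
{
    while (count--) {
        *dst8++ = *src8++;
    }
}
```

Which of the two is faster for counts below 8 depends on how often the data is aligned, so it is worth profiling both on the target.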

Cash
« Last Edit: April 13, 2008, 03:26:09 AM by Cash »

Flatmush
« Reply #3 on: April 13, 2008, 04:16:32 AM »

Yeah but that adds yet more checks for longer copies and memcpy should only be used for relatively large copies. If you're using memcpy for less than 8 bytes you probably shouldn't be.
Also remember that reading memory is much slower than any other operation so in comparison the checks are a very small overhead.

Oh and for the free memory function. Apparently the meminfo way isn't 100% accurate, I forget the reason but if you check ps2dev then there should be a topic about it there. That free memory function I pasted is the fastest method for getting free memory that you can get by only using mallocs and frees. Besides knowing free memory is usually more of a debug feature, I've never known a reason to need a high performance free memory. The best way if you want it to be fast is to keep track of mallocs and frees which is easy enough.
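
The "keep track of mallocs and frees" approach could look roughly like the sketch below. The wrapper names (`tracked_malloc`, `tracked_free`, `tracked_bytes_in_use`) and the size header are assumptions made for illustration; this is not code from funcLib or the PSPSDK. Free memory would then be the known heap size minus the tracked total, ignoring allocator overhead and fragmentation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of every block. The union pads it so the pointer
   handed back to the caller stays aligned for common types. */
typedef union {
    size_t size;
    long double ld;
    long long ll;
    void *p;
} block_header;

/* Running total of bytes requested and not yet freed (not thread-safe). */
static size_t g_bytes_in_use = 0;

void *tracked_malloc(size_t size)
{
    block_header *h = malloc(sizeof(block_header) + size);
    if (h == NULL) {
        return NULL;
    }
    h->size = size;
    g_bytes_in_use += size;
    return h + 1;               /* user data starts right after the header */
}

void tracked_free(void *ptr)
{
    if (ptr == NULL) {
        return;
    }
    block_header *h = (block_header *)ptr - 1;
    g_bytes_in_use -= h->size;
    free(h);
}

size_t tracked_bytes_in_use(void)
{
    return g_bytes_in_use;
}
```

Reading the counter is a single load, which is why this is cheaper than probing the heap with repeated mallocs and frees.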
« Last Edit: April 13, 2008, 04:29:46 AM by Flatmush »

Noware
« Reply #4 on: April 13, 2008, 08:34:58 AM »

Hi Flatmush,

Quote
Yeah but that adds yet more checks for longer copies and memcpy should only be used for relatively large copies. If you're using memcpy for less than 8 bytes you probably shouldn't be
But the default operator= will also use memcpy for small structures/classes.

Quote
Oh and for the free memory function. Apparently the meminfo way isn't 100% accurate, I forget the reason but if you check ps2dev then there should be a topic about it there. That free memory function I pasted is the fastest method for getting free memory that you can get by only using mallocs and frees. Besides knowing free memory is usually more of a debug feature, I've never known a reason to need a high performance free memory. The best way if you want it to be fast is to keep track of mallocs and frees which is easy enough.
OK, you are right. Luckily I use my own memory manager (malloc, free, new, delete, etc. all go through it), so getting the exact amount of free memory is totally free in my case ;)

Cash
« Last Edit: April 13, 2008, 08:41:07 AM by Cash »

Flatmush
« Reply #5 on: April 13, 2008, 08:43:58 AM »

Quote
but the default operator= will use memcpy also for small structures / classes.
Dunno where you heard that but it's totally untrue. I did some benchmarks using the equals operator on a 256-bit (32 byte) structure in my memCopy and it was much faster than memcpy. If memcpy was used then my function would have been much slower with performance through the floor, but it wasn't.
If you download funclib and look at some of the memory alignment and memory speed benchmarking programs included you can see that.

Actually it may be used in C++ for classes (not structures), but that's one of the many reasons that I don't use C++. Still, with optimizations enabled, I can't see why a good compiler would use memcpy rather than pure assembly.

Noware
« Reply #6 on: April 13, 2008, 09:32:18 AM »

Sorry, I was talking about sizes of less than 8 bytes. I truly believe that your function is faster than the default memcpy; it looks faster anyway...

Also, memcpy is used by other libs, and sometimes they don't even know in advance how many bytes they have to copy.

Quote
Actually it may be used in C++ for classes (not structures)
Classes and structures are the same thing in C++, except that class members are private by default and structure members are public by default.

Cash

Noware
« Reply #7 on: September 01, 2008, 02:13:10 PM »

Hi,

I know this is an old topic, but I think this is nice to know.

I finally did what Flatmush asked me to do: profile different kinds of memcpy.

To my surprise, things have changed a lot since 2 years ago (note that I never tested Flatmush's memcpy before).
When I tested this 2 years ago, Daniel Vik's memcpy was the fastest overall, even for small blocks of memory.

First the result:
GCC has the fastest memcpy for small blocks of memory. Daniel Vik's memcpy (see http://www.vik.cc/daniel/portfolio) is the fastest for large aligned blocks of memory but the slowest for small blocks. Finally (sorry), Flatmush's memcpy (see http://www.psp-programming.com/forums/index.php/topic,2731.0.html) is the slowest overall.

My test was done with GCC 4.3.1 with -O5 added to the compiler flags, and counts the number of ticks needed to copy X bytes 1000 times.


Here is the profiling code, so you can run the test yourself (or find bugs in my code):
Code:
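#include <stdio.h>      // printf
#include <stdlib.h>     // malloc, free
#include <string.h>     // memcpy
#include <psprtc.h>     // sceRtcGetCurrentTick
#include <psppower.h>   // scePowerTick

// i64 is assumed to be a project typedef for a signed 64-bit integer;
// u64 comes from the PSPSDK type headers.
i64 GetCurrentTick()
{
    u64 tick;
    sceRtcGetCurrentTick(&tick);
    return (i64)tick;
}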

void ProfileMemCopy()
{
    unsigned char* block1 = (unsigned char*)malloc(65536 + 32);
    unsigned char* block2 = (unsigned char*)malloc(65536 + 32);

#define COPY(d, s, n, a) {                                                      \
    int gcc_elapsed = 0;                                                        \
    {                                                                           \
        u64 time = GetCurrentTick();                                            \
        for (int j=0; j<1000; ++j)                                              \
            memcpy(d, s, n);                                                    \
        gcc_elapsed = (int)(GetCurrentTick()-time);                             \
    }                                                                           \
    int danielvik_elapsed = 0;                                                  \
    {                                                                           \
        u64 time = GetCurrentTick();                                            \
        for (int j=0; j<1000; ++j)                                              \
            MemCopyDanielVik(d, s, n);                                          \
        danielvik_elapsed = (int)(GetCurrentTick()-time);                       \
    }                                                                           \
    int flatmush_elapsed = 0;                                                   \
    {                                                                           \
        u64 time = GetCurrentTick();                                            \
        for (int j=0; j<1000; ++j)                                              \
            MemCopyFlatmush(d, s, n);                                           \
        flatmush_elapsed = (int)(GetCurrentTick()-time);                        \
    }                                                                           \
    scePowerTick(0);                                                            \
    printf("%6d bytes (%20s), %12d, %12d, %12d\n", (int)n, a, gcc_elapsed, danielvik_elapsed, flatmush_elapsed); \
    }

#define COPY_LOOP(d, s, a)                                                      \
    COPY(d, s, 1, a)                                                            \
    COPY(d, s, 2, a)                                                            \
    COPY(d, s, 4, a)                                                            \
    COPY(d, s, 8, a)                                                            \
    COPY(d, s, 16, a)                                                           \
    COPY(d, s, 32, a)                                                           \
    COPY(d, s, 64, a)                                                           \
    COPY(d, s, 128, a)                                                          \
    COPY(d, s, 256, a)                                                          \
    COPY(d, s, 512, a)                                                          \
    COPY(d, s, 1024, a)                                                         \
    COPY(d, s, 2048, a)                                                         \
    COPY(d, s, 4096, a)                                                         \
    COPY(d, s, 8192, a)                                                         \
    COPY(d, s, 16384, a)                                                        \
    COPY(d, s, 32768, a)                                                        \
    COPY(d, s, 65536, a)

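    // Note: COPY_LOOP takes (dest, source, label), so in the rows labelled
    // "source + N" it is the destination pointer that is offset, and vice versa.
    COPY_LOOP(block1, block2, "4 bytes aligned");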
    COPY_LOOP(&block1[1], block2, "source + 1");
    COPY_LOOP(&block1[2], block2, "source + 2");
    COPY_LOOP(&block1[3], block2, "source + 3");
    COPY_LOOP(block1, &block2[1], "dest + 1");
    COPY_LOOP(block1, &block2[2], "dest + 2");
    COPY_LOOP(block1, &block2[3], "dest + 3");
    COPY_LOOP(&block1[1], &block2[3], "source + 1, dest + 3");
    COPY_LOOP(&block1[2], &block2[2], "source + 2, dest + 2");
    COPY_LOOP(&block1[3], &block2[1], "source + 3, dest + 1");

    // TODO Add unaligned copy tests

    free(block1);
    free(block2);

#undef COPY
#undef COPY_LOOP
}
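
The measurement loop above can also be written once with a function pointer instead of three macro copies. The sketch below is an assumption-laden variant, not the code that produced the results in this post: it uses the standard `clock()` instead of `sceRtcGetCurrentTick` so it can run off-device, and `copy_fn` and `time_copy` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by every memcpy-style function under test. */
typedef void *(*copy_fn)(void *dest, const void *src, size_t count);

/* Call fn `iterations` times and return the elapsed clock() ticks. */
long time_copy(copy_fn fn, void *dest, const void *src, size_t count, int iterations)
{
    /* The volatile pointer is re-read on every iteration, which keeps the
       compiler from inlining or folding away the repeated calls. */
    copy_fn volatile call = fn;
    clock_t start = clock();

    for (int j = 0; j < iterations; ++j) {
        call(dest, src, count);
    }
    return (long)(clock() - start);
}
```

For example, `time_copy(memcpy, block1, block2, 1024, 1000)` times the library memcpy, and any replacement with the same signature can be passed in its place.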

here are the results:
Code:
       bytes    src / dst alignment           GCC    Daniel Vik      Flatmush

     1 bytes (     4 bytes aligned),           30,          157,           90
     2 bytes (     4 bytes aligned),           47,          193,          148
     4 bytes (     4 bytes aligned),           44,          266,          102
     8 bytes (     4 bytes aligned),           80,          388,          174
    16 bytes (     4 bytes aligned),          148,          470,          262
    32 bytes (     4 bytes aligned),          293,          493,          424
    64 bytes (     4 bytes aligned),          727,          657,          750
   128 bytes (     4 bytes aligned),         1434,          981,         1401
   256 bytes (     4 bytes aligned),         2847,         1768,         2742
   512 bytes (     4 bytes aligned),         5716,         2936,         5440
  1024 bytes (     4 bytes aligned),        11460,         5594,        10528
  2048 bytes (     4 bytes aligned),        22820,        10861,        21069
  4096 bytes (     4 bytes aligned),        45630,        21384,        42103
  8192 bytes (     4 bytes aligned),        93212,        43205,        90628
 16384 bytes (     4 bytes aligned),       364419,       274365,       346937
 32768 bytes (     4 bytes aligned),       679297,       548399,       692750
 65536 bytes (     4 bytes aligned),      1358228,      1096055,      1384869
     1 bytes (          source + 1),           30,          157,           39
     2 bytes (          source + 1),           48,          193,          184
     4 bytes (          source + 1),           44,          297,          275
     8 bytes (          source + 1),           79,          492,          429
    16 bytes (          source + 1),          148,          601,          755
    32 bytes (          source + 1),          293,          819,         1407
    64 bytes (          source + 1),          728,         1219,         2757
   128 bytes (          source + 1),         1434,         1720,         5355
   256 bytes (          source + 1),         2887,         2916,        10663
   512 bytes (          source + 1),         5675,         5346,        21167
  1024 bytes (          source + 1),        11505,        10086,        42093
  2048 bytes (          source + 1),        22819,        19855,        84032
  4096 bytes (          source + 1),        45656,        39048,       168130
  8192 bytes (          source + 1),        94151,        78420,       355967
 16384 bytes (          source + 1),       353287,       328482,       861445
 32768 bytes (          source + 1),       649488,       656018,      1721655
 65536 bytes (          source + 1),      1298463,      1311346,      3442265
     1 bytes (          source + 2),           30,          157,           61
     2 bytes (          source + 2),           48,          193,          129
     4 bytes (          source + 2),           44,          297,          202
     8 bytes (          source + 2),           80,          484,          275
    16 bytes (          source + 2),          148,          593,          437
    32 bytes (          source + 2),          293,          810,          764
    64 bytes (          source + 2),          728,         1113,         1415
   128 bytes (          source + 2),         1435,         1711,         2756
   256 bytes (          source + 2),         2849,         2945,         5326
   512 bytes (          source + 2),         5718,         5296,        10666
  1024 bytes (          source + 2),        11510,        10185,        21066
  2048 bytes (          source + 2),        22926,        19739,        42020
  4096 bytes (          source + 2),        45692,        39006,        84206
  8192 bytes (          source + 2),        93976,        78564,       179612
 16384 bytes (          source + 2),       353292,       328411,       512551
 32768 bytes (          source + 2),       649345,       655968,      1023611
 65536 bytes (          source + 2),      1298321,      1311197,      2046832
     1 bytes (          source + 3),           30,          157,           39
     2 bytes (          source + 3),           48,          193,          184
     4 bytes (          source + 3),           44,          297,          276
     8 bytes (          source + 3),           80,          434,          430
    16 bytes (          source + 3),          148,          542,          756
    32 bytes (          source + 3),          293,          760,         1407
    64 bytes (          source + 3),          728,         1207,         2711
   128 bytes (          source + 3),         1435,         1661,         5356
   256 bytes (          source + 3),         2848,         2896,        10666
   512 bytes (          source + 3),         5715,         5245,        21174
  1024 bytes (          source + 3),        11459,        10076,        42101
  2048 bytes (          source + 3),        22817,        19689,        84039
  4096 bytes (          source + 3),        45610,        39066,       168157
  8192 bytes (          source + 3),        94141,        78505,       357734
 16384 bytes (          source + 3),       353236,       328402,       865486
 32768 bytes (          source + 3),       649373,       656020,      1729987
 65536 bytes (          source + 3),      1298300,      1311287,      3459011
     1 bytes (            dest + 1),           30,          157,           89
     2 bytes (            dest + 1),           48,          194,          235
     4 bytes (            dest + 1),           44,          297,          316
     8 bytes (            dest + 1),           80,          434,          473
    16 bytes (            dest + 1),          148,          543,          796
    32 bytes (            dest + 1),          293,          638,         1448
    64 bytes (            dest + 1),          728,          938,         2751
   128 bytes (            dest + 1),         1436,         1535,         5476
   256 bytes (            dest + 1),         2848,         2767,        10697
   512 bytes (            dest + 1),         5719,         5254,        21091
  1024 bytes (            dest + 1),        11516,         9901,        42133
  2048 bytes (            dest + 1),        22886,        19596,        84074
  4096 bytes (            dest + 1),        45716,        38819,       168219
  8192 bytes (            dest + 1),        94515,        79535,       354940
 16384 bytes (            dest + 1),       351116,       340543,       858143
 32768 bytes (            dest + 1),       647884,       679938,      1714881
 65536 bytes (            dest + 1),      1294249,      1358843,      3427877
     1 bytes (            dest + 2),           31,          157,           89
     2 bytes (            dest + 2),           49,          198,          153
     4 bytes (            dest + 2),           44,          302,          225
     8 bytes (            dest + 2),           80,          435,          297
    16 bytes (            dest + 2),          148,          543,          461
    32 bytes (            dest + 2),          293,          675,          786
    64 bytes (            dest + 2),          728,          938,         1439
   128 bytes (            dest + 2),         1521,         1582,         2775
   256 bytes (            dest + 2),         2848,         2730,         5521
   512 bytes (            dest + 2),         5717,         5119,        10692
  1024 bytes (            dest + 2),        11508,         9999,        21089
  2048 bytes (            dest + 2),        22928,        19564,        42058
  4096 bytes (            dest + 2),        45589,        38942,        84165
  8192 bytes (            dest + 2),        94340,        79512,       178785
 16384 bytes (            dest + 2),       351121,       340380,       509739
 32768 bytes (            dest + 2),       647927,       679893,      1017893
 65536 bytes (            dest + 2),      1294358,      1358738,      2034686
     1 bytes (            dest + 3),           31,          161,           90
     2 bytes (            dest + 3),           48,          198,          297
     4 bytes (            dest + 3),           44,          302,          316
     8 bytes (            dest + 3),           80,          434,          472
    16 bytes (            dest + 3),          148,          543,          796
    32 bytes (            dest + 3),          293,          638,         1448
    64 bytes (            dest + 3),          728,          939,         2751
   128 bytes (            dest + 3),         1435,         1670,         5394
   256 bytes (            dest + 3),         2848,         2767,        10700
   512 bytes (            dest + 3),         5846,         5120,        21109
  1024 bytes (            dest + 3),        11511,        10003,        42057
  2048 bytes (            dest + 3),        22940,        19565,        84059
  4096 bytes (            dest + 3),        45711,        38831,       168201
  8192 bytes (            dest + 3),        94529,        79514,       353219
 16384 bytes (            dest + 3),       351147,       340521,       853679
 32768 bytes (            dest + 3),       647869,       679923,      1705541
 65536 bytes (            dest + 3),      1294483,      1358795,      3410402
     1 bytes (source + 1, dest + 3),           31,          161,           39
     2 bytes (source + 1, dest + 3),           49,          197,          194
     4 bytes (source + 1, dest + 3),           44,          301,          275
     8 bytes (source + 1, dest + 3),           80,          584,          430
    16 bytes (source + 1, dest + 3),          149,          606,          756
    32 bytes (source + 1, dest + 3),          293,          823,         1407
    64 bytes (source + 1, dest + 3),          728,         1215,         2794
   128 bytes (source + 1, dest + 3),         1435,         1725,         5355
   256 bytes (source + 1, dest + 3),         2849,         2959,        10659
   512 bytes (source + 1, dest + 3),         5719,         5310,        21179
  1024 bytes (source + 1, dest + 3),        11377,        10226,        42070
  2048 bytes (source + 1, dest + 3),        22830,        19751,        84022
  4096 bytes (source + 1, dest + 3),        45703,        39015,       168137
  8192 bytes (source + 1, dest + 3),        95091,        79727,       353672
 16384 bytes (source + 1, dest + 3),       373440,       340565,       858436
 32768 bytes (source + 1, dest + 3),       727850,       679973,      1714968
 65536 bytes (source + 1, dest + 3),      1454780,      1358868,      3429155
     1 bytes (source + 2, dest + 2),           32,          156,           62
     2 bytes (source + 2, dest + 2),           48,          193,          131
     4 bytes (source + 2, dest + 2),           45,          297,          202
     8 bytes (source + 2, dest + 2),           80,          446,          276
    16 bytes (source + 2, dest + 2),          148,          529,          438
    32 bytes (source + 2, dest + 2),          329,          691,          764
    64 bytes (source + 2, dest + 2),          728,          859,         1416
   128 bytes (source + 2, dest + 2),         1435,         1185,         2720
   256 bytes (source + 2, dest + 2),         2937,         1885,         5362
   512 bytes (source + 2, dest + 2),         5676,         3307,        10542
  1024 bytes (source + 2, dest + 2),        11505,         5749,        21169
  2048 bytes (source + 2, dest + 2),        22823,        11076,        42027
  4096 bytes (source + 2, dest + 2),        45695,        21529,        84138
  8192 bytes (source + 2, dest + 2),        95245,        44542,       179694
 16384 bytes (source + 2, dest + 2),       373649,       275084,       512546
 32768 bytes (source + 2, dest + 2),       727856,       548990,      1023607
 65536 bytes (source + 2, dest + 2),      1454913,      1096913,      2046149
     1 bytes (source + 3, dest + 1),           30,          156,           39
     2 bytes (source + 3, dest + 1),           47,          193,          195
     4 bytes (source + 3, dest + 1),           43,          297,          276
     8 bytes (source + 3, dest + 1),           79,          438,          429
    16 bytes (source + 3, dest + 1),          148,          547,          755
    32 bytes (source + 3, dest + 1),          293,          765,         1503
    64 bytes (source + 3, dest + 1),          728,         1068,         2711
   128 bytes (source + 3, dest + 1),         1435,         1666,         5434
   256 bytes (source + 3, dest + 1),         2849,         2861,        10696
   512 bytes (source + 3, dest + 1),         5803,         5300,        21062
  1024 bytes (source + 3, dest + 1),        11510,        10135,        42012
  2048 bytes (source + 3, dest + 1),        22932,        19695,        84023
  4096 bytes (source + 3, dest + 1),        45700,        38957,       168135
  8192 bytes (source + 3, dest + 1),        95248,        79564,       357012
 16384 bytes (source + 3, dest + 1),       373549,       328704,       869061
 32768 bytes (source + 3, dest + 1),       727810,       656416,      1737751
 65536 bytes (source + 3, dest + 1),      1454995,      1311726,      3474254

Noware

« Last Edit: September 01, 2008, 02:32:01 PM by Noware »

Flatmush
« Reply #8 on: September 02, 2008, 02:02:25 AM »

My memcpy is optimized for the PSP and is 3 times faster than the newlib memcpy for the PSP. GCC doesn't have a memcpy; that's part of newlib. Also, because the PSP uses a 256-bit wide bus, my memcpy works best for memory aligned to 256 bits (32 bytes), so it works out faster if you make sure malloc aligns to 32 bytes rather than the default 16.
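
One way to get 32-byte-aligned blocks without changing malloc itself is to over-allocate and round the pointer up, as in the sketch below. `alloc32` and `free32` are hypothetical helper names used only for this illustration; changing the allocator's base alignment, as suggested above, achieves the same thing for every allocation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate `size` bytes at an address that is a multiple of 32.
   The pointer returned by malloc is stashed just before the aligned block. */
void *alloc32(size_t size)
{
    unsigned char *raw = malloc(size + 31 + sizeof(void *));
    if (raw == NULL) {
        return NULL;
    }
    uintptr_t start = (uintptr_t)(raw + sizeof(void *));
    unsigned char *aligned = raw + sizeof(void *) + ((32 - (start & 31)) & 31);
    ((void **)aligned)[-1] = raw;   /* remember what to pass to free() */
    return aligned;
}

/* Release a block obtained from alloc32. */
void free32(void *ptr)
{
    if (ptr != NULL) {
        free(((void **)ptr)[-1]);
    }
}
```

Blocks from `alloc32` must be released with `free32`, never with plain `free`.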

Edit: My memcopy from My Memory Findings isn't the complete thing; it's just a sample. Check out the memCopy in funcLib and you should find it a little more complex but a lot faster. There are still ways to speed it up though: first, it needs doing in asm; second, it needs to do unaligned copies by aligning and rotational shifting, but I got bored and decided it wasn't worth the effort.
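
The "aligning and rotational shifting" technique can be sketched as follows: the destination is word-aligned, the source is not, and each output word is assembled from two aligned source words with shifts. This is an illustration only, not code from funcLib or from Daniel Vik's memcpy. The name `copy_shifted_le` is made up, little-endian byte order is assumed, and the source offset must be 1, 2 or 3.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n_words 32-bit words to an aligned dst from a source that sits
   1..3 bytes past a 4-byte boundary, using only aligned loads.
   Little-endian only. The aligned loads also touch up to 3 bytes before
   src and a few bytes past the last source byte, all within the same
   aligned words, so those words must be readable. */
void copy_shifted_le(uint32_t *dst, const unsigned char *src, size_t n_words)
{
    unsigned off = (unsigned)((uintptr_t)src & 3u);    /* must be 1, 2 or 3 */
    const uint32_t *s = (const uint32_t *)(src - off); /* aligned word holding src[0] */
    unsigned rshift = off * 8u;
    unsigned lshift = 32u - rshift;

    if (n_words == 0) {
        return;
    }

    uint32_t cur = *s++;
    while (n_words--) {
        uint32_t next = *s++;
        /* The upper bytes of cur become the low bytes of the output word,
           the lower bytes of next become its high bytes. */
        *dst++ = (cur >> rshift) | (next << lshift);
        cur = next;
    }
}
```

A full memcpy would first copy bytes until the destination is aligned, then use a loop like this for the bulk, then finish the tail byte by byte.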

Noware
« Reply #9 on: September 02, 2008, 09:04:35 AM »

Hi Flatmush,

I'm testing general-purpose memcpy implementations (as replacements for the default memcpy from newlib), not special cases.

Quote
Edit: My memcopy from My Memory Findings isn't the complete thing, its just a sample. Checkout the memCopy in funcLib and you should find it a little more complex but a lot faster.
Ok, where can I find funcLib?

[EDIT] found it, I will post the results later

Quote
There are still ways to speed it up though, first it needs doing in asm, second it needs to do unaligned copies by aligning and rotational shifting but I got bored and decided it wasn't worth the effort.
Yep that is what Daniel Vik's memcpy is doing (without the asm part).

Also I forgot to mention, this test is compiled in C++.

Noware
« Last Edit: September 02, 2008, 09:14:34 AM by Noware »

Noware
« Reply #10 on: September 02, 2008, 09:46:17 AM »

Hi Flatmush,

Here are my results with your memcpy from funcLib_1.0.1. Note that I also changed the base memory alignment to 32 bytes.

[EDIT]
To make the test fair, I just put your memcpy into a C file and retested it, but got the same results.

As you can see, your memcpy is the fastest when copying 64 to 16384 bytes (32-byte aligned).

Code:
       bytes    src / dst alignment        newlib    Daniel Vik      Flatmush

     1 bytes (     4 bytes aligned),           30,          157,          212
     2 bytes (     4 bytes aligned),           47,          290,          234
     4 bytes (     4 bytes aligned),           43,          267,          212
     8 bytes (     4 bytes aligned),           79,          389,          211
    16 bytes (     4 bytes aligned),          147,          471,          236
    32 bytes (     4 bytes aligned),          292,          494,          293
    64 bytes (     4 bytes aligned),          636,          657,          456
   128 bytes (     4 bytes aligned),         1433,          984,          782
   256 bytes (     4 bytes aligned),         2877,         1694,         1433
   512 bytes (     4 bytes aligned),         5804,         2943,         2775
  1024 bytes (     4 bytes aligned),        11319,         5596,         5488
  2048 bytes (     4 bytes aligned),        19822,        10822,        10700
  4096 bytes (     4 bytes aligned),        45485,        21447,        21250
  8192 bytes (     4 bytes aligned),        93736,        44658,        44284
 16384 bytes (     4 bytes aligned),       349527,       253868,       253600
 32768 bytes (     4 bytes aligned),       648711,       506118,       506146
 65536 bytes (     4 bytes aligned),      1297196,      1010467,      1011059

But in most other cases it's the slowest; see below.

Code:
       bytes    src / dst alignment        newlib    Daniel Vik      Flatmush

     1 bytes (          source + 1),           29,          158,          239
     2 bytes (          source + 1),           47,          228,          248
     4 bytes (          source + 1),           43,          297,          320
     8 bytes (          source + 1),           79,          493,          466
    16 bytes (          source + 1),          146,          602,          755
    32 bytes (          source + 1),          292,          820,         1335
    64 bytes (          source + 1),          636,         1125,         2492
   128 bytes (          source + 1),         1433,         1724,         4847
   256 bytes (          source + 1),         2845,         2920,         9483
   512 bytes (          source + 1),         5670,         5346,        18848
  1024 bytes (          source + 1),        11417,        10135,        37489
  2048 bytes (          source + 1),        19820,        19852,        74832
  4096 bytes (          source + 1),        45468,        39291,       149658
  8192 bytes (          source + 1),        94298,        79722,       301802
 16384 bytes (          source + 1),       378065,       317656,       798925
 32768 bytes (          source + 1),       729723,       633559,      1597221
 65536 bytes (          source + 1),      1459384,      1265585,      3193823
     1 bytes (          source + 2),           29,          158,          212
     2 bytes (          source + 2),           47,          194,          253
     4 bytes (          source + 2),           43,          297,          272
     8 bytes (          source + 2),           79,          485,          429
    16 bytes (          source + 2),          147,          595,          488
    32 bytes (          source + 2),          293,          812,          778
    64 bytes (          source + 2),          636,         1116,         1492
   128 bytes (          source + 2),         1471,         1713,         2516
   256 bytes (          source + 2),         2845,         2912,         5003
   512 bytes (          source + 2),         5669,         5303,         9651
  1024 bytes (          source + 2),        11431,        10126,        18877
  2048 bytes (          source + 2),        19818,        19839,        37588
  4096 bytes (          source + 2),        45608,        39147,        75010
  8192 bytes (          source + 2),        94125,        79873,       152044
 16384 bytes (          source + 2),       378044,       317580,       501413
 32768 bytes (          source + 2),       729640,       633534,      1001932
 65536 bytes (          source + 2),      1459355,      1265304,      2002851
     1 bytes (          source + 3),           29,          158,          257
     2 bytes (          source + 3),           48,          194,          248
     4 bytes (          source + 3),           43,          298,          321
     8 bytes (          source + 3),           79,          435,          465
    16 bytes (          source + 3),          147,          543,          755
    32 bytes (          source + 3),          292,          761,         1334
    64 bytes (          source + 3),          636,         1065,         2493
   128 bytes (          source + 3),         1467,         1665,         4945
   256 bytes (          source + 3),         2845,         2860,         9612
   512 bytes (          source + 3),         5759,         5252,        18890
  1024 bytes (          source + 3),        11318,        10075,        37590
  2048 bytes (          source + 3),        19820,        19785,        74736
  4096 bytes (          source + 3),        45604,        39089,       149630
  8192 bytes (          source + 3),        94301,        79818,       301671
 16384 bytes (          source + 3),       377978,       317542,       781481
 32768 bytes (          source + 3),       729747,       633426,      1561260
 65536 bytes (          source + 3),      1459342,      1265321,      3120640
     1 bytes (            dest + 1),           29,          157,          299
     2 bytes (            dest + 1),           48,          193,          276
     4 bytes (            dest + 1),           43,          332,          366
     8 bytes (            dest + 1),           79,          435,          593
    16 bytes (            dest + 1),          148,          544,          837
    32 bytes (            dest + 1),          292,          640,         1430
    64 bytes (            dest + 1),          726,          938,         2591
   128 bytes (            dest + 1),         1432,         1596,         4939
   256 bytes (            dest + 1),         2845,         2733,         9707
   512 bytes (            dest + 1),         5669,         5126,        18979
  1024 bytes (            dest + 1),        11419,         9949,        37593
  2048 bytes (            dest + 1),        22716,        19672,        74942
  4096 bytes (            dest + 1),        45598,        39019,       149760
  8192 bytes (            dest + 1),        93764,        79481,       301910
 16384 bytes (            dest + 1),       347007,       317950,       759472
 32768 bytes (            dest + 1),       647689,       634842,      1517598
 65536 bytes (            dest + 1),      1294498,      1268458,      3034157
     1 bytes (            dest + 2),           29,          158,          212
     2 bytes (            dest + 2),           47,          193,          234
     4 bytes (            dest + 2),           42,          299,          289
     8 bytes (            dest + 2),           79,          431,          380
    16 bytes (            dest + 2),          234,          540,          543
    32 bytes (            dest + 2),          292,          635,          846
    64 bytes (            dest + 2),          726,          935,         1426
   128 bytes (            dest + 2),         1469,         1533,         2585
   256 bytes (            dest + 2),         2845,         2728,         5075
   512 bytes (            dest + 2),         5669,         5158,         9671
  1024 bytes (            dest + 2),        11415,        10094,        18913
  2048 bytes (            dest + 2),        22725,        19666,        37637
  4096 bytes (            dest + 2),        45590,        38953,        75089
  8192 bytes (            dest + 2),        93694,        79646,       152015
 16384 bytes (            dest + 2),       347007,       317867,       458607
 32768 bytes (            dest + 2),       647709,       634797,       914786
 65536 bytes (            dest + 2),      1294658,      1268432,      1826607
     1 bytes (            dest + 3),           29,          162,          216
     2 bytes (            dest + 3),           48,          198,          281
     4 bytes (            dest + 3),           43,          303,          370
     8 bytes (            dest + 3),           79,          435,          533
    16 bytes (            dest + 3),          147,          544,          841
    32 bytes (            dest + 3),          292,          639,         1519
    64 bytes (            dest + 3),          727,          938,         2593
   128 bytes (            dest + 3),         1433,         1679,         4910
   256 bytes (            dest + 3),         2846,         2732,         9712
   512 bytes (            dest + 3),         5669,         5307,        18909
  1024 bytes (            dest + 3),        11415,        10089,        37560
  2048 bytes (            dest + 3),        22806,        19668,        74831
  4096 bytes (            dest + 3),        45597,        38966,       149732
  8192 bytes (            dest + 3),        93851,        79650,       301744
 16384 bytes (            dest + 3),       346999,       317993,       757269
 32768 bytes (            dest + 3),       647716,       634870,      1512931
 65536 bytes (            dest + 3),      1294466,      1268462,      3024698
     1 bytes (source + 1, dest + 3),           29,          162,          243
     2 bytes (source + 1, dest + 3),           47,          198,          339
     4 bytes (source + 1, dest + 3),           42,          303,          325
     8 bytes (source + 1, dest + 3),           79,          498,          469
    16 bytes (source + 1, dest + 3),          148,          607,          844
    32 bytes (source + 1, dest + 3),          292,          824,         1338
    64 bytes (source + 1, dest + 3),          727,         1131,         2497
   128 bytes (source + 1, dest + 3),         1434,         1728,         4979
   256 bytes (source + 1, dest + 3),         2846,         2924,         9617
   512 bytes (source + 1, dest + 3),         5709,         5316,        18942
  1024 bytes (source + 1, dest + 3),        11419,        10289,        37456
  2048 bytes (source + 1, dest + 3),        22764,        19848,        74742
  4096 bytes (source + 1, dest + 3),        45608,        39152,       149662
  8192 bytes (source + 1, dest + 3),        94150,        79911,       301699
 16384 bytes (source + 1, dest + 3),       348000,       318247,       758704
 32768 bytes (source + 1, dest + 3),       648196,       635062,      1516280
 65536 bytes (source + 1, dest + 3),      1295451,      1268595,      3031313
     1 bytes (source + 2, dest + 2),           30,          159,          212
     2 bytes (source + 2, dest + 2),           47,          194,          252
     4 bytes (source + 2, dest + 2),           43,          299,          271
     8 bytes (source + 2, dest + 2),           79,          448,          344
    16 bytes (source + 2, dest + 2),          146,          529,          489
    32 bytes (source + 2, dest + 2),          341,          693,          872
    64 bytes (source + 2, dest + 2),          726,          861,         1357
   128 bytes (source + 2, dest + 2),         1433,         1188,         2516
   256 bytes (source + 2, dest + 2),         2845,         1840,         4872
   512 bytes (source + 2, dest + 2),         5670,         3287,         9502
  1024 bytes (source + 2, dest + 2),        11417,         5795,        18876
  2048 bytes (source + 2, dest + 2),        22714,        11175,        37479
  4096 bytes (source + 2, dest + 2),        45525,        21650,        74965
  8192 bytes (source + 2, dest + 2),        94131,        44842,       151922
 16384 bytes (source + 2, dest + 2),       347943,       253951,       458896
 32768 bytes (source + 2, dest + 2),       648055,       506185,       915010
 65536 bytes (source + 2, dest + 2),      1295397,      1010665,      1826864
     1 bytes (source + 3, dest + 1),           30,          157,          257
     2 bytes (source + 3, dest + 1),           47,          194,          248
     4 bytes (source + 3, dest + 1),           43,          298,          320
     8 bytes (source + 3, dest + 1),           79,          502,          466
    16 bytes (source + 3, dest + 1),          147,          548,          754
    32 bytes (source + 3, dest + 1),          293,          766,         1335
    64 bytes (source + 3, dest + 1),          776,         1071,         2588
   128 bytes (source + 3, dest + 1),         1433,         1668,         4847
   256 bytes (source + 3, dest + 1),         2896,         2864,         9566
   512 bytes (source + 3, dest + 1),         5762,         5257,        18888
  1024 bytes (source + 3, dest + 1),        11318,        10078,        37579
  2048 bytes (source + 3, dest + 1),        22716,        19901,        74738
  4096 bytes (source + 3, dest + 1),        45595,        39092,       149664
  8192 bytes (source + 3, dest + 1),        93982,        79739,       301811
 16384 bytes (source + 3, dest + 1),       348194,       317505,       790584
 32768 bytes (source + 3, dest + 1),       648093,       633424,      1580938
 65536 bytes (source + 3, dest + 1),      1295211,      1265323,      3161605

Noware
« Last Edit: September 02, 2008, 10:08:30 AM by Noware » Logged

Flatmush
« Reply #11 on: September 02, 2008, 10:02:43 AM »

Yeah, that's as expected. I don't really use memcpy for small amounts of memory, or really care about its performance in those cases; also, I make a point of always using aligned memory.

There are still a few possible improvements to it, but as Raphael said in the thread, memcpy isn't often called at points where performance is required.

Still, this is tempting me to have another attempt at a fast memcpy. I may post back with a better version, if I get around to it.

When unaligned, my memcpy currently just gives up and does a bytewise copy, the same as the original newlib implementation did, so it was better in all cases.
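To make the "gives up and does a bytewise copy" behaviour concrete, here is a rough sketch of that strategy. This is not the funcLib source: simple_memcpy is a made-up name, it copies 32 bits at a time rather than anything wider, and it uses the GCC/Clang may_alias attribute. Word copies are used only when both pointers are 4-byte aligned; everything else goes through the byte loop.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: a 32-bit type that may alias any other type. */
typedef uint32_t __attribute__((__may_alias__)) word_t;

/* Hypothetical name: word copies if both pointers are aligned, bytes otherwise. */
void *simple_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* OR-ing the addresses checks both alignments with a single test. */
    if ((((uintptr_t)d | (uintptr_t)s) & 3) == 0) {
        word_t *dw = (word_t *)d;
        const word_t *sw = (const word_t *)s;
        for (; n >= 4; n -= 4)
            *dw++ = *sw++;
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Tail bytes, or the whole copy when either pointer was unaligned. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

With a copy like this, a one-byte offset on either pointer moves the whole transfer onto the byte loop, however large it is.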
Noware
« Reply #12 on: September 02, 2008, 10:30:54 AM »

Hi Flatmush,

All true; that's why at my work we have all kinds of memcpy variants for special cases. You could also remove the checks in your memcpy, making it faster again.

Noware
Raphael
« Reply #13 on: September 02, 2008, 08:24:54 PM »

I took the opportunity to change your bench code a little and test it against Daniel's, Flatmush's and my own implementation of memcpy.
Here are the results:
Code:
                                        libc      daniel   flatmush    raphael
     1 bytes (     4 bytes aligned),    100.0%,     91.0%,     69.8%,    101.5%
     2 bytes (     4 bytes aligned),    100.0%,     92.7%,     77.0%,    101.2%
     4 bytes (     4 bytes aligned),    100.0%,    135.9%,    111.8%,    100.4%
     8 bytes (     4 bytes aligned),    100.0%,    125.5%,    170.2%,    150.6%
    16 bytes (     4 bytes aligned),    100.0%,     63.6%,    102.2%,     84.9%
    32 bytes (     4 bytes aligned),    100.0%,     87.9%,    116.7%,     89.4%
    64 bytes (     4 bytes aligned),    100.0%,     96.3%,    116.5%,    188.3%
   128 bytes (     4 bytes aligned),    100.0%,    104.1%,    116.6%,    243.1%
   256 bytes (     4 bytes aligned),    100.0%,    112.5%,    119.7%,    312.1%
   512 bytes (     4 bytes aligned),    100.0%,    111.3%,    116.6%,    350.7%
  1024 bytes (     4 bytes aligned),    100.0%,    112.3%,    117.3%,    381.5%
  2048 bytes (     4 bytes aligned),    100.0%,    116.2%,    118.5%,    390.1%
  4096 bytes (     4 bytes aligned),    100.0%,    116.1%,    116.1%,    412.4%
  8192 bytes (     4 bytes aligned),    100.0%,    116.6%,    117.3%,    416.0%
 16384 bytes (     4 bytes aligned),    100.0%,    103.9%,    104.0%,   1103.8%
 32768 bytes (     4 bytes aligned),    100.0%,    103.8%,    104.0%,    182.6%
 65536 bytes (     4 bytes aligned),    100.0%,    103.8%,    104.0%,    182.6%
     1 bytes (          source + 1),    100.0%,     92.4%,     62.4%,    100.8%
     2 bytes (          source + 1),    100.0%,     93.2%,     72.9%,    101.2%
     4 bytes (          source + 1),    100.0%,     83.8%,     75.9%,    100.4%
     8 bytes (          source + 1),    100.0%,     85.7%,     79.7%,     84.9%
    16 bytes (          source + 1),    100.0%,    118.0%,     79.1%,    106.4%
    32 bytes (          source + 1),    100.0%,    144.2%,     85.6%,    124.1%
    64 bytes (          source + 1),    100.0%,    200.1%,     86.5%,    138.8%
   128 bytes (          source + 1),    100.0%,    268.4%,     91.3%,    242.1%
   256 bytes (          source + 1),    100.0%,    283.3%,     87.1%,    348.2%
   512 bytes (          source + 1),    100.0%,    309.5%,     87.5%,    470.0%
  1024 bytes (          source + 1),    100.0%,    324.5%,     87.8%,    559.7%
  2048 bytes (          source + 1),    100.0%,    333.0%,     87.6%,    616.9%
  4096 bytes (          source + 1),    100.0%,    335.0%,     87.5%,    664.2%
  8192 bytes (          source + 1),    100.0%,    332.3%,     87.6%,    686.5%
 16384 bytes (          source + 1),    100.0%,    227.6%,     90.5%,    939.7%
 32768 bytes (          source + 1),    100.0%,    227.8%,     90.4%,    448.6%
 65536 bytes (          source + 1),    100.0%,    228.0%,     90.4%,    450.2%
     1 bytes (          source + 2),    100.0%,     91.7%,     69.5%,    100.8%
     2 bytes (          source + 2),    100.0%,     93.2%,     69.8%,    101.2%
     4 bytes (          source + 2),    100.0%,     83.5%,     91.2%,    100.4%
     8 bytes (          source + 2),    100.0%,     89.6%,    110.6%,     89.2%
    16 bytes (          source + 2),    100.0%,    121.7%,    131.9%,    109.6%
    32 bytes (          source + 2),    100.0%,    155.3%,    148.5%,    126.5%
    64 bytes (          source + 2),    100.0%,    211.2%,    159.8%,    137.3%
   128 bytes (          source + 2),    100.0%,    238.5%,    164.7%,    233.8%
   256 bytes (          source + 2),    100.0%,    300.1%,    175.0%,    357.4%
   512 bytes (          source + 2),    100.0%,    310.4%,    174.7%,    464.9%
  1024 bytes (          source + 2),    100.0%,    325.5%,    174.6%,    562.4%
  2048 bytes (          source + 2),    100.0%,    333.1%,    174.9%,    616.7%
  4096 bytes (          source + 2),    100.0%,    334.9%,    174.9%,    665.2%
  8192 bytes (          source + 2),    100.0%,    332.5%,    174.1%,    686.0%
 16384 bytes (          source + 2),    100.0%,    227.1%,    143.8%,    935.4%
 32768 bytes (          source + 2),    100.0%,    227.6%,    143.9%,    448.1%
 65536 bytes (          source + 2),    100.0%,    227.8%,    144.0%,    449.7%
     1 bytes (          source + 3),    100.0%,     91.0%,     55.2%,    101.5%
     2 bytes (          source + 3),    100.0%,     92.6%,     72.1%,    100.6%
     4 bytes (          source + 3),    100.0%,     83.4%,     75.8%,    100.4%
     8 bytes (          source + 3),    100.0%,     95.7%,     79.7%,     93.2%
    16 bytes (          source + 3),    100.0%,    129.1%,     84.0%,    113.4%
    32 bytes (          source + 3),    100.0%,    161.8%,     85.5%,    129.4%
    64 bytes (          source + 3),    100.0%,    217.2%,     86.5%,    142.9%
   128 bytes (          source + 3),    100.0%,    263.1%,     84.8%,    229.4%
   256 bytes (          source + 3),    100.0%,    297.3%,     86.4%,    353.0%
   512 bytes (          source + 3),    100.0%,    319.2%,     87.8%,    470.7%
  1024 bytes (          source + 3),    100.0%,    324.6%,     87.7%,    550.9%
  2048 bytes (          source + 3),    100.0%,    333.5%,     87.7%,    616.9%
  4096 bytes (          source + 3),    100.0%,    336.6%,     87.6%,    666.5%
  8192 bytes (          source + 3),    100.0%,    332.7%,     87.6%,    688.3%
 16384 bytes (          source + 3),    100.0%,    223.9%,     90.9%,    924.1%
 32768 bytes (          source + 3),    100.0%,    224.2%,     90.9%,    441.3%
 65536 bytes (          source + 3),    100.0%,    224.4%,     90.9%,    443.0%
     1 bytes (            dest + 1),    100.0%,     90.4%,     69.5%,    101.5%
     2 bytes (            dest + 1),    100.0%,     92.7%,     62.4%,    101.2%
     4 bytes (            dest + 1),    100.0%,     83.2%,     61.2%,    100.9%
     8 bytes (            dest + 1),    100.0%,     98.3%,     63.8%,    113.1%
    16 bytes (            dest + 1),    100.0%,    131.6%,     70.8%,    129.7%
    32 bytes (            dest + 1),    100.0%,    204.6%,     76.1%,    140.0%
    64 bytes (            dest + 1),    100.0%,    252.0%,     81.1%,    460.6%
   128 bytes (            dest + 1),    100.0%,    263.7%,     83.5%,    555.5%
   256 bytes (            dest + 1),    100.0%,    297.3%,     85.0%,    622.5%
   512 bytes (            dest + 1),    100.0%,    328.0%,     86.4%,    670.4%
  1024 bytes (            dest + 1),    100.0%,    334.7%,     87.4%,    675.0%
  2048 bytes (            dest + 1),    100.0%,    335.9%,     87.5%,    694.8%
  4096 bytes (            dest + 1),    100.0%,    337.9%,     87.5%,    706.0%
  8192 bytes (            dest + 1),    100.0%,    334.3%,     87.6%,    713.5%
 16384 bytes (            dest + 1),    100.0%,    215.1%,     90.0%,    901.0%
 32768 bytes (            dest + 1),    100.0%,    215.2%,     90.0%,    439.6%
 65536 bytes (            dest + 1),    100.0%,    215.3%,     90.0%,    440.1%
     1 bytes (            dest + 2),    100.0%,     94.5%,     72.5%,    105.4%
     2 bytes (            dest + 2),    100.0%,     94.9%,     78.9%,    103.7%
     4 bytes (            dest + 2),    100.0%,     85.2%,     81.1%,     87.2%
     8 bytes (            dest + 2),    100.0%,    105.0%,     91.1%,    114.7%
    16 bytes (            dest + 2),    100.0%,    137.8%,    102.6%,    130.3%
    32 bytes (            dest + 2),    100.0%,    211.8%,    127.5%,    140.7%
    64 bytes (            dest + 2),    100.0%,    246.2%,    146.3%,    460.6%
   128 bytes (            dest + 2),    100.0%,    300.8%,    161.9%,    573.9%
   256 bytes (            dest + 2),    100.0%,    320.1%,    170.1%,    597.2%
   512 bytes (            dest + 2),    100.0%,    329.0%,    169.5%,    670.3%
  1024 bytes (            dest + 2),    100.0%,    335.3%,    173.2%,    676.1%
  2048 bytes (            dest + 2),    100.0%,    336.2%,    174.2%,    695.0%
  4096 bytes (            dest + 2),    100.0%,    338.1%,    174.4%,    705.7%
  8192 bytes (            dest + 2),    100.0%,    335.8%,    174.4%,    713.2%
 16384 bytes (            dest + 2),    100.0%,    215.1%,    148.7%,    910.1%
 32768 bytes (            dest + 2),    100.0%,    215.2%,    148.8%,    439.7%
 65536 bytes (            dest + 2),    100.0%,    215.3%,    148.8%,    440.1%
     1 bytes (            dest + 3),    100.0%,    106.9%,     82.0%,    117.4%
     2 bytes (            dest + 3),    100.0%,    105.7%,     70.7%,    114.1%
     4 bytes (            dest + 3),    100.0%,     92.3%,     60.0%,    111.1%
     8 bytes (            dest + 3),    100.0%,    107.7%,     68.1%,    120.1%
    16 bytes (            dest + 3),    100.0%,    139.3%,     70.2%,    134.0%
    32 bytes (            dest + 3),    100.0%,    222.9%,     81.7%,    150.6%
    64 bytes (            dest + 3),    100.0%,    236.8%,     81.2%,    474.2%
   128 bytes (            dest + 3),    100.0%,    293.8%,     83.0%,    564.1%
   256 bytes (            dest + 3),    100.0%,    298.2%,     86.4%,    626.6%
   512 bytes (            dest + 3),    100.0%,    328.9%,     86.5%,    671.1%
  1024 bytes (            dest + 3),    100.0%,    333.2%,     87.3%,    675.5%
  2048 bytes (            dest + 3),    100.0%,    336.3%,     87.4%,    703.8%
  4096 bytes (            dest + 3),    100.0%,    336.9%,     87.5%,    705.9%
  8192 bytes (            dest + 3),    100.0%,    336.8%,     87.8%,    715.4%
 16384 bytes (            dest + 3),    100.0%,    215.2%,     90.2%,    908.4%
 32768 bytes (            dest + 3),    100.0%,    215.2%,     90.2%,    439.6%
 65536 bytes (            dest + 3),    100.0%,    215.3%,     90.2%,    440.0%
     1 bytes (source + 1, dest + 3),    100.0%,    105.4%,     72.8%,    119.2%
     2 bytes (source + 1, dest + 3),    100.0%,    105.1%,     82.3%,    114.8%
     4 bytes (source + 1, dest + 3),    100.0%,     91.9%,     83.6%,    110.6%
     8 bytes (source + 1, dest + 3),    100.0%,     93.3%,     84.9%,     90.6%
    16 bytes (source + 1, dest + 3),    100.0%,    124.3%,     87.2%,    110.2%
    32 bytes (source + 1, dest + 3),    100.0%,    156.6%,     87.3%,    126.5%
    64 bytes (source + 1, dest + 3),    100.0%,    211.4%,     86.1%,    140.3%
   128 bytes (source + 1, dest + 3),    100.0%,    264.4%,     89.4%,    237.3%
   256 bytes (source + 1, dest + 3),    100.0%,    298.7%,     88.8%,    336.4%
   512 bytes (source + 1, dest + 3),    100.0%,    317.3%,     87.5%,    465.3%
  1024 bytes (source + 1, dest + 3),    100.0%,    325.2%,     87.9%,    552.9%
  2048 bytes (source + 1, dest + 3),    100.0%,    333.2%,     87.6%,    623.5%
  4096 bytes (source + 1, dest + 3),    100.0%,    336.4%,     87.6%,    665.3%
  8192 bytes (source + 1, dest + 3),    100.0%,    329.7%,     87.7%,    691.2%
 16384 bytes (source + 1, dest + 3),    100.0%,    214.6%,     90.1%,    875.6%
 32768 bytes (source + 1, dest + 3),    100.0%,    214.9%,     90.0%,    430.0%
 65536 bytes (source + 1, dest + 3),    100.0%,    215.2%,     90.0%,    431.5%
     1 bytes (source + 2, dest + 2),    100.0%,     93.8%,     71.6%,    104.6%
     2 bytes (source + 2, dest + 2),    100.0%,     94.9%,     71.2%,    103.1%
     4 bytes (source + 2, dest + 2),    100.0%,     85.2%,     92.8%,    102.2%
     8 bytes (source + 2, dest + 2),    100.0%,    103.5%,    111.2%,     93.0%
    16 bytes (source + 2, dest + 2),    100.0%,    145.3%,    132.9%,    133.2%
    32 bytes (source + 2, dest + 2),    100.0%,    191.2%,    142.2%,    204.3%
    64 bytes (source + 2, dest + 2),    100.0%,    287.5%,    160.3%,    140.1%
   128 bytes (source + 2, dest + 2),    100.0%,    389.1%,    158.7%,    278.3%
   256 bytes (source + 2, dest + 2),    100.0%,    482.3%,    172.8%,    472.9%
   512 bytes (source + 2, dest + 2),    100.0%,    525.3%,    175.0%,    810.9%
  1024 bytes (source + 2, dest + 2),    100.0%,    581.8%,    173.8%,   1153.0%
  2048 bytes (source + 2, dest + 2),    100.0%,    596.6%,    174.9%,   1522.6%
  4096 bytes (source + 2, dest + 2),    100.0%,    610.8%,    174.8%,   1813.8%
  8192 bytes (source + 2, dest + 2),    100.0%,    583.6%,    173.6%,   1964.9%
 16384 bytes (source + 2, dest + 2),    100.0%,    268.8%,    148.6%,   2414.6%
 32768 bytes (source + 2, dest + 2),    100.0%,    269.3%,    148.8%,    473.2%
 65536 bytes (source + 2, dest + 2),    100.0%,    269.6%,    148.8%,    474.7%
     1 bytes (source + 3, dest + 1),    100.0%,     92.4%,     56.6%,    102.3%
     2 bytes (source + 3, dest + 1),    100.0%,     92.7%,     72.6%,    101.9%
     4 bytes (source + 3, dest + 1),    100.0%,     84.1%,     58.3%,    101.3%
     8 bytes (source + 3, dest + 1),    100.0%,    101.1%,     66.2%,     93.2%
    16 bytes (source + 3, dest + 1),    100.0%,    134.2%,     84.0%,    113.2%
    32 bytes (source + 3, dest + 1),    100.0%,    187.6%,     96.6%,    145.7%
    64 bytes (source + 3, dest + 1),    100.0%,    238.8%,     93.4%,    154.3%
   128 bytes (source + 3, dest + 1),    100.0%,    276.8%,     90.5%,    246.0%
   256 bytes (source + 3, dest + 1),    100.0%,    297.7%,     88.9%,    358.2%
   512 bytes (source + 3, dest + 1),    100.0%,    312.9%,     87.6%,    470.7%
  1024 bytes (source + 3, dest + 1),    100.0%,    328.8%,     87.4%,    561.5%
  2048 bytes (source + 3, dest + 1),    100.0%,    333.8%,     87.6%,    623.5%
  4096 bytes (source + 3, dest + 1),    100.0%,    336.8%,     87.6%,    664.9%
  8192 bytes (source + 3, dest + 1),    100.0%,    329.6%,     87.7%,    689.7%
 16384 bytes (source + 3, dest + 1),    100.0%,    227.0%,     91.3%,    924.8%
 32768 bytes (source + 3, dest + 1),    100.0%,    227.5%,     91.3%,    448.0%
 65536 bytes (source + 3, dest + 1),    100.0%,    227.7%,     91.2%,    449.4%
The percentage is how fast it is compared to libc (which is why libc always shows 100%). Higher is faster.
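As an illustration of how such a percentage can be derived from two timings, here is a small sketch. The exact formula used for the table is not shown in this post, so this assumes the obvious one (libc's time divided by the implementation's time); the helper name relative_speed_percent is made up.

```c
#include <stdint.h>

/* Hypothetical helper: speed of an implementation relative to libc, in percent.
   Assumes both timings were taken for the same size and alignment case. */
double relative_speed_percent(uint64_t libc_ticks, uint64_t impl_ticks)
{
    if (impl_ticks == 0)
        return 0.0; /* no valid measurement */
    return 100.0 * (double)libc_ticks / (double)impl_ticks;
}
```

Under that assumption, an implementation needing 50 ticks where libc needs 200 shows up as 400%, and one needing 400 ticks shows up as 50%.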
I used memalign(64) to allocate the test buffers, in favor of Flatmush's implementation. My own implementation works at the same speed with raw mallocs (aligned to 16 bytes).
What I noticed:
-The weakness of Flatmush's implementation, as already stated, is small copies and especially unaligned ones, as he didn't really care about those. It does all-round better than libc though.
-Daniel's does especially well with unaligned copies. It also has a special case for 4-byte aligned copies, which makes it stand out in that single case.
-Both (same as libc) behave badly when they start to reach the dcache size limit (16 KB), where my implementation still gets the full boost of the cache (hence the incredible peak of ~1000+% at 16 KB copies). This is because the others share the dcache for read and write.
-My implementation's weak point is source-unaligned copies in the range 8-128 bytes, as I just didn't build in enough special cases for that. It's still better than libc and Flatmush's there though.
-Measurements are inaccurate by ~1% and are sometimes off by quite a bit more (I had one run where it measured 150%+ for Daniel's and my implementation for <=4 byte aligned copies), which I can't exactly explain yet.
-I also can't fully explain the drop in performance for 16 and 32 byte aligned copies, as the same code as for 8 bytes kicks in... On libc's side its special code kicks in at 16 bytes, so that at least accounts for *a* drop, but I still don't quite see why I'm getting only ~80% where it should at least be close to 100% or in the mid 90s.

Explanation of my implementation:
I use VFPU copies for anything >= 64 bytes, bypassing the CPU write cache and instead making use of the VFPU write cache. This allows the whole 16 KB dcache to be used for the source, which explains the speed-up up to this size. For smaller sizes, I either fall back to a raw byte-by-byte copy (< 8 bytes) or align dst if needed and do 32-bit writes as far as possible. The biggest performance gain, for 2/2 unaligned 16 KB copies, comes from the fact that libc doesn't handle that case by copying 2 bytes and then doing aligned copies, but falls back to raw byte copies, while my implementation does the alignment correction. Hence the 2/2 unaligned copies are actually as fast as the aligned copies; libc just behaves worse, so I (and Daniel) get much better performance values there.
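The size tiers and the alignment correction described above can be sketched in portable C roughly as follows. This is only an illustration, not the implementation that was benchmarked: tiered_memcpy and copy_bytes are made-up names, the VFPU path for >= 64 bytes is only marked by a comment, and a source that is still unaligned after the dst fix-up simply falls back to bytes here.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: a 32-bit type that may alias any other type. */
typedef uint32_t __attribute__((__may_alias__)) word_t;

/* Plain byte loop, used for tiny copies, heads and tails. */
static void copy_bytes(unsigned char *d, const unsigned char *s, size_t n)
{
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}

/* Hypothetical name; portable stand-in for the tiers described above. */
void *tiered_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n < 8) {                      /* tiny copies: raw byte loop */
        copy_bytes(d, s, n);
        return dst;
    }

    /* Alignment correction: bring dst up to the next 4-byte boundary. */
    size_t head = (size_t)(-(uintptr_t)d & 3);
    copy_bytes(d, s, head);
    d += head;
    s += head;
    n -= head;

    if (((uintptr_t)s & 3) == 0) {
        /* src and dst had the same misalignment (e.g. the 2/2 case),
           so the rest can be copied with aligned 32-bit accesses. */
        word_t *dw = (word_t *)d;
        const word_t *sw = (const word_t *)s;
        size_t words = n / 4;
        /* A PSP build would hand blocks >= 64 bytes to a VFPU loop here (not shown). */
        for (size_t i = 0; i < words; i++)
            dw[i] = sw[i];
        d += words * 4;
        s += words * 4;
        n -= words * 4;
    }

    copy_bytes(d, s, n);              /* tail, or everything if src is still unaligned */
    return dst;
}
```

In the 2/2 case head is 2, so after two byte copies both pointers sit on a word boundary and the copy runs at aligned speed.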
« Last Edit: September 02, 2008, 08:41:23 PM by Raphael » Logged

Don't push the river, it flows.
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki
http://www.homebrew-illuminati.co.uk - serious homebrew development for all platforms
Alexander Berl
"A good mod is a combination playground monitor, priest, big brother/sister, psychiatrist, professor and more."
Raphael
« Reply #14 on: September 02, 2008, 08:26:19 PM »

And here's the full code of the test application I ran:

I opted to copy all the memcpy functions into the code and make them non-inlined, as otherwise the tests would have given unrealistic results (I tried calling memcpy from my implementation for cases < 16 bytes and then got better results than memcpy itself).
I also cleared the D- and I-cache before each implementation's run, to avoid the (small) advantage of the cache kicking in, which the first run (libc) wouldn't get.
I also chose to output the information as percentages to make the differences clearer. Those time span numbers are just too hard to compare.

PS: It was compiled with GCC 4.1.0 using -O3.
« Last Edit: September 03, 2008, 07:12:54 AM by Raphael » Logged
