Noware
C/C++ Developer
Hero Member
Karma: +41/-2
Posts: 685
Avatar by: Jason Hise
« on: April 12, 2008, 03:35:59 PM »
Here is the source code of a free and fast memcpy (I forgot the stats, but on a PSP, if your setup is right, it's much faster than the default memcpy, which just copies 'N' bytes): http://www.vik.cc/daniel/portfolio/
To override the default memcpy you have to add --allow-multiple-definition to your linker flags and set up the link order so that the custom memcpy is linked before the default memcpy:
LDFLAGS += -Wl,--allow-multiple-definition
Cash
Reporter - What do you think of western civilization? Gandhi - I think it would be a good idea!

Flatmush
Has a normal user title
Administrator
Hero Member
Karma: +84/-26
Posts: 1046
The Omniscient One
« Reply #1 on: April 12, 2008, 04:55:58 PM »
I made a PSP-specific version of this long ago, look here: http://www.psp-programming.com/forums/index.php?topic=2731.0
Also, the PSP bus width is 256 bits, so the fastest method is to copy memory 256 bits at a time, provided it is correctly aligned (just expand the memcpy function there, it's not that hard).
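A minimal sketch of the "256 bits at a time when aligned" idea, in portable C. The names block256 and copy256 are made up for illustration and are not taken from funcLib or the linked thread. Whether the compiler really emits 256-bit transfers for the struct assignment depends on the target and flags, so treat this as an outline of the control flow only.

```c
#include <stddef.h>
#include <stdint.h>

/* 32 bytes = 256 bits, the block size mentioned above. */
typedef struct { uint32_t w[8]; } block256;

/* Hypothetical helper: copies whole 32-byte blocks when both pointers are
   32-byte aligned, and falls back to a byte loop otherwise. */
void *copy256(void *dest, const void *src, size_t count)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if ((((uintptr_t)d | (uintptr_t)s) & 31u) == 0) {
        while (count >= sizeof(block256)) {
            /* One struct assignment moves 32 bytes. */
            *(block256 *)d = *(const block256 *)s;
            d += sizeof(block256);
            s += sizeof(block256);
            count -= sizeof(block256);
        }
    }
    /* Tail bytes, or the whole buffer when the pointers are unaligned. */
    while (count--)
        *d++ = *s++;
    return dest;
}
```

The single alignment test up front is the only extra cost for aligned buffers; unaligned buffers pay for the test and then get the plain byte loop.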
Firmware History: 2.60 -> 2.71 -> 1.50 -> 3.03oe-c    Hehe I'm a "Hero Member" because I bought posts back when they were in the shop. Creator of FlatEditPSP, funcLib and flAstro

Noware
« Reply #2 on: April 13, 2008, 02:34:13 AM »
Hi Flatmush,
I'm not saying this is the fastest memcpy, especially when you know your data is aligned, etc., but for all other cases (like operator=, etc.) it's a good replacement for the default memcpy.
I'd also like to mention that copies of fewer than 8 bytes get more expensive because of all the if checks, so I advise people to check count/size first. A change that I made in Daniel Vik's code is, from:

    if (count < 8) {
        if (count >= 4 && ((((u32)src8 | (u32)dst8)) & 3) == 0) {
            *((u32 *)dst8) = *((u32 *)src8);
            dst8 += 4;
            src8 += 4;
            count -= 4;
        }
        START_VAL(dst8);
        START_VAL(src8);
        while (count--) {
            INC_VAL(dst8) = INC_VAL(src8);
        }
        return dest;
    }

to:

    if (count < 8) {
        START_VAL(dst8);
        START_VAL(src8);
        while (count--) {
            INC_VAL(dst8) = INC_VAL(src8);
        }
        return dest;
    }

I think the same is true for Flatmush's memCopy.
Cash
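The "check count/size first" advice can be sketched as a thin wrapper, assuming the 8-byte threshold from the post. The name copy_checked and the constant SMALL_COPY_LIMIT are hypothetical; the right threshold on a given target would have to be measured.

```c
#include <stddef.h>
#include <string.h>

#define SMALL_COPY_LIMIT 8  /* threshold taken from the discussion above */

/* Hypothetical wrapper: tiny copies use a plain byte loop and skip the
   alignment checks of the optimized routine; everything else is forwarded. */
void *copy_checked(void *dest, const void *src, size_t count)
{
    if (count < SMALL_COPY_LIMIT) {
        unsigned char *d = dest;
        const unsigned char *s = src;
        while (count--)
            *d++ = *s++;
        return dest;
    }
    return memcpy(dest, src, count);
}
```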
« Last Edit: April 13, 2008, 03:26:09 AM by Cash »

Flatmush
« Reply #3 on: April 13, 2008, 04:16:32 AM »
Yeah, but that adds yet more checks for longer copies, and memcpy should only be used for relatively large copies. If you're using memcpy for less than 8 bytes you probably shouldn't be. Also remember that reading memory is much slower than any other operation, so in comparison the checks are a very small overhead.
Oh, and for the free memory function: apparently the meminfo way isn't 100% accurate. I forget the reason, but if you check ps2dev there should be a topic about it there. The free memory function I pasted is the fastest method for getting free memory that you can get by only using mallocs and frees. Besides, knowing free memory is usually more of a debug feature; I've never known a reason to need a high-performance free memory function. The best way, if you want it to be fast, is to keep track of mallocs and frees, which is easy enough.
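As a rough illustration of "keep track of mallocs and frees", here is a small sketch that stores each block's size in a header in front of the payload. All names (tracked_malloc, tracked_free, tracked_bytes_in_use) are hypothetical, and it only counts bytes in use; free memory would be the known heap size minus that figure, which is an assumption about how the heap is set up.

```c
#include <stddef.h>
#include <stdlib.h>

/* Running total of payload bytes handed out and not yet freed. */
static size_t g_bytes_in_use = 0;

/* Header placed in front of every block; the union keeps the payload
   suitably aligned for any type. */
typedef union {
    size_t size;
    max_align_t align;
} alloc_header;

void *tracked_malloc(size_t size)
{
    alloc_header *h = malloc(sizeof(alloc_header) + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    g_bytes_in_use += size;
    return h + 1;               /* payload starts right after the header */
}

void tracked_free(void *ptr)
{
    if (ptr == NULL)
        return;
    alloc_header *h = (alloc_header *)ptr - 1;
    g_bytes_in_use -= h->size;
    free(h);
}

size_t tracked_bytes_in_use(void)
{
    return g_bytes_in_use;
}
```

Reading the counter is a single load, so the query itself costs next to nothing; the price is one header per allocation.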
« Last Edit: April 13, 2008, 04:29:46 AM by Flatmush »

Noware
« Reply #4 on: April 13, 2008, 08:34:58 AM »
Hi Flatmush,
Quote: "Yeah but that adds yet more checks for longer copies and memcpy should only be used for relatively large copies. If you're using memcpy for less than 8 bytes you probably shouldn't be"
But the default operator= will also use memcpy for small structures / classes.
Quote: "Oh and for the free memory function. Apparently the meminfo way isn't 100% accurate, I forget the reason but if you check ps2dev then there should be a topic about it there. That free memory function I pasted is the fastest method for getting free memory that you can get by only using mallocs and frees. Besides knowing free memory is usually more of a debug feature, I've never known a reason to need a high performance free memory. The best way if you want it to be fast is to keep track of mallocs and frees which is easy enough."
OK, you are right. Luckily I use my own memory manager (malloc, free, new, delete, etc. also go through it), so getting the exact amount of free memory is in my case totally free.
Cash
« Last Edit: April 13, 2008, 08:41:07 AM by Cash »

Flatmush
« Reply #5 on: April 13, 2008, 08:43:58 AM »
Quote: "but the default operator= will use memcpy also for small structures / classes."
Dunno where you heard that, but it's totally untrue. I did some benchmarks using the equals operator on a 256-bit (32-byte) structure in my memCopy and it was much faster than memcpy. If memcpy were used, my function would have been much slower, with performance through the floor, but it wasn't. If you download funcLib and look at some of the memory alignment and memory speed benchmarking programs included, you can see that. Actually it may be used in C++ for classes (not structures), but that's one of the many reasons that I don't use C++. Still, with optimizations enabled, I can't see why a good compiler would use memcpy rather than pure assembly.
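For anyone who wants to check this on their own toolchain, a small sketch with two functions that do the same 32-byte copy, one by struct assignment and one through memcpy. The type vec256 and both function names are made up for illustration. Whether either one compiles to inline loads and stores or to a library call depends on the compiler, target and flags, so the generated assembly (for example from gcc -S) is the thing to look at.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-byte (256-bit) plain struct. */
typedef struct { uint32_t w[8]; } vec256;

/* Copy by struct assignment (what "operator=" means for a plain struct). */
void assign_copy(vec256 *dst, const vec256 *src)
{
    *dst = *src;
}

/* The same copy written as an explicit memcpy call. */
void memcpy_copy(vec256 *dst, const vec256 *src)
{
    memcpy(dst, src, sizeof *dst);
}
```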

Noware
« Reply #6 on: April 13, 2008, 09:32:18 AM »
Sorry, I was talking about sizes of less than 8 bytes. I truly believe that your function is faster than the default memcpy; it looks faster anyway... Also, memcpy is used by other libs, and sometimes they don't even know the number of bytes they have to copy.
Quote: "Actually it may be used in C++ for classes (not structures)"
Classes and structures are the same thing in C++, except that the members of a class are private by default and the members of a structure are public by default.
Cash

Noware
« Reply #7 on: September 01, 2008, 02:13:10 PM »
Hi,
I know this is an old topic, but I think this is nice to know. Finally I did what Flatmush asked me to do: profile different kinds of memcpy. To my surprise, things have changed a lot since 2 years ago (note I never tested Flatmush's memcpy before). When I tested this 2 years ago, Daniel Vik's memcpy was the fastest overall, even for small blocks of memory.
First the result: GCC has the fastest memcpy for small blocks of memory; Daniel Vik's memcpy (see http://www.vik.cc/daniel/portfolio) is the fastest for large aligned blocks of memory but the slowest for small blocks of memory; and finally (sorry) Flatmush's memcpy (see http://www.psp-programming.com/forums/index.php/topic,2731.0.html) is overall the slowest.
My test was done with GCC 4.3.1 with -O5 added to the compiler flags, and counts the number of ticks for copying X bytes 1000 times. Here is the profile code, so you can do the test yourself (or find bugs in my code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <psprtc.h>     /* sceRtcGetCurrentTick */
    #include <psppower.h>   /* scePowerTick */

    /* i64/u64 are 64-bit integer typedefs; MemCopyDanielVik and MemCopyFlatmush
       are the two implementations under test and must be declared elsewhere. */

    i64 GetCurrentTick()
    {
        u64 tick;
        sceRtcGetCurrentTick(&tick);
        return (i64)tick;
    }

    void ProfileMemCopy()
    {
        unsigned char* block1 = (unsigned char*)malloc(65536 + 32);
        unsigned char* block2 = (unsigned char*)malloc(65536 + 32);

        /* Times 1000 copies of n bytes with each implementation and prints one row. */
    #define COPY(d, s, n, a) { \
            int gcc_elapsed = 0; \
            { \
                u64 time = GetCurrentTick(); \
                for (int j=0; j<1000; ++j) \
                    memcpy(d, s, n); \
                gcc_elapsed = (int)(GetCurrentTick()-time); \
            } \
            int danielvik_elapsed = 0; \
            { \
                u64 time = GetCurrentTick(); \
                for (int j=0; j<1000; ++j) \
                    MemCopyDanielVik(d, s, n); \
                danielvik_elapsed = (int)(GetCurrentTick()-time); \
            } \
            int flatmush_elapsed = 0; \
            { \
                u64 time = GetCurrentTick(); \
                for (int j=0; j<1000; ++j) \
                    MemCopyFlatmush(d, s, n); \
                flatmush_elapsed = (int)(GetCurrentTick()-time); \
            } \
            scePowerTick(0); \
            printf("%6d bytes (%20s), %12d, %12d, %12d\n", (int)n, a, gcc_elapsed, danielvik_elapsed, flatmush_elapsed); \
        }

        /* Runs COPY for every power-of-two size from 1 to 65536 bytes. */
    #define COPY_LOOP(d, s, a) \
            COPY(d, s, 1, a) \
            COPY(d, s, 2, a) \
            COPY(d, s, 4, a) \
            COPY(d, s, 8, a) \
            COPY(d, s, 16, a) \
            COPY(d, s, 32, a) \
            COPY(d, s, 64, a) \
            COPY(d, s, 128, a) \
            COPY(d, s, 256, a) \
            COPY(d, s, 512, a) \
            COPY(d, s, 1024, a) \
            COPY(d, s, 2048, a) \
            COPY(d, s, 4096, a) \
            COPY(d, s, 8192, a) \
            COPY(d, s, 16384, a) \
            COPY(d, s, 32768, a) \
            COPY(d, s, 65536, a)

        /* Note: the first macro argument is the destination, so the rows
           labelled "source + N" offset the destination pointer and the rows
           labelled "dest + N" offset the source pointer. */
        COPY_LOOP(block1, block2, "4 bytes aligned");
        COPY_LOOP(&block1[1], block2, "source + 1");
        COPY_LOOP(&block1[2], block2, "source + 2");
        COPY_LOOP(&block1[3], block2, "source + 3");
        COPY_LOOP(block1, &block2[1], "dest + 1");
        COPY_LOOP(block1, &block2[2], "dest + 2");
        COPY_LOOP(block1, &block2[3], "dest + 3");
        COPY_LOOP(&block1[1], &block2[3], "source + 1, dest + 3");
        COPY_LOOP(&block1[2], &block2[2], "source + 2, dest + 2");
        COPY_LOOP(&block1[3], &block2[1], "source + 3, dest + 1");

        // TODO Add unaligned copy tests

        free(block1);
        free(block2);

    #undef COPY
    #undef COPY_LOOP
    }
Here are the results (ticks per 1000 copies):

bytes (src / dst alignment), GCC, Daniel Vik, Flatmush

1 bytes (4 bytes aligned), 30, 157, 90
2 bytes (4 bytes aligned), 47, 193, 148
4 bytes (4 bytes aligned), 44, 266, 102
8 bytes (4 bytes aligned), 80, 388, 174
16 bytes (4 bytes aligned), 148, 470, 262
32 bytes (4 bytes aligned), 293, 493, 424
64 bytes (4 bytes aligned), 727, 657, 750
128 bytes (4 bytes aligned), 1434, 981, 1401
256 bytes (4 bytes aligned), 2847, 1768, 2742
512 bytes (4 bytes aligned), 5716, 2936, 5440
1024 bytes (4 bytes aligned), 11460, 5594, 10528
2048 bytes (4 bytes aligned), 22820, 10861, 21069
4096 bytes (4 bytes aligned), 45630, 21384, 42103
8192 bytes (4 bytes aligned), 93212, 43205, 90628
16384 bytes (4 bytes aligned), 364419, 274365, 346937
32768 bytes (4 bytes aligned), 679297, 548399, 692750
65536 bytes (4 bytes aligned), 1358228, 1096055, 1384869

1 bytes (source + 1), 30, 157, 39
2 bytes (source + 1), 48, 193, 184
4 bytes (source + 1), 44, 297, 275
8 bytes (source + 1), 79, 492, 429
16 bytes (source + 1), 148, 601, 755
32 bytes (source + 1), 293, 819, 1407
64 bytes (source + 1), 728, 1219, 2757
128 bytes (source + 1), 1434, 1720, 5355
256 bytes (source + 1), 2887, 2916, 10663
512 bytes (source + 1), 5675, 5346, 21167
1024 bytes (source + 1), 11505, 10086, 42093
2048 bytes (source + 1), 22819, 19855, 84032
4096 bytes (source + 1), 45656, 39048, 168130
8192 bytes (source + 1), 94151, 78420, 355967
16384 bytes (source + 1), 353287, 328482, 861445
32768 bytes (source + 1), 649488, 656018, 1721655
65536 bytes (source + 1), 1298463, 1311346, 3442265

1 bytes (source + 2), 30, 157, 61
2 bytes (source + 2), 48, 193, 129
4 bytes (source + 2), 44, 297, 202
8 bytes (source + 2), 80, 484, 275
16 bytes (source + 2), 148, 593, 437
32 bytes (source + 2), 293, 810, 764
64 bytes (source + 2), 728, 1113, 1415
128 bytes (source + 2), 1435, 1711, 2756
256 bytes (source + 2), 2849, 2945, 5326
512 bytes (source + 2), 5718, 5296, 10666
1024 bytes (source + 2), 11510, 10185, 21066
2048 bytes (source + 2), 22926, 19739, 42020
4096 bytes (source + 2), 45692, 39006, 84206
8192 bytes (source + 2), 93976, 78564, 179612
16384 bytes (source + 2), 353292, 328411, 512551
32768 bytes (source + 2), 649345, 655968, 1023611
65536 bytes (source + 2), 1298321, 1311197, 2046832

1 bytes (source + 3), 30, 157, 39
2 bytes (source + 3), 48, 193, 184
4 bytes (source + 3), 44, 297, 276
8 bytes (source + 3), 80, 434, 430
16 bytes (source + 3), 148, 542, 756
32 bytes (source + 3), 293, 760, 1407
64 bytes (source + 3), 728, 1207, 2711
128 bytes (source + 3), 1435, 1661, 5356
256 bytes (source + 3), 2848, 2896, 10666
512 bytes (source + 3), 5715, 5245, 21174
1024 bytes (source + 3), 11459, 10076, 42101
2048 bytes (source + 3), 22817, 19689, 84039
4096 bytes (source + 3), 45610, 39066, 168157
8192 bytes (source + 3), 94141, 78505, 357734
16384 bytes (source + 3), 353236, 328402, 865486
32768 bytes (source + 3), 649373, 656020, 1729987
65536 bytes (source + 3), 1298300, 1311287, 3459011

1 bytes (dest + 1), 30, 157, 89
2 bytes (dest + 1), 48, 194, 235
4 bytes (dest + 1), 44, 297, 316
8 bytes (dest + 1), 80, 434, 473
16 bytes (dest + 1), 148, 543, 796
32 bytes (dest + 1), 293, 638, 1448
64 bytes (dest + 1), 728, 938, 2751
128 bytes (dest + 1), 1436, 1535, 5476
256 bytes (dest + 1), 2848, 2767, 10697
512 bytes (dest + 1), 5719, 5254, 21091
1024 bytes (dest + 1), 11516, 9901, 42133
2048 bytes (dest + 1), 22886, 19596, 84074
4096 bytes (dest + 1), 45716, 38819, 168219
8192 bytes (dest + 1), 94515, 79535, 354940
16384 bytes (dest + 1), 351116, 340543, 858143
32768 bytes (dest + 1), 647884, 679938, 1714881
65536 bytes (dest + 1), 1294249, 1358843, 3427877

1 bytes (dest + 2), 31, 157, 89
2 bytes (dest + 2), 49, 198, 153
4 bytes (dest + 2), 44, 302, 225
8 bytes (dest + 2), 80, 435, 297
16 bytes (dest + 2), 148, 543, 461
32 bytes (dest + 2), 293, 675, 786
64 bytes (dest + 2), 728, 938, 1439
128 bytes (dest + 2), 1521, 1582, 2775
256 bytes (dest + 2), 2848, 2730, 5521
512 bytes (dest + 2), 5717, 5119, 10692
1024 bytes (dest + 2), 11508, 9999, 21089
2048 bytes (dest + 2), 22928, 19564, 42058
4096 bytes (dest + 2), 45589, 38942, 84165
8192 bytes (dest + 2), 94340, 79512, 178785
16384 bytes (dest + 2), 351121, 340380, 509739
32768 bytes (dest + 2), 647927, 679893, 1017893
65536 bytes (dest + 2), 1294358, 1358738, 2034686

1 bytes (dest + 3), 31, 161, 90
2 bytes (dest + 3), 48, 198, 297
4 bytes (dest + 3), 44, 302, 316
8 bytes (dest + 3), 80, 434, 472
16 bytes (dest + 3), 148, 543, 796
32 bytes (dest + 3), 293, 638, 1448
64 bytes (dest + 3), 728, 939, 2751
128 bytes (dest + 3), 1435, 1670, 5394
256 bytes (dest + 3), 2848, 2767, 10700
512 bytes (dest + 3), 5846, 5120, 21109
1024 bytes (dest + 3), 11511, 10003, 42057
2048 bytes (dest + 3), 22940, 19565, 84059
4096 bytes (dest + 3), 45711, 38831, 168201
8192 bytes (dest + 3), 94529, 79514, 353219
16384 bytes (dest + 3), 351147, 340521, 853679
32768 bytes (dest + 3), 647869, 679923, 1705541
65536 bytes (dest + 3), 1294483, 1358795, 3410402

1 bytes (source + 1, dest + 3), 31, 161, 39
2 bytes (source + 1, dest + 3), 49, 197, 194
4 bytes (source + 1, dest + 3), 44, 301, 275
8 bytes (source + 1, dest + 3), 80, 584, 430
16 bytes (source + 1, dest + 3), 149, 606, 756
32 bytes (source + 1, dest + 3), 293, 823, 1407
64 bytes (source + 1, dest + 3), 728, 1215, 2794
128 bytes (source + 1, dest + 3), 1435, 1725, 5355
256 bytes (source + 1, dest + 3), 2849, 2959, 10659
512 bytes (source + 1, dest + 3), 5719, 5310, 21179
1024 bytes (source + 1, dest + 3), 11377, 10226, 42070
2048 bytes (source + 1, dest + 3), 22830, 19751, 84022
4096 bytes (source + 1, dest + 3), 45703, 39015, 168137
8192 bytes (source + 1, dest + 3), 95091, 79727, 353672
16384 bytes (source + 1, dest + 3), 373440, 340565, 858436
32768 bytes (source + 1, dest + 3), 727850, 679973, 1714968
65536 bytes (source + 1, dest + 3), 1454780, 1358868, 3429155

1 bytes (source + 2, dest + 2), 32, 156, 62
2 bytes (source + 2, dest + 2), 48, 193, 131
4 bytes (source + 2, dest + 2), 45, 297, 202
8 bytes (source + 2, dest + 2), 80, 446, 276
16 bytes (source + 2, dest + 2), 148, 529, 438
32 bytes (source + 2, dest + 2), 329, 691, 764
64 bytes (source + 2, dest + 2), 728, 859, 1416
128 bytes (source + 2, dest + 2), 1435, 1185, 2720
256 bytes (source + 2, dest + 2), 2937, 1885, 5362
512 bytes (source + 2, dest + 2), 5676, 3307, 10542
1024 bytes (source + 2, dest + 2), 11505, 5749, 21169
2048 bytes (source + 2, dest + 2), 22823, 11076, 42027
4096 bytes (source + 2, dest + 2), 45695, 21529, 84138
8192 bytes (source + 2, dest + 2), 95245, 44542, 179694
16384 bytes (source + 2, dest + 2), 373649, 275084, 512546
32768 bytes (source + 2, dest + 2), 727856, 548990, 1023607
65536 bytes (source + 2, dest + 2), 1454913, 1096913, 2046149

1 bytes (source + 3, dest + 1), 30, 156, 39
2 bytes (source + 3, dest + 1), 47, 193, 195
4 bytes (source + 3, dest + 1), 43, 297, 276
8 bytes (source + 3, dest + 1), 79, 438, 429
16 bytes (source + 3, dest + 1), 148, 547, 755
32 bytes (source + 3, dest + 1), 293, 765, 1503
64 bytes (source + 3, dest + 1), 728, 1068, 2711
128 bytes (source + 3, dest + 1), 1435, 1666, 5434
256 bytes (source + 3, dest + 1), 2849, 2861, 10696
512 bytes (source + 3, dest + 1), 5803, 5300, 21062
1024 bytes (source + 3, dest + 1), 11510, 10135, 42012
2048 bytes (source + 3, dest + 1), 22932, 19695, 84023
4096 bytes (source + 3, dest + 1), 45700, 38957, 168135
8192 bytes (source + 3, dest + 1), 95248, 79564, 357012
16384 bytes (source + 3, dest + 1), 373549, 328704, 869061
32768 bytes (source + 3, dest + 1), 727810, 656416, 1737751
65536 bytes (source + 3, dest + 1), 1454995, 1311726, 3474254
Noware
« Last Edit: September 01, 2008, 02:32:01 PM by Noware »

Flatmush
« Reply #8 on: September 02, 2008, 02:02:25 AM »
My memcpy is optimized for the PSP and is 3 times faster than the newlib memcpy for the PSP. GCC doesn't have a memcpy; that's part of newlib. Also, because the PSP uses a 256-bit wide bus, my memcpy works best for memory aligned to 256 bits, which is 32-byte aligned. It works out faster if you make sure malloc aligns to 32 bytes rather than the default 16.
Edit: My memcopy from My Memory Findings isn't the complete thing, it's just a sample. Check out the memCopy in funcLib and you should find it a little more complex but a lot faster. There are still ways to speed it up though: first, it needs doing in asm; second, it needs to do unaligned copies by aligning and rotational shifting, but I got bored and decided it wasn't worth the effort.
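One common way to get 32-byte aligned blocks out of an ordinary malloc is to over-allocate and round the pointer up, keeping the original pointer just in front of the aligned block. The sketch below assumes nothing PSP-specific; aligned_malloc32 and aligned_free32 are hypothetical names, not funcLib functions. Where the C library provides C11 aligned_alloc, that is the simpler route.

```c
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 32u   /* 256 bits */

/* Hypothetical helper: returns a 32-byte aligned block of at least `size`
   bytes, or NULL on failure. Must be released with aligned_free32. */
void *aligned_malloc32(size_t size)
{
    /* Room for the payload, worst-case padding, and the saved raw pointer. */
    void *raw = malloc(size + ALIGNMENT - 1 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + ALIGNMENT - 1) & ~(uintptr_t)(ALIGNMENT - 1);

    /* Remember the pointer malloc returned so it can be freed later. */
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

void aligned_free32(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```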

Noware
« Reply #9 on: September 02, 2008, 09:04:35 AM »
Hi Flatmush,
I'm testing normal memcpys (as a replacement for the default memcpy from newlib), not special cases.
Quote: "Edit: My memcopy from My Memory Findings isn't the complete thing, its just a sample. Checkout the memCopy in funcLib and you should find it a little more complex but a lot faster."
OK, where can I find funcLib? [EDIT] Found it, I will post the results later.
Quote: "There are still ways to speed it up though, first it needs doing in asm, second it needs to do unaligned copies by aligning and rotational shifting but I got bored and decided it wasn't worth the effort."
Yep, that is what Daniel Vik's memcpy is doing (without the asm part). Also, I forgot to mention that this test is compiled as C++.
Noware
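A rough sketch of the "align and shift" technique for an unaligned source, written from the general idea rather than from Daniel Vik's actual code. It assumes a little-endian CPU and a 4-byte aligned destination; the function name copy_words_shifted is hypothetical. The first and last words are gathered byte by byte so that nothing outside the source range is read.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies `words` 32-bit words from a possibly unaligned `src` to an aligned
   `dst` using only aligned word loads in the main loop.
   Little-endian only. */
void copy_words_shifted(uint32_t *dst, const unsigned char *src, size_t words)
{
    unsigned k = (unsigned)((uintptr_t)src & 3u);  /* source misalignment */

    if (words == 0)
        return;
    if (k == 0) {                                  /* already aligned */
        const uint32_t *sw = (const uint32_t *)src;
        for (size_t i = 0; i < words; ++i)
            dst[i] = sw[i];
        return;
    }

    const uint32_t *sw = (const uint32_t *)(src - k); /* aligned word holding src[0] */
    unsigned lo = 8u * k;       /* bits to drop from the current word */
    unsigned hi = 32u - lo;     /* bits to take from the next word */

    /* First aligned word: only its upper 4-k bytes belong to the source. */
    uint32_t cur = 0;
    for (unsigned i = k; i < 4; ++i)
        cur |= (uint32_t)src[i - k] << (8u * i);

    /* Main loop: merge the tail of one aligned word with the head of the next. */
    for (size_t n = 0; n + 1 < words; ++n) {
        uint32_t next = sw[n + 1];
        dst[n] = (cur >> lo) | (next << hi);
        cur = next;
    }

    /* Last word: the final k source bytes are gathered one at a time. */
    uint32_t tail = 0;
    const unsigned char *t = src + 4 * words - k;
    for (unsigned i = 0; i < k; ++i)
        tail |= (uint32_t)t[i] << (8u * i);
    dst[words - 1] = (cur >> lo) | (tail << hi);
}
```

The point of the technique is that the inner loop does one aligned load, two shifts and one aligned store per word instead of four byte loads and stores.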
« Last Edit: September 02, 2008, 09:14:34 AM by Noware »

Noware
« Reply #10 on: September 02, 2008, 09:46:17 AM »
Hi Flatmush,
Here are my results with your memcpy from funcLib_1.0.1. Note that I also changed the base memory alignment to 32 bytes.
[EDIT] To make it a fair test, I just put your memcpy into a C file and retested it, but got the same results.
As you can see, your memcpy is the fastest when copying 64 to 16384 bytes (32 bytes aligned).

bytes (src / dst alignment), newlib, Daniel Vik, Flatmush

1 bytes (4 bytes aligned), 30, 157, 212
2 bytes (4 bytes aligned), 47, 290, 234
4 bytes (4 bytes aligned), 43, 267, 212
8 bytes (4 bytes aligned), 79, 389, 211
16 bytes (4 bytes aligned), 147, 471, 236
32 bytes (4 bytes aligned), 292, 494, 293
64 bytes (4 bytes aligned), 636, 657, 456
128 bytes (4 bytes aligned), 1433, 984, 782
256 bytes (4 bytes aligned), 2877, 1694, 1433
512 bytes (4 bytes aligned), 5804, 2943, 2775
1024 bytes (4 bytes aligned), 11319, 5596, 5488
2048 bytes (4 bytes aligned), 19822, 10822, 10700
4096 bytes (4 bytes aligned), 45485, 21447, 21250
8192 bytes (4 bytes aligned), 93736, 44658, 44284
16384 bytes (4 bytes aligned), 349527, 253868, 253600
32768 bytes (4 bytes aligned), 648711, 506118, 506146
65536 bytes (4 bytes aligned), 1297196, 1010467, 1011059
But in all other cases it's the slowest, see below:

bytes (src / dst alignment), newlib, Daniel Vik, Flatmush

1 bytes (source + 1), 29, 158, 239
2 bytes (source + 1), 47, 228, 248
4 bytes (source + 1), 43, 297, 320
8 bytes (source + 1), 79, 493, 466
16 bytes (source + 1), 146, 602, 755
32 bytes (source + 1), 292, 820, 1335
64 bytes (source + 1), 636, 1125, 2492
128 bytes (source + 1), 1433, 1724, 4847
256 bytes (source + 1), 2845, 2920, 9483
512 bytes (source + 1), 5670, 5346, 18848
1024 bytes (source + 1), 11417, 10135, 37489
2048 bytes (source + 1), 19820, 19852, 74832
4096 bytes (source + 1), 45468, 39291, 149658
8192 bytes (source + 1), 94298, 79722, 301802
16384 bytes (source + 1), 378065, 317656, 798925
32768 bytes (source + 1), 729723, 633559, 1597221
65536 bytes (source + 1), 1459384, 1265585, 3193823

1 bytes (source + 2), 29, 158, 212
2 bytes (source + 2), 47, 194, 253
4 bytes (source + 2), 43, 297, 272
8 bytes (source + 2), 79, 485, 429
16 bytes (source + 2), 147, 595, 488
32 bytes (source + 2), 293, 812, 778
64 bytes (source + 2), 636, 1116, 1492
128 bytes (source + 2), 1471, 1713, 2516
256 bytes (source + 2), 2845, 2912, 5003
512 bytes (source + 2), 5669, 5303, 9651
1024 bytes (source + 2), 11431, 10126, 18877
2048 bytes (source + 2), 19818, 19839, 37588
4096 bytes (source + 2), 45608, 39147, 75010
8192 bytes (source + 2), 94125, 79873, 152044
16384 bytes (source + 2), 378044, 317580, 501413
32768 bytes (source + 2), 729640, 633534, 1001932
65536 bytes (source + 2), 1459355, 1265304, 2002851

1 bytes (source + 3), 29, 158, 257
2 bytes (source + 3), 48, 194, 248
4 bytes (source + 3), 43, 298, 321
8 bytes (source + 3), 79, 435, 465
16 bytes (source + 3), 147, 543, 755
32 bytes (source + 3), 292, 761, 1334
64 bytes (source + 3), 636, 1065, 2493
128 bytes (source + 3), 1467, 1665, 4945
256 bytes (source + 3), 2845, 2860, 9612
512 bytes (source + 3), 5759, 5252, 18890
1024 bytes (source + 3), 11318, 10075, 37590
2048 bytes (source + 3), 19820, 19785, 74736
4096 bytes (source + 3), 45604, 39089, 149630
8192 bytes (source + 3), 94301, 79818, 301671
16384 bytes (source + 3), 377978, 317542, 781481
32768 bytes (source + 3), 729747, 633426, 1561260
65536 bytes (source + 3), 1459342, 1265321, 3120640

1 bytes (dest + 1), 29, 157, 299
2 bytes (dest + 1), 48, 193, 276
4 bytes (dest + 1), 43, 332, 366
8 bytes (dest + 1), 79, 435, 593
16 bytes (dest + 1), 148, 544, 837
32 bytes (dest + 1), 292, 640, 1430
64 bytes (dest + 1), 726, 938, 2591
128 bytes (dest + 1), 1432, 1596, 4939
256 bytes (dest + 1), 2845, 2733, 9707
512 bytes (dest + 1), 5669, 5126, 18979
1024 bytes (dest + 1), 11419, 9949, 37593
2048 bytes (dest + 1), 22716, 19672, 74942
4096 bytes (dest + 1), 45598, 39019, 149760
8192 bytes (dest + 1), 93764, 79481, 301910
16384 bytes (dest + 1), 347007, 317950, 759472
32768 bytes (dest + 1), 647689, 634842, 1517598
65536 bytes (dest + 1), 1294498, 1268458, 3034157

1 bytes (dest + 2), 29, 158, 212
2 bytes (dest + 2), 47, 193, 234
4 bytes (dest + 2), 42, 299, 289
8 bytes (dest + 2), 79, 431, 380
16 bytes (dest + 2), 234, 540, 543
32 bytes (dest + 2), 292, 635, 846
64 bytes (dest + 2), 726, 935, 1426
128 bytes (dest + 2), 1469, 1533, 2585
256 bytes (dest + 2), 2845, 2728, 5075
512 bytes (dest + 2), 5669, 5158, 9671
1024 bytes (dest + 2), 11415, 10094, 18913
2048 bytes (dest + 2), 22725, 19666, 37637
4096 bytes (dest + 2), 45590, 38953, 75089
8192 bytes (dest + 2), 93694, 79646, 152015
16384 bytes (dest + 2), 347007, 317867, 458607
32768 bytes (dest + 2), 647709, 634797, 914786
65536 bytes (dest + 2), 1294658, 1268432, 1826607

1 bytes (dest + 3), 29, 162, 216
2 bytes (dest + 3), 48, 198, 281
4 bytes (dest + 3), 43, 303, 370
8 bytes (dest + 3), 79, 435, 533
16 bytes (dest + 3), 147, 544, 841
32 bytes (dest + 3), 292, 639, 1519
64 bytes (dest + 3), 727, 938, 2593
128 bytes (dest + 3), 1433, 1679, 4910
256 bytes (dest + 3), 2846, 2732, 9712
512 bytes (dest + 3), 5669, 5307, 18909
1024 bytes (dest + 3), 11415, 10089, 37560
2048 bytes (dest + 3), 22806, 19668, 74831
4096 bytes (dest + 3), 45597, 38966, 149732
8192 bytes (dest + 3), 93851, 79650, 301744
16384 bytes (dest + 3), 346999, 317993, 757269
32768 bytes (dest + 3), 647716, 634870, 1512931
65536 bytes (dest + 3), 1294466, 1268462, 3024698

1 bytes (source + 1, dest + 3), 29, 162, 243
2 bytes (source + 1, dest + 3), 47, 198, 339
4 bytes (source + 1, dest + 3), 42, 303, 325
8 bytes (source + 1, dest + 3), 79, 498, 469
16 bytes (source + 1, dest + 3), 148, 607, 844
32 bytes (source + 1, dest + 3), 292, 824, 1338
64 bytes (source + 1, dest + 3), 727, 1131, 2497
128 bytes (source + 1, dest + 3), 1434, 1728, 4979
256 bytes (source + 1, dest + 3), 2846, 2924, 9617
512 bytes (source + 1, dest + 3), 5709, 5316, 18942
1024 bytes (source + 1, dest + 3), 11419, 10289, 37456
2048 bytes (source + 1, dest + 3), 22764, 19848, 74742
4096 bytes (source + 1, dest + 3), 45608, 39152, 149662
8192 bytes (source + 1, dest + 3), 94150, 79911, 301699
16384 bytes (source + 1, dest + 3), 348000, 318247, 758704
32768 bytes (source + 1, dest + 3), 648196, 635062, 1516280
65536 bytes (source + 1, dest + 3), 1295451, 1268595, 3031313

1 bytes (source + 2, dest + 2), 30, 159, 212
2 bytes (source + 2, dest + 2), 47, 194, 252
4 bytes (source + 2, dest + 2), 43, 299, 271
8 bytes (source + 2, dest + 2), 79, 448, 344
16 bytes (source + 2, dest + 2), 146, 529, 489
32 bytes (source + 2, dest + 2), 341, 693, 872
64 bytes (source + 2, dest + 2), 726, 861, 1357
128 bytes (source + 2, dest + 2), 1433, 1188, 2516
256 bytes (source + 2, dest + 2), 2845, 1840, 4872
512 bytes (source + 2, dest + 2), 5670, 3287, 9502
1024 bytes (source + 2, dest + 2), 11417, 5795, 18876
2048 bytes (source + 2, dest + 2), 22714, 11175, 37479
4096 bytes (source + 2, dest + 2), 45525, 21650, 74965
8192 bytes (source + 2, dest + 2), 94131, 44842, 151922
16384 bytes (source + 2, dest + 2), 347943, 253951, 458896
32768 bytes (source + 2, dest + 2), 648055, 506185, 915010
65536 bytes (source + 2, dest + 2), 1295397, 1010665, 1826864

1 bytes (source + 3, dest + 1), 30, 157, 257
2 bytes (source + 3, dest + 1), 47, 194, 248
4 bytes (source + 3, dest + 1), 43, 298, 320
8 bytes (source + 3, dest + 1), 79, 502, 466
16 bytes (source + 3, dest + 1), 147, 548, 754
32 bytes (source + 3, dest + 1), 293, 766, 1335
64 bytes (source + 3, dest + 1), 776, 1071, 2588
128 bytes (source + 3, dest + 1), 1433, 1668, 4847
256 bytes (source + 3, dest + 1), 2896, 2864, 9566
512 bytes (source + 3, dest + 1), 5762, 5257, 18888
1024 bytes (source + 3, dest + 1), 11318, 10078, 37579
2048 bytes (source + 3, dest + 1), 22716, 19901, 74738
4096 bytes (source + 3, dest + 1), 45595, 39092, 149664
8192 bytes (source + 3, dest + 1), 93982, 79739, 301811
16384 bytes (source + 3, dest + 1), 348194, 317505, 790584
32768 bytes (source + 3, dest + 1), 648093, 633424, 1580938
65536 bytes (source + 3, dest + 1), 1295211, 1265323, 3161605
Noware
« Last Edit: September 02, 2008, 10:08:30 AM by Noware »

Flatmush
« Reply #11 on: September 02, 2008, 10:02:43 AM »
Yeah, that's as expected. I don't really use memcopy for small amounts of memory or really care about its performance in those cases, and I make a point of always using aligned memory.
There are still a few possible improvements to it, but as Raphael said in the thread, memcpy isn't often called at points where performance is required.
Still, this is tempting me to have another attempt at a fast memcpy. I shall possibly post back with a better version, if I get around to it.
When unaligned, my memcopy currently just gives up and does a bytewise copy, the same as the original newlib implementation did, so it was better in all cases.

Noware
« Reply #12 on: September 02, 2008, 10:30:54 AM »
Hi Flatmush,
All true, that's why at my work we have all kinds of memcpy for special cases. Also, you can remove the checks in your memcpy, making it faster again.
Noware

Raphael
Global Moderator
Hero Member
Karma: +230/-10
Posts: 1431
« Reply #13 on: September 02, 2008, 08:24:54 PM »
I took the opportunity and changed your bench code a little bit and tested it against Daniel's, Flatmush's and my own implementation of a memcpy. Here are the results:

size (alignment), libc, daniel, flatmush, raphael

    1 bytes (4 bytes aligned), 100.0%,  91.0%,  69.8%, 101.5%
    2 bytes (4 bytes aligned), 100.0%,  92.7%,  77.0%, 101.2%
    4 bytes (4 bytes aligned), 100.0%, 135.9%, 111.8%, 100.4%
    8 bytes (4 bytes aligned), 100.0%, 125.5%, 170.2%, 150.6%
   16 bytes (4 bytes aligned), 100.0%,  63.6%, 102.2%,  84.9%
   32 bytes (4 bytes aligned), 100.0%,  87.9%, 116.7%,  89.4%
   64 bytes (4 bytes aligned), 100.0%,  96.3%, 116.5%, 188.3%
  128 bytes (4 bytes aligned), 100.0%, 104.1%, 116.6%, 243.1%
  256 bytes (4 bytes aligned), 100.0%, 112.5%, 119.7%, 312.1%
  512 bytes (4 bytes aligned), 100.0%, 111.3%, 116.6%, 350.7%
 1024 bytes (4 bytes aligned), 100.0%, 112.3%, 117.3%, 381.5%
 2048 bytes (4 bytes aligned), 100.0%, 116.2%, 118.5%, 390.1%
 4096 bytes (4 bytes aligned), 100.0%, 116.1%, 116.1%, 412.4%
 8192 bytes (4 bytes aligned), 100.0%, 116.6%, 117.3%, 416.0%
16384 bytes (4 bytes aligned), 100.0%, 103.9%, 104.0%, 1103.8%
32768 bytes (4 bytes aligned), 100.0%, 103.8%, 104.0%, 182.6%
65536 bytes (4 bytes aligned), 100.0%, 103.8%, 104.0%, 182.6%

    1 bytes (source + 1), 100.0%,  92.4%,  62.4%, 100.8%
    2 bytes (source + 1), 100.0%,  93.2%,  72.9%, 101.2%
    4 bytes (source + 1), 100.0%,  83.8%,  75.9%, 100.4%
    8 bytes (source + 1), 100.0%,  85.7%,  79.7%,  84.9%
   16 bytes (source + 1), 100.0%, 118.0%,  79.1%, 106.4%
   32 bytes (source + 1), 100.0%, 144.2%,  85.6%, 124.1%
   64 bytes (source + 1), 100.0%, 200.1%,  86.5%, 138.8%
  128 bytes (source + 1), 100.0%, 268.4%,  91.3%, 242.1%
  256 bytes (source + 1), 100.0%, 283.3%,  87.1%, 348.2%
  512 bytes (source + 1), 100.0%, 309.5%,  87.5%, 470.0%
 1024 bytes (source + 1), 100.0%, 324.5%,  87.8%, 559.7%
 2048 bytes (source + 1), 100.0%, 333.0%,  87.6%, 616.9%
 4096 bytes (source + 1), 100.0%, 335.0%,  87.5%, 664.2%
 8192 bytes (source + 1), 100.0%, 332.3%,  87.6%, 686.5%
16384 bytes (source + 1), 100.0%, 227.6%,  90.5%, 939.7%
32768 bytes (source + 1), 100.0%, 227.8%,  90.4%, 448.6%
65536 bytes (source + 1), 100.0%, 228.0%,  90.4%, 450.2%

    1 bytes (source + 2), 100.0%,  91.7%,  69.5%, 100.8%
    2 bytes (source + 2), 100.0%,  93.2%,  69.8%, 101.2%
    4 bytes (source + 2), 100.0%,  83.5%,  91.2%, 100.4%
    8 bytes (source + 2), 100.0%,  89.6%, 110.6%,  89.2%
   16 bytes (source + 2), 100.0%, 121.7%, 131.9%, 109.6%
   32 bytes (source + 2), 100.0%, 155.3%, 148.5%, 126.5%
   64 bytes (source + 2), 100.0%, 211.2%, 159.8%, 137.3%
  128 bytes (source + 2), 100.0%, 238.5%, 164.7%, 233.8%
  256 bytes (source + 2), 100.0%, 300.1%, 175.0%, 357.4%
  512 bytes (source + 2), 100.0%, 310.4%, 174.7%, 464.9%
 1024 bytes (source + 2), 100.0%, 325.5%, 174.6%, 562.4%
 2048 bytes (source + 2), 100.0%, 333.1%, 174.9%, 616.7%
 4096 bytes (source + 2), 100.0%, 334.9%, 174.9%, 665.2%
 8192 bytes (source + 2), 100.0%, 332.5%, 174.1%, 686.0%
16384 bytes (source + 2), 100.0%, 227.1%, 143.8%, 935.4%
32768 bytes (source + 2), 100.0%, 227.6%, 143.9%, 448.1%
65536 bytes (source + 2), 100.0%, 227.8%, 144.0%, 449.7%

    1 bytes (source + 3), 100.0%,  91.0%,  55.2%, 101.5%
    2 bytes (source + 3), 100.0%,  92.6%,  72.1%, 100.6%
    4 bytes (source + 3), 100.0%,  83.4%,  75.8%, 100.4%
    8 bytes (source + 3), 100.0%,  95.7%,  79.7%,  93.2%
   16 bytes (source + 3), 100.0%, 129.1%,  84.0%, 113.4%
   32 bytes (source + 3), 100.0%, 161.8%,  85.5%, 129.4%
   64 bytes (source + 3), 100.0%, 217.2%,  86.5%, 142.9%
  128 bytes (source + 3), 100.0%, 263.1%,  84.8%, 229.4%
  256 bytes (source + 3), 100.0%, 297.3%,  86.4%, 353.0%
  512 bytes (source + 3), 100.0%, 319.2%,  87.8%, 470.7%
 1024 bytes (source + 3), 100.0%, 324.6%,  87.7%, 550.9%
 2048 bytes (source + 3), 100.0%, 333.5%,  87.7%, 616.9%
 4096 bytes (source + 3), 100.0%, 336.6%,  87.6%, 666.5%
 8192 bytes (source + 3), 100.0%, 332.7%,  87.6%, 688.3%
16384 bytes (source + 3), 100.0%, 223.9%,  90.9%, 924.1%
32768 bytes (source + 3), 100.0%, 224.2%,  90.9%, 441.3%
65536 bytes (source + 3), 100.0%, 224.4%,  90.9%, 443.0%

    1 bytes (dest + 1), 100.0%,  90.4%,  69.5%, 101.5%
    2 bytes (dest + 1), 100.0%,  92.7%,  62.4%, 101.2%
    4 bytes (dest + 1), 100.0%,  83.2%,  61.2%, 100.9%
    8 bytes (dest + 1), 100.0%,  98.3%,  63.8%, 113.1%
   16 bytes (dest + 1), 100.0%, 131.6%,  70.8%, 129.7%
   32 bytes (dest + 1), 100.0%, 204.6%,  76.1%, 140.0%
   64 bytes (dest + 1), 100.0%, 252.0%,  81.1%, 460.6%
  128 bytes (dest + 1), 100.0%, 263.7%,  83.5%, 555.5%
  256 bytes (dest + 1), 100.0%, 297.3%,  85.0%, 622.5%
  512 bytes (dest + 1), 100.0%, 328.0%,  86.4%, 670.4%
 1024 bytes (dest + 1), 100.0%, 334.7%,  87.4%, 675.0%
 2048 bytes (dest + 1), 100.0%, 335.9%,  87.5%, 694.8%
 4096 bytes (dest + 1), 100.0%, 337.9%,  87.5%, 706.0%
 8192 bytes (dest + 1), 100.0%, 334.3%,  87.6%, 713.5%
16384 bytes (dest + 1), 100.0%, 215.1%,  90.0%, 901.0%
32768 bytes (dest + 1), 100.0%, 215.2%,  90.0%, 439.6%
65536 bytes (dest + 1), 100.0%, 215.3%,  90.0%, 440.1%

    1 bytes (dest + 2), 100.0%,  94.5%,  72.5%, 105.4%
    2 bytes (dest + 2), 100.0%,  94.9%,  78.9%, 103.7%
    4 bytes (dest + 2), 100.0%,  85.2%,  81.1%,  87.2%
    8 bytes (dest + 2), 100.0%, 105.0%,  91.1%, 114.7%
   16 bytes (dest + 2), 100.0%, 137.8%, 102.6%, 130.3%
   32 bytes (dest + 2), 100.0%, 211.8%, 127.5%, 140.7%
   64 bytes (dest + 2), 100.0%, 246.2%, 146.3%, 460.6%
  128 bytes (dest + 2), 100.0%, 300.8%, 161.9%, 573.9%
  256 bytes (dest + 2), 100.0%, 320.1%, 170.1%, 597.2%
  512 bytes (dest + 2), 100.0%, 329.0%, 169.5%, 670.3%
 1024 bytes (dest + 2), 100.0%, 335.3%, 173.2%, 676.1%
 2048 bytes (dest + 2), 100.0%, 336.2%, 174.2%, 695.0%
 4096 bytes (dest + 2), 100.0%, 338.1%, 174.4%, 705.7%
 8192 bytes (dest + 2), 100.0%, 335.8%, 174.4%, 713.2%
16384 bytes (dest + 2), 100.0%, 215.1%, 148.7%, 910.1%
32768 bytes (dest + 2), 100.0%, 215.2%, 148.8%, 439.7%
65536 bytes (dest + 2), 100.0%, 215.3%, 148.8%, 440.1%

    1 bytes (dest + 3), 100.0%, 106.9%,  82.0%, 117.4%
    2 bytes (dest + 3), 100.0%, 105.7%,  70.7%, 114.1%
    4 bytes (dest + 3), 100.0%,  92.3%,  60.0%, 111.1%
    8 bytes (dest + 3), 100.0%, 107.7%,  68.1%, 120.1%
   16 bytes (dest + 3), 100.0%, 139.3%,  70.2%, 134.0%
   32 bytes (dest + 3), 100.0%, 222.9%,  81.7%, 150.6%
   64 bytes (dest + 3), 100.0%, 236.8%,  81.2%, 474.2%
  128 bytes (dest + 3), 100.0%, 293.8%,  83.0%, 564.1%
  256 bytes (dest + 3), 100.0%, 298.2%,  86.4%, 626.6%
  512 bytes (dest + 3), 100.0%, 328.9%,  86.5%, 671.1%
 1024 bytes (dest + 3), 100.0%, 333.2%,  87.3%, 675.5%
 2048 bytes (dest + 3), 100.0%, 336.3%,  87.4%, 703.8%
 4096 bytes (dest + 3), 100.0%, 336.9%,  87.5%, 705.9%
 8192 bytes (dest + 3), 100.0%, 336.8%,  87.8%, 715.4%
16384 bytes (dest + 3), 100.0%, 215.2%,  90.2%, 908.4%
32768 bytes (dest + 3), 100.0%, 215.2%,  90.2%, 439.6%
65536 bytes (dest + 3), 100.0%, 215.3%,  90.2%, 440.0%

    1 bytes (source + 1, dest + 3), 100.0%, 105.4%,  72.8%, 119.2%
    2 bytes (source + 1, dest + 3), 100.0%, 105.1%,  82.3%, 114.8%
    4 bytes (source + 1, dest + 3), 100.0%,  91.9%,  83.6%, 110.6%
    8 bytes (source + 1, dest + 3), 100.0%,  93.3%,  84.9%,  90.6%
   16 bytes (source + 1, dest + 3), 100.0%, 124.3%,  87.2%, 110.2%
   32 bytes (source + 1, dest + 3), 100.0%, 156.6%,  87.3%, 126.5%
   64 bytes (source + 1, dest + 3), 100.0%, 211.4%,  86.1%, 140.3%
  128 bytes (source + 1, dest + 3), 100.0%, 264.4%,  89.4%, 237.3%
  256 bytes (source + 1, dest + 3), 100.0%, 298.7%,  88.8%, 336.4%
  512 bytes (source + 1, dest + 3), 100.0%, 317.3%,  87.5%, 465.3%
 1024 bytes (source + 1, dest + 3), 100.0%, 325.2%,  87.9%, 552.9%
 2048 bytes (source + 1, dest + 3), 100.0%, 333.2%,  87.6%, 623.5%
 4096 bytes (source + 1, dest + 3), 100.0%, 336.4%,  87.6%, 665.3%
 8192 bytes (source + 1, dest + 3), 100.0%, 329.7%,  87.7%, 691.2%
16384 bytes (source + 1, dest + 3), 100.0%, 214.6%,  90.1%, 875.6%
32768 bytes (source + 1, dest + 3), 100.0%, 214.9%,  90.0%, 430.0%
65536 bytes (source + 1, dest + 3), 100.0%, 215.2%,  90.0%, 431.5%

    1 bytes (source + 2, dest + 2), 100.0%,  93.8%,  71.6%, 104.6%
    2 bytes (source + 2, dest + 2), 100.0%,  94.9%,  71.2%, 103.1%
    4 bytes (source + 2, dest + 2), 100.0%,  85.2%,  92.8%, 102.2%
    8 bytes (source + 2, dest + 2), 100.0%, 103.5%, 111.2%,  93.0%
   16 bytes (source + 2, dest + 2), 100.0%, 145.3%, 132.9%, 133.2%
   32 bytes (source + 2, dest + 2), 100.0%, 191.2%, 142.2%, 204.3%
   64 bytes (source + 2, dest + 2), 100.0%, 287.5%, 160.3%, 140.1%
  128 bytes (source + 2, dest + 2), 100.0%, 389.1%, 158.7%, 278.3%
  256 bytes (source + 2, dest + 2), 100.0%, 482.3%, 172.8%, 472.9%
  512 bytes (source + 2, dest + 2), 100.0%, 525.3%, 175.0%, 810.9%
 1024 bytes (source + 2, dest + 2), 100.0%, 581.8%, 173.8%, 1153.0%
 2048 bytes (source + 2, dest + 2), 100.0%, 596.6%, 174.9%, 1522.6%
 4096 bytes (source + 2, dest + 2), 100.0%, 610.8%, 174.8%, 1813.8%
 8192 bytes (source + 2, dest + 2), 100.0%, 583.6%, 173.6%, 1964.9%
16384 bytes (source + 2, dest + 2), 100.0%, 268.8%, 148.6%, 2414.6%
32768 bytes (source + 2, dest + 2), 100.0%, 269.3%, 148.8%, 473.2%
65536 bytes (source + 2, dest + 2), 100.0%, 269.6%, 148.8%, 474.7%

    1 bytes (source + 3, dest + 1), 100.0%,  92.4%,  56.6%, 102.3%
    2 bytes (source + 3, dest + 1), 100.0%,  92.7%,  72.6%, 101.9%
    4 bytes (source + 3, dest + 1), 100.0%,  84.1%,  58.3%, 101.3%
    8 bytes (source + 3, dest + 1), 100.0%, 101.1%,  66.2%,  93.2%
   16 bytes (source + 3, dest + 1), 100.0%, 134.2%,  84.0%, 113.2%
   32 bytes (source + 3, dest + 1), 100.0%, 187.6%,  96.6%, 145.7%
   64 bytes (source + 3, dest + 1), 100.0%, 238.8%,  93.4%, 154.3%
  128 bytes (source + 3, dest + 1), 100.0%, 276.8%,  90.5%, 246.0%
  256 bytes (source + 3, dest + 1), 100.0%, 297.7%,  88.9%, 358.2%
  512 bytes (source + 3, dest + 1), 100.0%, 312.9%,  87.6%, 470.7%
 1024 bytes (source + 3, dest + 1), 100.0%, 328.8%,  87.4%, 561.5%
 2048 bytes (source + 3, dest + 1), 100.0%, 333.8%,  87.6%, 623.5%
 4096 bytes (source + 3, dest + 1), 100.0%, 336.8%,  87.6%, 664.9%
 8192 bytes (source + 3, dest + 1), 100.0%, 329.6%,  87.7%, 689.7%
16384 bytes (source + 3, dest + 1), 100.0%, 227.0%,  91.3%, 924.8%
32768 bytes (source + 3, dest + 1), 100.0%, 227.5%,  91.3%, 448.0%
65536 bytes (source + 3, dest + 1), 100.0%, 227.7%,  91.2%, 449.4%
The percentage is how fast each implementation is compared to libc (which is why libc always shows 100%). Higher is faster. I used memalign(64) to allocate the test buffers, which works in favor of Flatmush's implementation. My own implementation runs at the same speed with raw mallocs (aligned to 16 bytes).

What I noticed:
- The weakness of Flatmush's implementation, as already stated, is small copies and especially unaligned ones, as he didn't really care about those. It does all-around better than libc on aligned copies though.
- Daniel's does especially well with unaligned copies. It also has a special case for 4-byte aligned copies, which makes it stand out in that single case.
- Both (same as libc) behave badly when they start to reach the dcache size limit (16 KB), where my implementation still gets the full boost of the cache (hence the incredible peak of ~1000+% at 16 KB copies). This is because the others share the dcache for reads and writes.
- My implementation's weak point is source-unaligned copies in the range of 8-128 bytes, as I just didn't build in enough special cases for that. It's still mostly better than libc and Flatmush's there though.
- Measurements are inaccurate by ~1% and are sometimes even quite a bit more off (I had one run where it would measure 150%+ for Daniel's and my implementation for <= 4 byte aligned copies), which I can't exactly explain yet.
- I also can't fully explain the drop in performance for 16 and 32 byte aligned copies, as the same code as for 8 bytes kicks in. On libc's side, its special code kicks in at 16 bytes, so that at least accounts for *a* drop, but I still don't quite see why I'm getting only ~80% where it should at least be close to 100% or in the mid 90s.

Explanation of my implementation:
I use vfpu copies for anything >= 64 bytes, bypassing the cpu write cache and making use of the vfpu write cache instead. This allows the whole 16 KB dcache to be used for src and hence explains the speed-up up to this size.
For smaller sizes, I either fall back to a raw byte-by-byte copy (< 8 bytes) or do a dst align if needed and then do 32-bit writes as far as possible.
The biggest performance gain, for 2/2 unaligned 16 KB copies, comes from the fact that libc doesn't handle that case by just copying 2 bytes and then doing aligned copies, but falls back to raw byte copies, while my implementation does the alignment correction. Hence, the 2/2 unaligned copies are actually as fast as the aligned copies; libc just behaves worse, so I (and Daniel) get much better performance values there.
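To make the small-size strategy above more concrete, here is a minimal portable sketch of it. This is not the actual implementation from the benchmark: the name sketch_memcpy is made up for illustration, the vfpu path for copies >= 64 bytes is left out entirely (it is PSP specific), and the fallback for a source that stays unaligned after the dst align is a plain byte loop, which is an assumption and probably simpler than what a tuned version would do.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name, for illustration only. Covers the non-vfpu part:
 * byte copy below 8 bytes, otherwise align dst and use 32-bit writes. */
void *sketch_memcpy(void *dst, const void *src, size_t count)
{
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;

    /* Tiny copies: the alignment checks would cost more than they save. */
    if (count < 8) {
        while (count--)
            *d++ = *s++;
        return dst;
    }

    /* dst align: copy at most 3 bytes until dst sits on a 4-byte boundary.
     * count >= 8 here, so it cannot underflow. */
    while (((uintptr_t)d & 3) != 0) {
        *d++ = *s++;
        count--;
    }

    /* If src ended up 4-byte aligned as well (e.g. the "source + 2,
     * dest + 2" case), move whole 32-bit words. Like the snippet earlier
     * in the thread this type-puns through word pointers; build with
     * -fno-strict-aliasing if your compiler is picky about that. */
    if (((uintptr_t)s & 3) == 0) {
        uint32_t *d32 = (uint32_t *)d;
        const uint32_t *s32 = (const uint32_t *)s;
        while (count >= 4) {
            *d32++ = *s32++;
            count -= 4;
        }
        d = (uint8_t *)d32;
        s = (const uint8_t *)s32;
    }

    /* Remaining tail, or the whole rest when src is still unaligned
     * (assumed fallback, not taken from the original code). */
    while (count--)
        *d++ = *s++;

    return dst;
}
```

With source and destination both off by 2, the dst align loop copies 2 bytes and everything after that goes through the word loop, which is the alignment correction described for the 2/2 case.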
« Last Edit: September 02, 2008, 08:41:23 PM by Raphael »
Raphael
Global Moderator
Hero Member
Karma: +230/-10
Offline
Posts: 1431
« Reply #14 on: September 02, 2008, 08:26:19 PM » |
And here's the full code of the test application I ran:
I opted to copy all memcpy functions into the code and make them non-inlined, as otherwise the tests would have produced unrealistic results (I tried calling memcpy from my implementation for cases < 16 bytes and then got better results than memcpy itself). I also cleared the D- and I-cache before each implementation's run, to avoid the (small) advantage of a warm cache that the libc run wouldn't get. I also chose to output the information as percentages to make the differences clearer. Those time span numbers are just too hard to compare.
PS: It was compiled with GCC 4.1.0 using -O3
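For readers who want to try something similar, here is a minimal sketch of the measurement idea described above. It is not the test application itself. The names copy_fn, libc_copy, time_copy and relative_speed_percent are made up, clock() stands in for whatever timer was really used, and the formula (libc time divided by candidate time) is an assumption that merely matches "libc is always 100%, higher is faster". Clearing the caches between runs is PSP specific and left out here.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Every implementation is called through the same function pointer type,
 * so none of them gets inlined into the timing loop. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t count);

/* Wrapper so libc's memcpy goes through the same call path as the rest. */
void *libc_copy(void *dst, const void *src, size_t count)
{
    return memcpy(dst, src, count);
}

/* Run `iterations` copies of `count` bytes and return the elapsed
 * clock() ticks. A real run would also clear the caches before this. */
double time_copy(copy_fn fn, void *dst, const void *src,
                 size_t count, unsigned iterations)
{
    clock_t start = clock();
    for (unsigned i = 0; i < iterations; i++)
        fn(dst, src, count);
    return (double)(clock() - start);
}

/* Assumed formula: 100% means as fast as the baseline, 200% means the
 * candidate needed half the time. Returns 0 if nothing was measured. */
double relative_speed_percent(double baseline_ticks, double candidate_ticks)
{
    if (candidate_ticks <= 0.0)
        return 0.0;
    return 100.0 * baseline_ticks / candidate_ticks;
}
```

A caller would time libc_copy and a candidate with the same buffers, size and iteration count, then pass both results to relative_speed_percent. Very short runs can come back as 0 ticks with clock(), so the iteration count needs to be large enough for small sizes.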
« Last Edit: September 03, 2008, 07:12:54 AM by Raphael »