Reputation: 10380
I recently had an interview question where I had to implement memcpy. I've used memcpy plenty in my experience so it didn't seem like a tough problem.
So, I started implementing a loop to copy one address at a time from pointer to pointer, something like this:
void memcpy(void* dest, void* src, int size) {
    char* cdest = (char*)dest;  // void* can't be indexed, so cast to char*
    char* csrc = (char*)src;
    for (int index = 0; index < size; index++) {
        cdest[index] = csrc[index];
    }
}
However, the interviewers interrupted, noting that the man page for memcpy says it "copies n bytes from src to dest" (which I confirmed later). They then wanted me to iterate size/4 times instead, and pick up the remaining bytes with another loop of index < size%4 (I guess assuming it was a 32-bit system?).
Well, this seemed strange, seeing as how I've used memcpy for years with no problem and without ever having to give it a *4 modifier. When I got home I fired up gdb and copied a small string "hello", being careful to pass the size both via strlen() and as constants to see where the copy starts and stops.
char* src = "hello";
char* dest = calloc(16, sizeof(char));
int len = strlen(src);
memcpy(dest, src, len); // both my version and official version
Now I carefully examined src and dest with gdb which both contained "hello\0".
So my question is: what am I not understanding about the number 4 (or "size in bytes")? And why does the documentation say "n bytes" when that doesn't seem to match the behavior I observed? What am I not seeing clearly here?
Upvotes: 10
Views: 19658
Reputation: 4152
Check this out:
void myMemCpy(void *dest, void *src, size_t n)
{
    // Typecast src and dest addresses to (char *)
    char *csrc = (char *)src;
    char *cdest = (char *)dest;

    // Copy contents of src[] to dest[], one byte at a time
    for (size_t i = 0; i < n; i++)
        cdest[i] = csrc[i];
}
Upvotes: 0
Reputation: 213809
The interviewers asked you to perform premature optimization, for one reason or another. This is usually a bad idea.
It is true that a 32-bit machine will copy one 32-bit chunk faster than it will copy 4x1 bytes. But there is more to optimization than that.
There is a good chance that a 32-bit machine puts your data in cache memory, and then suddenly fast memory access might matter far more than CPU instructions. Cache memories can have various alignment requirements: they may favor a plain byte loop, or they may favor 32-bit aligned chunks. I'm not an expert on the subject, so I avoid premature optimization and leave it to the compiler, which hopefully knows more about cache memory than I do.
Then there is CPU branch prediction and instruction pipelining. This particular code is fairly deterministic, so this might not be an issue. But as a rule of thumb: simple code yields more effective branch prediction than complex code.
Furthermore, there is the division, which is slow on many CPU architectures. Depending on the amount of data to copy, the divisions may make the memcpy slower rather than faster.
To sum it up: manual optimization is quite complex and requires in-depth knowledge of the CPU and hardware. You cannot and should not "optimize for a 32-bit CPU" in the abstract; you need to know the specifics. In most cases the compiler will optimize the code far more effectively than you will. The library memcpy() in particular is often written in inline assembler, optimized for the specific target.
Upvotes: 1
Reputation: 596156
As others have said, copying 4 bytes at a time is faster than copying 1 byte at a time. The interviewer wanted you to do something like this:
void memcpy(void* dest, void* src, int size)
{
    uint8_t *pdest = (uint8_t*) dest;
    uint8_t *psrc = (uint8_t*) src;

    // Copy one 32-bit word per iteration
    // (note: this assumes the platform tolerates unaligned word accesses)
    int loops = (size / sizeof(uint32_t));
    for (int index = 0; index < loops; ++index)
    {
        *((uint32_t*)pdest) = *((uint32_t*)psrc);
        pdest += sizeof(uint32_t);
        psrc += sizeof(uint32_t);
    }

    // Copy the remaining 0-3 bytes one at a time
    loops = (size % sizeof(uint32_t));
    for (int index = 0; index < loops; ++index)
    {
        *pdest = *psrc;
        ++pdest;
        ++psrc;
    }
}
Upvotes: 19
Reputation: 5537
The interviewer was testing your knowledge of computer architecture, and wanted you to optimize your algorithm. Memory is accessed in words rather than individual bytes. On a 32-bit system a word is typically 4 bytes, and it takes roughly the same amount of time to read/write 1 byte as it does to read/write 1 word. The second loop handles the case where the number of bytes you want to copy is not exactly divisible by 4.
What you actually want is 3 loops: one for the leading bytes from dest up to the first word boundary, one for the trailing bytes before dest+size after the last word boundary (these two handle the cases where either end is not word-aligned), and a third loop for all the full words in between.
You can actually optimize even more by taking advantage of architecture specific instructions. Check out this article if you are interested: http://www.eetimes.com/design/embedded/4024961/Optimizing-Memcpy-improves-speed
Upvotes: 0
Reputation: 17329
The logic for your memcpy is correct and your interviewer didn't ask you to change it or add a restriction. Copying 4 bytes at a time is faster, but becomes a problem if your size is not a multiple of 4. Hence your interviewer told you to use two loops: the first copies 4 bytes at a time, and the 2nd loop one byte at a time (it will iterate at most 3 times).
So the bulk of the copy is done with the fast 4-byte copy, but you're not restricted to size being a multiple of 4 because the 2nd "cleanup" loop will copy anything that's not a multiple of 4.
1st loop: copy uint32_t and increment by 4
2nd loop: copy uint8_t and increment by 1
Upvotes: 1
Reputation: 4922
They wanted you to speed it up. A 32-bit processor can copy 32 bits in a single operation just as fast as it can copy 8 bits. So rather than copying 4 bytes one at a time, you can copy them all at once.
Upvotes: 0
Reputation: 361605
They were asking you to optimize your implementation and have it copy a 32-bit word at a time inside the loop vs. a byte at a time. This would necessitate some careful checking to handle the boundary cases, such as size not being a multiple of 4, or dest or src not being aligned on a 4-byte boundary.
Upvotes: 13