Well, modern CPUs have tricks (pipelining, superscalar and out-of-order execution) that let them do all sorts of things asynchronously, so an instruction can effectively cost less than a clock if it's simple enough for the instruction decoding to rearrange things. Memory references are far from simple, so try to prevent the compiler from spilling locals to the stack.
CPU caches have limited size and associativity, which means that if you start reading/writing some new location, the CPU has to evict some other cache line. Re-accessing a location that was evicted incurs a stall while it is repopulated from a higher-level cache (L2, L3). If that means it needs to be re-read from system RAM, then expect a large stall. If you have two CPUs/GPUs trying to write the same memory region, you'll find that they constantly purge that region from each other's caches (the alternative is worse...).
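To make the "two CPUs writing the same region" case concrete, here's a minimal sketch of the usual workaround: give each writer its own cache line. The struct names and the 64-byte line size are my assumptions (64 is typical for current x86 parts, but check your target), and it needs C11 for alignas.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* ASSUMPTION: 64-byte cache lines. Check your target CPU. */
#define CACHE_LINE 64

/* Both counters sit in the same cache line, so two threads that each
   update one of them keep invalidating the other's copy of the line. */
struct counters_shared {
    long a;
    long b;
};

/* Each counter is aligned to its own cache line, so writes to a and b
   no longer fight over the same line (at the cost of wasted space). */
struct counters_padded {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};

static_assert(sizeof(struct counters_padded) == 2 * CACHE_LINE,
              "each counter should occupy a full cache line");
```

Whether this actually helps depends on how often the threads really write; measure before and after.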
Regarding structs/unions, I wouldn't worry about the offset. If it's on the stack then esp+(offset) costs the same as esp+(offset+2). Globals don't need the esp part, of course. Either way you no longer need a separate register to hold the pointer's value, so there are more registers for your actual maths.
Remember that x86 only has 8 general-purpose registers (amd64 raises that to 16), and many of them are reserved for specific purposes. So if you need to hold many variables at a time, you'll end up spilling variables to the stack, and now each reference to those variables will need load+store operations too. Constants can often be embedded in the instructions themselves, which helps reduce cache misses etc.
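A rough sketch of the load+store point, with made-up function names: the first version updates memory on every iteration, the second keeps the running total in a local whose address is never taken, so the compiler is free to hold it in a register. A good optimizer may well turn both into the same code, so treat this as an illustration rather than a guaranteed win.

```c
#include <stddef.h>

/* Accumulates through the pointer. Since out and data have the same
   element type, the compiler cannot rule out that they overlap, so it
   may have to load and store *out on every iteration. */
void sum_through_pointer(const long *data, size_t n, long *out)
{
    *out = 0;
    for (size_t i = 0; i < n; i++)
        *out += data[i];
}

/* Accumulates in a local. Nothing takes the address of total, so it
   can stay in a register; memory is written once at the end. */
void sum_in_local(const long *data, size_t n, long *out)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    *out = total;
}
```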
C99's restrict keyword (spelled __restrict as an extension in GCC, Clang and MSVC) might be useful to you, as it gives the compiler greater freedom by promising that writes through one pointer will not change the memory that other pointers refer to. This isn't normally an issue for locals if nothing ever needed their address, but your code decided to take the address of one, and now the compiler might need to read the pointer and THEN dereference it as two separate operations before it can get at the value.
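For what it's worth, a minimal sketch of restrict in use (the function names are just for illustration). The two functions do the same thing; the only difference is the promise made to the compiler. Note that restrict is a promise from you, not something the compiler checks: if the pointers do overlap, the behaviour is undefined.

```c
#include <stddef.h>

/* Without restrict, a write to dst[i] might change *scale, so the
   compiler has to assume it may need to reload *scale each time round. */
void scale_may_alias(float *dst, const float *scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] *= *scale;
}

/* With restrict, we promise dst and scale never overlap, so *scale can
   be read once and kept in a register for the whole loop. */
void scale_no_alias(float *restrict dst, const float *restrict scale,
                    size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] *= *scale;
}
```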
Or something.
If this stuff is important to you, you should really try and figure out how to use cachegrind:
http://valgrind.org/docs/manual/cg-manual.html
For raw clock costs, you should use something like gprof (which requires compiler instrumentation, e.g. gcc's -pg option). Expect different results on different CPUs (especially, but not just, with different instruction sets), or even from run to run (interrupts etc. will flush caches, which will affect the reported costs).