Because it's working with arrays of double, which are 8 bytes wide, and the optimized loop doesn't use any scaled indexing: the loop index is pre-scaled to the element size, so it steps by 8 bytes per element. Perhaps those SIMD instructions (SSE, presumably; MMX has no double-precision support) don't support addressing modes with index scaling (wild guess).
The unoptimized code increments the index by 1 up to 65535, and the memory accesses scale it. Well, not exactly. We see this:
This LEA, although the mnemonic means "load effective address", is not actually computing an effective address here. The base address is zero, so LEA is just being exploited to multiply RAX by 8 and put the result in RDX. RAX is then clobbered with the base address of an array, the scaled offset is added to it, and the sum is finally used to make the access.
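To make the difference concrete, here is a rough C sketch of the two loop shapes. It is not the code from the question: the function names `sum_scaled` and `sum_prescaled` are made up for illustration, and the summing loop body is an assumption. The first function counts elements and multiplies the index by 8 on every access, which is the job the LEA is doing above. The second keeps a byte offset that already steps by 8, which is roughly what the optimized loop does.

```c
#include <stddef.h>

/* Hypothetical illustration, not the original program. */

/* Unoptimized shape: i counts elements; every access scales it by 8. */
double sum_scaled(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* The multiply-by-8 that the LEA performs (index -> byte offset). */
        size_t byte_off = i * sizeof(double);
        /* Base address plus scaled offset, then the actual load. */
        s += *(const double *)((const char *)a + byte_off);
    }
    return s;
}

/* Optimized shape: the counter is already a byte offset, stepping by 8. */
double sum_prescaled(const double *a, size_t n)
{
    double s = 0.0;
    size_t end = n * sizeof(double);
    for (size_t off = 0; off != end; off += sizeof(double)) {
        /* No scaling needed here: off is used as-is. */
        s += *(const double *)((const char *)a + off);
    }
    return s;
}
```

Both functions compute the same result; only the place where the multiplication by the element size happens differs. A compiler is of course free to turn either form into the other.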