Sorry, I may have gotten a bit competitive. By no means am I saying your approach isn't valid; you still managed to make significant improvements, and your thought process is very well explained.
I think this is an excellent demonstration of how targeting algorithmic complexity is almost always the best (and thus first) place to begin looking for optimizations.
And I guess I did spoil the fun for the original poster...
Original post modified to hint at the potential, with the implementation left as an exercise for the reader (only this time I did the exercise, too).