Down For 8 Days: American Eagle’s Site Disaster
Written by Frank Hayes and Evan Schuman

In one of the longest site outages ever for a multi-billion-dollar retailer, Tuesday (July 27) saw the apparent end of more than a week of Web problems and days of an outright crashed site for Pittsburgh-based clothing chain American Eagle Outfitters, which outsources much of its Web operations to IBM. The site crashed last Monday (July 19) and stayed dark until Friday (July 23), then limped along with various parts not functioning until Tuesday afternoon (July 27).
The site’s problems, though, shed light on an interesting strategy. During the many days of complete Web site death, the $2.7 billion apparel chain’s mobile site was still up. But it apparently could not process purchases. Officials at American Eagle Outfitters, IBM and Usablenet—which handles the chain’s mobile site—wouldn’t comment on the mobile site’s functionality during the crash.
But this raises the question: Should retailers look to their mobile sites as emergency backups for their Web sites? Should pages indicating that a site is down automatically include a link to the site’s mobile version?
Mobile sites, of course, work just as well on desktop machines as they do on phones. And American Eagle Outfitters, which has the admirably short URL of ae.com, also maintains a mobile version of its site.
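As a rough illustration of the idea (not anything AE or IBM actually deploys), a "site is down" page that points visitors at the mobile version could be as simple as the following sketch. The mobile hostname and port here are placeholders, not real addresses.

```python
# Minimal sketch of a "site down" page that points visitors at the mobile
# site. The hostname below is a placeholder, not AE's real mobile URL.
from http.server import BaseHTTPRequestHandler, HTTPServer

MOBILE_URL = "https://m.example.com"  # hypothetical mobile-site address

MAINTENANCE_PAGE = f"""<html><body>
<h1>We're having technical difficulties</h1>
<p>Our full site is temporarily unavailable.
You can still browse and shop on our
<a href="{MOBILE_URL}">mobile site</a>.</p>
</body></html>"""

class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A 503 tells crawlers the outage is temporary; Retry-After is advisory.
        self.send_response(503)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Retry-After", "3600")
        self.end_headers()
        self.wfile.write(MAINTENANCE_PAGE.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("", 8080), MaintenanceHandler).serve_forever()
```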
Before we dive into that mobile-as-site-backup issue, let’s look at exactly what happened with American Eagle’s site. None of the players involved would get specific as to what was wrong with the site, other than to say that there was no upgrade going on at the time and that the site experienced “a hardware issue.”
A server failure almost certainly would not have caused this problem; redundant servers would likely have kicked in while the defective machine was replaced and its data restored from backup. That process would have taken a few hours, not almost eight days.
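For context, the failover that normally contains a single-server failure boils down to a health check that drops dead nodes from the rotation. Here is a bare-bones sketch of that loop, with invented server names; in practice this job is done by dedicated load balancers, not a script like this.

```python
# Rough sketch of the health-check/failover loop a load balancer runs so a
# single dead server simply drops out of rotation; server names are made up.
import socket
import time

SERVERS = [("web1.internal", 80), ("web2.internal", 80)]  # hypothetical pool

def is_healthy(host, port, timeout=2.0):
    """Return True if the server accepts a TCP connection within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def live_pool():
    """Only healthy servers receive traffic; a failed node is skipped."""
    return [srv for srv in SERVERS if is_healthy(*srv)]

if __name__ == "__main__":
    while True:
        print("routing traffic to:", live_pool())
        time.sleep(10)
```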
This delay suggests some sort of storage problem. Say the storage array begins to fail. OK, no problem, we’ll just find the bad drive and replace it. Whoops, looks like something has corrupted multiple drives. (That could happen if power gets flaky inside the array.) Now we have a catastrophic failure of the storage array. No problem, we’ll just fix the hardware and restore.
Whoops, new problem: Turns out this problem has been going on for a while. The last set of backups is corrupted. So is the set of backups before that. Sorting through to reconstruct good data is going to take time.
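The only real defense against that scenario is verifying each backup set when it is written, not when it is desperately needed. A minimal sketch of that kind of routine check, using hypothetical paths and file names, might look like this:

```python
# Minimal sketch (hypothetical paths) of verifying backup sets after they are
# written, so corrupted backups are discovered long before a restore is needed.
import hashlib
import json
from pathlib import Path

BACKUP_DIR = Path("/backups/nightly")      # hypothetical location
MANIFEST = BACKUP_DIR / "manifest.json"    # checksums recorded at write time

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_manifest() -> None:
    """Run right after the backup job: store a checksum for every file."""
    manifest = {p.name: sha256_of(p) for p in BACKUP_DIR.glob("*.dump")}
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def verify_manifest() -> list[str]:
    """Run on a schedule: return the names of any files that no longer match."""
    manifest = json.loads(MANIFEST.read_text())
    return [name for name, digest in manifest.items()
            if sha256_of(BACKUP_DIR / name) != digest]

if __name__ == "__main__":
    bad = verify_manifest()
    if bad:
        print("ALERT: corrupted backup sets:", bad)  # page someone now, not during a crisis
```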
Alternatively: All recent backup sets are toast. Maybe nobody was verifying that the data was actually being written. However, all the transactions are being logged. No problem, then: All it takes is a lot of time and special expertise to essentially rerun all the recent transactions (since the last good backup) into an empty database, merge the new stuff with the old stuff and then load it all back into the replacement hardware.
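That recovery path is conceptually straightforward, even if it is painfully slow in practice. A toy sketch of replaying logged transactions into an empty database, using SQLite and an invented log format rather than anything AE actually runs, would look roughly like this; the "special expertise" comes in when the rebuilt data has to be merged back with the last good backup.

```python
# Toy sketch of replaying logged transactions (since the last good backup)
# into an empty database; the log format and schema here are invented.
import json
import sqlite3

def replay(log_path: str, db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS orders
                    (order_id TEXT PRIMARY KEY, sku TEXT, qty INTEGER)""")
    with open(log_path) as log:
        for line in log:
            txn = json.loads(line)  # one logged transaction per line
            if txn["op"] == "insert":
                conn.execute(
                    "INSERT OR REPLACE INTO orders VALUES (?, ?, ?)",
                    (txn["order_id"], txn["sku"], txn["qty"]),
                )
            elif txn["op"] == "delete":
                conn.execute("DELETE FROM orders WHERE order_id = ?",
                             (txn["order_id"],))
    conn.commit()  # the rebuilt data then gets merged with the last good backup
    conn.close()

if __name__ == "__main__":
    replay("transactions.log", "rebuilt.db")
```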
By the way, it seems that American Eagle was recently searching for a “Manager – Business Continuity & Disaster Recovery”. The job was still an active posting on May 25 but has since been filled. Not a moment too soon, eh? (Thanks, Google cache!)
July 29th, 2010 at 4:26 am
Contingency planning is fraught with all sorts of pitfalls. The suggestion about running your mobile site on “mirrored versions of the key databases” sounds great, apart from the fact that, in AE’s case, the gradual corruption of the main site’s databases due to the array problem would also have been “mirrored” onto the mobile site.
You could handle bandwidth issues by locating in the same datacentre and sharing the main site’s bandwidth. But that leaves both sites vulnerable to a bandwidth outage or a datacentre failure (say, the power supply fails).
It reminds me of the phrase currently very popular with politicians (certainly over here in the UK): “it’s a problem of unintended consequences”.