
Oracle Backup Failure Major Factor In American Eagle 8-Day Crash

Written by Evan Schuman
July 30th, 2010

It seems a failure in an Oracle backup utility, and the failure of IBM hosting managers to detect it and to verify that a disaster recovery site was operational, were the key factors in turning a standard site outage at American Eagle Outfitters into an 8-day-long disaster, according to an IT source involved in the probe.

The initial problem was pretty much along the lines of what StorefrontBacktalk reported on Thursday (July 29), which was a series of server failures. But the problems with two of the biggest names in retail tech–IBM and Oracle–are what made this situation balloon into a nightmare.

“The storage drive went down at IBM hosting and, immediately after that, the secondary drive went down. Probably a one-in-a-million possibility, but it happened,” said an IT source involved in the probe. “Once replaced, they tried to do a restore, and backups would not restore with the Oracle backup utility. They had 400 gigabytes (of data) and they were only getting 1 gigabyte per hour restoring. They got it up to 5 gigabytes per hour, but the restores kept failing. I don’t know if there was data corruption or a faulty process.”
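To put those restore rates in perspective, a little arithmetic using the figures quoted above shows why the recovery stretched into days rather than hours. (A throwaway sketch; the 400GB and 1-to-5GB-per-hour numbers come from the source, everything else is illustration.)

    # Back-of-the-envelope restore-time math using the figures quoted above
    DATA_GB = 400  # total data to restore, per the source

    for rate_gb_per_hour in (1, 5):
        hours = DATA_GB / rate_gb_per_hour
        print(f"At {rate_gb_per_hour} GB/hour: {hours:.0f} hours (~{hours / 24:.1f} days)")

    # Output:
    # At 1 GB/hour: 400 hours (~16.7 days)
    # At 5 GB/hour: 80 hours (~3.3 days)

Even at the improved rate, a clean restore would have taken more than three days, and that is before counting the attempts that failed.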

Thus far, that’s pretty bad. It’s a statistically unlikely problem, but site management had insisted on state-of-the-art backup and restore packages, so there shouldn’t have been a huge problem, right? Not quite.

“The final straw was the disaster recovery site, which was not ready to go,” the source said. “They apparently could not get the active logs rolling in the disaster recovery site. I know they were supposed to have completed it with Oracle Data Guard, but apparently it must have fallen off the priority list in the past few months and it was not there when needed.”
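For what it’s worth, a standby that has quietly stopped applying redo is one of the easier failures to catch. Here is a minimal monitoring sketch, assuming a standard physical standby and Oracle’s stock V$DATAGUARD_STATS view; the connection string, credentials and 15-minute threshold are placeholders, not details from the American Eagle environment:

    # Hypothetical standby-lag check; run it on a schedule and page someone on failure.
    # Assumes the cx_Oracle driver and a physical standby exposing V$DATAGUARD_STATS.
    import cx_Oracle

    MAX_APPLY_LAG = "+00 00:15:00"  # 15 minutes, in the view's fixed '+DD HH:MM:SS' format

    conn = cx_Oracle.connect(user="monitor", password="********",
                             dsn="standby-host:1521/AEODR")
    cur = conn.cursor()
    cur.execute("SELECT name, value FROM v$dataguard_stats "
                "WHERE name IN ('transport lag', 'apply lag')")
    for name, value in cur:
        print(f"{name}: {value}")
        # Plain string comparison works only because the interval format is fixed width.
        if name == "apply lag" and value and value > MAX_APPLY_LAG:
            print("WARNING: standby redo apply is falling behind -- investigate now")

Nothing about that is exotic. The point the source keeps making is that nobody was looking.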

The source added that these situations–as bad as they are–are simply part of the risks of using managed service arrangements at hosting firms, as opposed to handling site management remotely–and with your own salaried people–at a colocation site.

Some IT problems are hard to assign blame for, such as a direct lightning strike that overpowers power management systems. But having a multi-billion-dollar E-Commerce site completely down for several days–and crippled, functionality-wise, for eight days–because of backups and a disaster recovery site that weren’t being maintained? That’s borderline criminal. Actually, that’s not fair. We shouldn’t have said borderline.

Consider this line: “I know they were supposed to have completed it with Oracle Data Guard, but apparently it must have fallen off the priority list in the past few months and it was not there when needed.” Fallen off the priority list in the past few months? IBM’s job is to protect huge E-Commerce sites. After the initial setup, there’s not much to do beyond monitor and make sure that backups happen and are functional.
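And “functional” is the operative word. Verifying that a backup can actually be read back is a routine, scriptable task. Here is a rough sketch of the kind of nightly check a hosting provider could run, assuming Oracle’s RMAN is the backup tool (the article says only “Oracle backup utility”); the wrapper and its assumptions are hypothetical:

    # Hypothetical nightly check: have RMAN read the most recent backup end to end.
    # "RESTORE DATABASE VALIDATE" verifies the backup is restorable without restoring it.
    import subprocess

    RMAN_COMMANDS = "RESTORE DATABASE VALIDATE;\nEXIT;\n"

    result = subprocess.run(["rman", "target", "/"],   # local instance, OS authentication
                            input=RMAN_COMMANDS, capture_output=True, text=True)

    if result.returncode != 0 or "RMAN-" in result.stdout or "ORA-" in result.stdout:
        print("Backup validation FAILED -- alert now, not on restore day")
    else:
        print("Backup validation passed")

That is roughly the level of effort involved in knowing, every single day, whether your backups will restore.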

IBM isn’t a low-cost vendor, so it will be interesting to see whether those hosting fees are justified. As our source put it: “I am sure there will be a big issue with IBM about getting payback.”

What lesson should CIOs and E-Commerce directors take from this incident? They are paying for backup and for a high-end vendor to make sure that backup is working. What more should be required? Does a vendor like IBM require babysitting, where staff is periodically dispatched to the server farm for a surprise inspection of backups?

Perhaps that should be part of an expanded Service-Level Agreement, but that SLA had better include a huge and immediate financial penalty if those inspections find anything naughty. If this American Eagle incident doesn’t get the attention of hosting firms, maybe those penalties will.
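For comparison, consider what a typical availability-credit clause works out to against an outage of this size. This is an illustration only; the 99.9% target and the credit tiers are invented for the example, not taken from any IBM or American Eagle contract:

    # Illustrative SLA service-credit math; the target and tiers are made up.
    MONTH_HOURS = 30 * 24
    DOWNTIME_HOURS = 8 * 24        # roughly eight days of a down or crippled site

    availability = 100 * (MONTH_HOURS - DOWNTIME_HOURS) / MONTH_HOURS
    print(f"Availability for the month: {availability:.1f}%")   # about 73.3%

    # Hypothetical tiered credits against the monthly hosting fee
    if availability >= 99.9:
        credit = 0.00
    elif availability >= 99.0:
        credit = 0.10
    elif availability >= 95.0:
        credit = 0.25
    else:
        credit = 1.00              # the whole month's fee refunded
    print(f"Service credit: {credit:.0%} of one month's hosting fee")

Even the most punitive tier there, a full month’s hosting fee, is pocket change next to eight days of lost E-Commerce revenue, which is why a penalty that merely discounts the invoice won’t change anyone’s behavior.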



10 Comments

  1. Sean Connolly Says:

    Sounds very similar to the Microsoft/Danger/T-Mobile event last year!

    Sean

  2. Bill Bittner Says:

    As Mr. Reagan would have said: “Trust, but verify.”

    Any backup plan must also include full-scale dry runs. Pick a slow night and go through the whole thing, including switching to the backup site for a day AND coming back to the primary. Do this at least once a month, and as the last step before you begin your holiday-season systems freeze.

  3. Jim B Says:

    This is another reason showing why outsourcing without brains is a very bad thing. Outsourcing doesn’t mean “abdicating,” yet in many situations that is what seems to happen. You hand off the job to someone else but don’t keep the managing and monitoring in place in your own shop to make sure they are supporting the business each and every second of every day.

    The issue with penalties: it’s like holding a bigger stick over your dog; eventually it loses its punch. Penalties like this will not spur or enable better performance. You are just one of many customers hosted at IBM’s site and to them, that’s it, one of many.

    I’m not saying yes or no to outsourcing – as Mr. Bittner says, it has to be verified, and that is the company’s responsibility, not the fox’s.

  4. Anonymous Says:

    Was there an audit clause in the contract between the two parties? Does IBM or Oracle conduct SAS 70 Type II Audits/Agreed Upon Procedures? Does IBM or Oracle conduct tests of their backups to ensure they can recover? Just asking…

  5. Fabien Tiburce, President, Compliantia Says:

    It’d be interesting to know the liability implications and penalties involved. Outsourcing contracts typically include such clauses. Our own SaaS contract entitles customers to a 2% discount for each 1% drop in service as monitored by pingdom.com. Bruised reputation aside, I wonder what this is going to cost IBM…

  6. Sid Sidner Says:

    IBM out-sourcing has a very bad reputation among their clients. In my experience, many were looking for ways to get out of their 5 year contracts. It is sad, really.

  7. Ace DBA Says:

    Quote:
    “Once replaced, they tried to do a restore, and backups would not restore with the Oracle backup utility. They had 400 gigabytes (of data) and they were only getting 1 gigabyte per hour restoring. They got it up to 5 gigabytes per hour, but the restores kept failing. I don’t know if there was data corruption or a faulty process.”
    This seems more like a hardware problem with the tape management system. A modern LTO-3 drive can output data at better than 200MB per second (about 5 seconds per GB, or a little over half an hour for 400GB) – faster than many SANs can pass it. I would question the hardware and SAN used to do the backups here, not the Oracle recovery software. By the way, at 5GB per hour, it would have taken about three days to do the restore – one good reason to have the Data Guard failover site.

    As for Oracle Data Guard, there is no excuse at all for this not working. It is all about monitoring here – very easy to do. Now, if the redo logs were not being applied, it is quite simple to discover when this stopped happening. There is quite a bit that a really good DBA might do to get the failover site up and running. For instance, he might seek to pull the missing redo logs from the backup tapes and manually apply them to the standby database to catch it up.

    Ultimately, it comes down to who was watching the failover system and, more importantly, who is watching the watchers?

  8. HA Guy Says:

    >> As for Oracle Data Guard, there is no excuse at all for this not working. It is all about monitoring here – very easy to do.

    Actually, this article states that they didn’t even implement Data Guard and they knew that. Oracle Data Guard or some similar host-based replication technology would have likely saved them from this outage. Note that I say “host-based”, because any storage mirroring technology would have propagated the corrupted bits to the remote DR volumes (if physical data corruption was indeed the cause of their outage).

  9. Ravi Says:

    If you don’t have a good DBA team then you have to suffer.

  10. RJWitty Says:

    A reduction in outsourcing/hosting fees is not adequate. There has to be compensation for lost business. Outsourcers can’t hide behind the customer’s business interruption insurance.
