Founded in 2004, Joyent developed an innovative platform that enables teams to effectively communicate and collaborate with email, calendaring, contacts, file sharing, and other shared applications.
However, on 12 January the company’s success took a sharp nose dive as both its Bingodisk and Strongspace data storage solutions failed, and failed miserably. The services were offline for the better part of 10 days. Joyent customers couldn’t even access their accounts, much less the files stored on the backend Sun x4500s.
The company has since stopped taking on new customers for Bingodisk and Strongspace. But the more alarming message is in the company’s explanation of what happened, as posted to its blog:
Was there a backup?
Yes, and no. In the traditional sense of us writing the data from Bingodisk and Strongspace to tape or some other Thumper, no, there was no backup. Data redundancy is built into the ZFS/Thumper software/hardware combination. The Thumper is both server, and backup. Moreover, it’s hard to see how a backup of 18TB of data to another physical device would work, in practice. Moving Bingodisk to another Thumper during this crisis took 30 hours (3TB of data). A large, multi-tenant service such as Bingodisk or Strongspace with the amount of data they manage makes it practically impossible to do a meaningful backup. A single backup would take over a week. The backup process would kill end-user performance. A service like Strongspace, which people use to rsync their own backups, means the data turns over rapidly and an incremental backup would not make sense.
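The figures in that explanation can be checked with some quick arithmetic. The sketch below is mine, not Joyent's: it assumes the 30-hour, 3TB move is representative of sustained throughput, that copy time scales linearly with data size, and that a terabyte is 10^12 bytes. The function names are made up for illustration.

```python
def estimate_full_backup_hours(observed_tb, observed_hours, total_tb):
    """Scale an observed transfer linearly to a larger data set (assumption)."""
    return total_tb * observed_hours / observed_tb


def throughput_mb_per_s(observed_tb, observed_hours):
    """Average rate of the observed transfer, using decimal units (1 TB = 1e12 bytes)."""
    bytes_moved = observed_tb * 1e12
    seconds = observed_hours * 3600
    return bytes_moved / seconds / 1e6


# Figures quoted in the explanation: 3TB moved in 30 hours, 18TB in total.
hours = estimate_full_backup_hours(observed_tb=3, observed_hours=30, total_tb=18)
days = hours / 24
rate = throughput_mb_per_s(observed_tb=3, observed_hours=30)

print(f"{hours:.0f} hours, about {days:.1f} days, at roughly {rate:.0f} MB/s")
```

Under those assumptions a full copy of 18TB comes to 180 hours, or seven and a half days, which is consistent with the claim that a single backup would take over a week.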
While that seems a perfectly acceptable explanation given the environment these services were running in, it gave me a sense of satisfaction. When I discuss System Center Data Protection Manager with customers, and how it can be used to perform live backups of Microsoft Exchange Server, SQL Server and SharePoint Server instances, I know that the underlying technologies overcome the challenges being blamed for the Joyent outage. The case is even stronger when you incorporate System Center Operations Manager 2007 for monitoring and alerting and System Center Configuration Manager 2007 for patching, updating and general systems management.
I know that we’re not comparing apples to apples here. A Microsoft-centric environment is completely different from what the folks at Joyent were using to host their services. But, in theory, could a DPM-like solution have given the company a backup that, now and in the weeks to come, would have proved worthwhile?
More importantly, what is your organisation doing to ensure that the same catastrophic failure doesn’t occur on your mission-critical systems?