
Another Happy Customer

by Administrator on May 19, 2008

This just in as a comment to a much earlier post.  Thought I would promote it to a topic so that folks can offer this poor fellow their advice.

We just recovered from a failure of our NetApp appliance. Evidently, a checksum error caused WAFL metadata corruption, and the entire NAS head was offline rebuilding for over 24 hours. This type of performance from an appliance touted as Highly Available is unacceptable. NetApp communicated that they have seen this before and offered us a firmware fix. I am shocked that their support organization didn’t make every effort to communicate such a potentially damaging problem proactively. How can we trust a device with mission-critical data in the future?

Has anyone else had this problem?  He doesn’t say which model, ONTAP version, or anything else that might illuminate us regarding the equipment that failed.  But here it is…


draft_ceo May 19, 2008 at 9:15 pm

What is the value in this story other than to single out NetApp? Doesn’t every storage product have a few such issues? It would be amazingly expensive to produce a zero-defect product.

gknieriemen May 19, 2008 at 10:10 pm

draft_ceo: Then why not call it “usually available” instead of “Highly Available”?

Administrator May 20, 2008 at 8:19 am

I have two reasons for promoting this post to a topic:

1. Is this an issue endemic to all users of the product or an isolated issue? I want to know this because it goes to the heart of the product value proposition as highlighted by the consumer. His storage was promised to be highly available as a key selling point, only it isn’t.

2. Last week in Chicago, I heard from yet another Centera customer (another large hospital services provider) who spoke of a nightmarishly lengthy rebuild time when he experienced a drive failure on that platform. So, NetApp is not alone, to be sure. It just so happens that NetApp positions its stuff as an alternative to Centera and beats up on such issues as downtime when it makes its pitch to buy NetApp, not CAS. Is their alternative just as vulnerable?

Not just picking on NetApp here. Trying to get the scope of the problem.

uslacker May 20, 2008 at 11:51 am

We have about 40TB of online NetApp storage (950, 3020, 3050) and have had nothing other than drive failures in the past three years.

\\uSlacker

SysDoc May 21, 2008 at 12:10 am

Watermelons and raisins!

From your comments I “think” I understand what you’re trying to accomplish here, Jon, but IMHO you picked a very poor comment as your catalyst for the discussion.

I’ve managed hundreds of storage arrays over the years, including many NetApp filers, and their failure rates are no different than their competitors’. You can find recent examples of Tier 1 RAID arrays causing major production downtime anywhere if you look hard enough.

The major storage suppliers all do a great job of documenting best practices for availability and customizing/coaching you through them. This includes proactive alerts about potentially damaging bugs. In all cases, I’ve found that keenly understanding and religiously following a vendor’s best practices has a dramatic impact on reducing unexpected outages. It looks to me like “jmazzaro” didn’t do either and is merely throwing NetApp under the bus.

Ultimately, it seems to me you’re confusing availability with data integrity.

On the availability front, disk failures are routine on all platforms, yet the performance degradation varies. With NetApp (using RAID-DP), single disk failures are barely noticeable, since the rebuild is a low-priority process unless a second drive fails. OTOH, routine disk failures on EMC Centera are akin to denial-of-service attacks, since the system becomes unusable until the objects are reconstructed on the remaining disks or nodes.
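To make the dual-parity point concrete, here is a back-of-envelope sketch of why a RAID-DP group can afford a slow, low-priority rebuild: losing data requires two further failures during the rebuild window rather than one. All figures here are illustrative assumptions, not NetApp specifications.

```python
# Back-of-envelope only: illustrative figures, not vendor specifications.
MTTF_HOURS = 500_000   # assumed mean time to failure per disk
GROUP_SIZE = 14        # assumed disks per RAID group

def p_failure_during(window_hours, disks):
    """Probability that at least one of `disks` fails during the window,
    using a simple exponential approximation."""
    return 1 - (1 - window_hours / MTTF_HOURS) ** disks

for rebuild in (6, 24, 72):  # fast, moderate, and deliberately throttled rebuilds
    # Single parity: one more failure during the rebuild loses data.
    single = p_failure_during(rebuild, GROUP_SIZE - 1)
    # Dual parity (RAID-DP): two more failures are needed (rough approximation).
    dual = single * p_failure_during(rebuild, GROUP_SIZE - 2)
    print(f"{rebuild:>3}h rebuild: single-parity loss ~{single:.1e}, dual-parity ~{dual:.1e}")
```

Even a rebuild stretched out over days leaves the dual-parity loss probability orders of magnitude lower, which is why the rebuild can run at low priority without gambling with the data.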

On the data integrity front, system and data consistency checks are extremely rare on both platforms. The major difference is that with NetApp you’re aware data has potentially been lost, and the WAFL scan attempts to rebuild it or tell you what has been lost. With EMC’s Centera clusters, data loss is silent. The system never tells you of data lost due to hash collisions, API bugs, or self-healing delays overflowing ingest queues. Many departments I know have to run weekly maintenance scripts from the archiving apps to reconcile what Centera stored with what was expected by the app. It’s a little shocking how often those two indexes don’t match!
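For readers curious what such a weekly reconciliation job might look like, here is a minimal sketch. The file names and the one-object-ID-per-line format are assumptions for illustration, not any particular archiving app’s export format.

```python
# Minimal reconciliation sketch: compare the object IDs the archiving
# application believes it stored against what the archive reports holding.
# File names and formats are hypothetical.

def load_ids(path):
    """Load a set of object IDs, one per line (format is an assumption)."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

app_index = load_ids("app_index.txt")          # what the application expected
archive_index = load_ids("archive_dump.txt")   # what the archive says it holds

missing = app_index - archive_index    # ingested by the app, absent from the archive
orphaned = archive_index - app_index   # in the archive, unknown to the app

print(f"{len(missing)} objects missing from archive")
print(f"{len(orphaned)} orphaned objects in archive")
```

The whole point of the exercise is the `missing` set: on a platform that loses data silently, this crude set difference is often the only way to find out.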

OzzieJohn May 21, 2008 at 7:14 am

I know enough about NetApp kit to know that these kinds of problems are rarely seen, and when they are seen, they are usually found in environments that don’t follow best practice. For example, for an entire NAS head to be down, it’s likely that the system was not configured according to best practice (a rough audit sketch follows this list), e.g.:

1. All data was installed on a single aggregate, with no dedicated root volume

2. Not running active/active controllers (though corruption of one volume/aggregate might still cause downtime for systems depending on data held on that volume/aggregate)

3. Aggregate-level snapshots were turned off (when enabled, these provide a very fast recovery method for most file-system-level problems)

4. Not keeping the array software up to date.
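As a rough illustration of auditing those points, here is a sketch that shells out to a filer over SSH and dumps the output relevant to each check. The host and aggregate names are hypothetical; the commands (aggr status, cf status, snap sched -A, version) are Data ONTAP 7-mode era administrative commands, so adjust for your release and verify against your own documentation.

```python
# Hypothetical best-practice audit for a 7-mode era NetApp filer.
# Host and aggregate names are placeholders; output parsing is left manual.
import subprocess

FILER = "filer1.example.com"  # hypothetical host name
AGGR = "aggr0"                # hypothetical aggregate name

def run_on_filer(cmd):
    """Run an administrative command on the filer via ssh and return its output."""
    result = subprocess.run(["ssh", FILER, cmd], capture_output=True, text=True)
    return result.stdout

# Each check pairs a best-practice question with the command whose output answers it.
checks = [
    ("Is everything piled into a single aggregate?", "aggr status"),
    ("Is controller failover configured and enabled?", "cf status"),
    ("Are aggregate-level snapshots scheduled?", f"snap sched -A {AGGR}"),
    ("Is the software release current?", "version"),
]

for question, cmd in checks:
    print(f"== {question} ({cmd}) ==")
    print(run_on_filer(cmd))
```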

On that last point, from the sound of the problem and the fix, I suspect the customer was affected by a firmware bug that NetApp identified in its ATA controllers in March of last year; it was communicated then and reinforced in a subsequent notification in August.

As far as his statement that “I am shocked that their support organization didn’t make every effort to communicate such a potentially damaging problem proactively” goes, I suspect the poor guy either didn’t get the notifications because his contact details weren’t correctly entered in the NetApp support database, or whoever got the notification didn’t get around to doing anything about it. From my experience, anything that might cause significant downtime or data loss is communicated pretty quickly.

As far as a downtime of 24 hours goes, I agree that this kind of thing is unacceptable; however, I notice that he said he had just recovered after 24 hours, and thankfully there was no mention of data loss. This is consistent with both my experience and that of others I have spoken to. I have never, not once, ever heard of a customer losing data on a NetApp array.

My experience of recovering from this kind of problem on other arrays (and I’ve seen firmware bugs cause all kinds of nasty problems on different vendors’ arrays) usually involves restoring data from tape, a long, stressful period of downtime, and a corresponding loss of 12–24 hours’ worth of data too, which is usually much more serious than the downtime alone.

I hope the guy who reported the problem takes the time to get some advice on how to set up his array correctly to achieve the highest levels of availability, or, if he’s pissed enough at NetApp to buy a replacement, that whoever he buys it from advises him on how to set up that array to avoid potential pitfalls.

Regards
OzzieJohn

Administrator May 21, 2008 at 8:49 am

Folks, I have no intention of dissing NetApp on this issue. From what I have gleaned from the broad majority of users of their gear, it does the job they bought it to do. I wasn’t sure whether either software or support had degraded and whether this was indicative of emerging problems. Sounds like we still have a large cadre of happy shiny customers.

Again, when I promoted the comment to a post, it was mainly to reach out to lots of readers to see whether there were concerns out there that were going unstated. A lot of times, folks think that their problem is unique to them, when in fact it is part of a larger problem.

My sense from the comments thus far is that this is not the case here.

TimC May 21, 2008 at 9:34 am

Well… assuming in this case he REALLY wants an answer and this isn’t just a NetApp bash, I can almost guarantee it was one of two things:

1. he never signed up for a NOW account.

2. he signed up for a NOW account, and decided to uncheck the little box on sign-up that opts you in to receiving service bulletins (WHO WOULD WANT THAT SPAM!?)

I know exactly what issue he’s talking about, and it was fixed over 6 months ago. There were at least 2 emails that went out urging you to upgrade your firmware immediately or face this corruption, as well as a giant blinking alert on the NOW site itself.

Then again, I’d question whether that’s from a real NetApp customer at all, given the wording and inflammatory nature, not to mention the lack of any real information. Sounds more like a competitor’s sad attempt at trying to stir the pot to me.

ericmb May 27, 2008 at 6:25 pm

Thought I’d shed a bit of sunshine on this cynical blog. I know the weather is shit in NL, but come on…

This blog reminds me of my tech support days: all you hear about is problems. You never see the good things because, hey, they’re not mentioned.

My current assignment is with a massive telecom, and yesterday I had a guy walk up to me regarding performance problems with a FAS980. I won’t go into details, but this NetApp filer had been up and running for 660 days.

You just don’t get these stories on this blog, do you?

PS: summer 2003 in NL was great!

Administrator May 27, 2008 at 6:42 pm

I am only a little bit cynical, Eric. And I will publish (and have published) testimonials from happy shiny consumers whenever I get them, including yours!

I do seek out two things here:

First, I want to understand and to ask questions.

Second, I want to know when vendors are BSing me, which they seem to do a lot…or at least often enough to make me cynical.

Lots of blogs out there written by the vendors themselves if you just want the party line.

andriven June 2, 2008 at 11:53 am

Yes — this is something of a ditto entry but….

The description does sound VERY much like the issue fixed in the critical firmware revision for the AT-FCX modules a while back.

The NOW site is quite good, as are the automated email alerts and ticket-opening features.

Knock on wood, but we now have 80%+ of our ~50 TB of data in a NetApp clustered-head arrangement and have had no downtime in the last five years outside of planned outages for power (i.e., UPS upgrades) or major NetApp hardware upgrades (i.e., stuff we prefer to plan as a late-night outage).

I’m the only NetApp admin here and while I do keep up with best practices and version updates I only spend roughly 20% or less of my time on everything NetApp (including day to day LUN/volume creation, etc.).
