Full Disclosure – The Response

by Administrator on October 20, 2006

Per the promise made to BVN and others, here is the full text of the response made by Bruce Moxon, Senior Director, Strategic Technology for Network Appliance — with additional comments by Dave Hitz, CTO Emeritus — to our questions from the previous post. Questions are included in this text, so you don’t have to go back to the previous post to find them.

Before beginning, I want to thank Bruce and Dave for their candor and comprehensiveness. I am told that there were voices inside Sunnyvale that questioned the efficacy or value of responding to Toigo’s blog. I am happy that cooler heads prevailed. The purpose of the original discussion of GX was to discover the truth about the architecture and operational characteristics of the product. That there were so many responses to the post was surprising and seemed to confirm that, like me, many were confused about this product.

Here are the responses in Q&A Format:

Preamble from Moxon:

Before jumping in and directly addressing your questions, I want to make a point about architecture and implementation. In the ensuing discussion, I will sometimes differentiate between these, as in “the architecture supports …”, or “the initial implementation employs …”. These are meant as points of clarification that I hope will give you both the near-term “how does it work” information you seek and some insight into how the architecture will evolve over time to meet a number of needs that it may not yet meet today.

The first thing you should know is that the ONTAP GX product currently shipping is a *true* integration of Spinnaker Networks’ scale-out NAS architecture and a subset of NetApp’s ONTAP, a highly resilient, highly functional microkernel operating system that has been at the core of NetApp’s products for over a decade. The subset of ONTAP that has been integrated thus far includes: the core container architecture (WAFL, ondisk checksums, raidgroups, RAID-DP, aggregates, flexible volumes), snapshots, and GX SnapMirror (within a GX cluster). Over time, additional functionality, such as FlexClone, SnapVault, and interoperable SnapMirror (with non-GX NetApp systems) will also be supported. New features, including load sharing mirrors (read-only replicas) and striped volumes (volumes that span multiple controllers and their disks) have also been added to the system, specifically to support high performance file system requirements.

The core Spinnaker architecture contributes the following key features, in line with its original design philosophy:

  • Horizontal scaling (scale-out), to facilitate the development of very large storage systems (in terms of both capacity and performance) from modular components. This allows us to pull multiple NetApp controllers (i.e. filers) into a single system, with data spread across the individual controllers within that system, and clients spread across many interfaces on the front-end that are logically all part of the same storage system.
  • True virtualization from a data access perspective – i.e. regardless of where a client connects into the storage system (which of a number of interfaces they’re bound to), they see a coherent view of the world. That manifests itself as a common namespace for NAS protocols, and will do so as a large addressable storage pool and set of LUNs for block protocols when that functionality comes online.
  • Truly transparent data migration. This allows storage managers (and eventually rule-based policy engines) to migrate data containers among controllers in the cluster with no change in the logical path to the data (file path, LUN and LBA) and no disruption to the application – in the NAS case, even while files are open and locks are held.

As mentioned previously, the NetApp core storage architecture brings much of NetApp’s unique “DNA” to the table, including our highly resilient “container architecture” — WAFL, NVRAM, ondisk checksums, raidgroups, RAID-DP, aggregates, flexible volumes — and our unique approach to snapshots, mirroring, and (in an upcoming release) cloning. For more detail on NetApp’s “DNA” and why it is different, see CITE TO PDF and CITE TO PDF.

From an architecture perspective, ONTAP GX is what I would call a scalable switched storage architecture, a basic two-node instance of which is shown in figure 1.

Figure 1. Basic Switched Architecture

Clients are distributed across a set of virtual interfaces (VIFs) at the front end of the storage “nodes” in the scale-out configuration (i.e. scale-out “cluster”). A VIF is the GX representation of an IP/MAC address combination that provides client network access. At system configuration time, VIFs are “bound” to physical storage controller networking ports. That binding can be dynamically updated, either manually or automatically, in response to a network port or controller failure, or as part of standard storage cluster expansions. Such port migrations are nondisruptive to the client applications. A client connects to the storage system through a single VIF at mount time. Mapping of clients to VIFs can be done by any of a number of means, including static (partitioned) assignments, DNS round-robin, or Layer 4 load-balancing switches.
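The VIF rebinding described above can be sketched in a few lines of Python. This is only an illustrative model, not NetApp code: the `Port` and `Vif` classes, the `rebind_if_needed` function, and the port names are all hypothetical, and the ordered failover list is an assumption standing in for whatever failover rules a real cluster applies.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    name: str          # hypothetical label for a physical controller port
    up: bool = True    # whether the port (and its controller) is healthy


@dataclass
class Vif:
    address: str       # the IP address clients mount through; never changes
    port: Port         # the physical port the VIF is currently bound to
    failover: list = field(default_factory=list)  # ordered candidate ports


def rebind_if_needed(vif):
    """Keep the VIF on its port if healthy, else move it to the first healthy candidate."""
    if vif.port.up:
        return vif.port
    for candidate in vif.failover:
        if candidate.up:
            vif.port = candidate   # the client-visible address is untouched
            return candidate
    raise RuntimeError("no healthy port available for " + vif.address)


primary = Port("node1:e0a")
standby = Port("node2:e0a")
vif = Vif("10.0.0.5", primary, [standby])

primary.up = False                  # simulate a port or controller failure
print(rebind_if_needed(vif).name)   # prints "node2:e0a"
```

The point of the model is that the client keeps talking to the same address while the binding underneath it moves, which is why such a migration can be nondisruptive.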

Data is spread across what you can think of as a “partitioned SAN” on the back end, with each node connected to a subset of the overall storage, and volumes residing in one (and only one) partition. The controllers themselves have two major functional components to them:

  • an N-Blade (network-facing) that talks file access protocols (NFS, CIFS) out the front, and an optimized scale-out cluster protocol (SpinNP) out the back; and
  • a D-Blade (disk-facing) that talks SpinNP out the front, and a storage protocol (in our case ONTAP/WAFL) out the back

The volumes that are physically distributed across the individual D-Blades are “stitched together” into a global namespace by connecting each one into a directory in a “parent” volume at time of creation. This concept is depicted in figure 2.

Figure 2. Global namespace construction from distributed volumes
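One way to picture this stitching is as a table of junction points, each mapping a directory in the namespace to the volume connected there. The Python sketch below is a hedged illustration only: the `resolve` function, the junction table, and the volume names are hypothetical, and longest-prefix matching is an assumed way of modeling nested junctions rather than a description of the GX implementation.

```python
def resolve(junctions, path):
    """Return (volume, path inside that volume) for a client-visible path."""
    parts = [p for p in path.split("/") if p]
    # Try the longest prefix first so a nested junction wins over its parent.
    for cut in range(len(parts), -1, -1):
        prefix = "/" + "/".join(parts[:cut])
        if prefix in junctions:
            rest = "/" + "/".join(parts[cut:])
            return junctions[prefix], rest
    raise KeyError(path)


# Hypothetical namespace: a root volume with two volumes connected beneath it.
junctions = {
    "/": "root_vol",
    "/eng": "eng_vol",
    "/eng/proj1": "proj1_vol",
}

print(resolve(junctions, "/eng/proj1/src/main.c"))  # ('proj1_vol', '/src/main.c')
print(resolve(junctions, "/home/alice"))            # ('root_vol', '/home/alice')
```

The client sees one tree under a single mount point, while each subtree can live in a different volume on a different node.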

When a client request is made to the N-Blade, a quick, in-memory lookup is done (in the VLDB, or volume location database) to determine which node has the data for the requested file, and the request is “routed” (through the cluster fabric switch) to the appropriate D-Blade. (Think of the VLDB as a “volume router” — analogous to a network router, which caches its maps in memory for *very rapid* switching). The target D-Blade makes the appropriate storage request, and returns the necessary information to the N-Blade, which then returns it to the client. This all occurs over the SpinNP protocol, a lightweight RPC protocol engineered from scratch by the Spinnaker team to ensure low latency and efficiency of storage network operations.
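The request path just described can be modeled with a small Python sketch. Everything named here is hypothetical: `NBlade`, the dictionary standing in for the VLDB, and the nested dictionaries standing in for D-Blades. It also assumes that a request for a volume owned by the receiving node needs no trip across the cluster fabric.

```python
class NBlade:
    """Toy network-facing component that routes requests by volume."""

    def __init__(self, node, vldb, dblades):
        self.node = node
        self.vldb = vldb        # in-memory map: volume name -> owning node
        self.dblades = dblades  # node -> {volume: {path: data}}, a D-Blade stand-in

    def read(self, volume, path):
        owner = self.vldb[volume]               # in-memory lookup, no network traffic
        hops = 0 if owner == self.node else 1   # at most one trip across the fabric
        data = self.dblades[owner][volume][path]
        return data, owner, hops


# Two nodes, each owning one volume.
dblades = {
    "node1": {"vol_a": {"/f": b"A"}},
    "node2": {"vol_b": {"/g": b"B"}},
}
vldb = {"vol_a": "node1", "vol_b": "node2"}

n1 = NBlade("node1", vldb, dblades)
print(n1.read("vol_a", "/f"))  # (b'A', 'node1', 0): served locally
print(n1.read("vol_b", "/g"))  # (b'B', 'node2', 1): routed to the owning node
```

Because the routing table is consulted in memory on the node that received the request, finding the data adds no network round trip of its own in this model.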

So, bottom line … you spread the clients out the front, the data out the back, and leverage a scalable cluster fabric that allows you to scale controller performance and end-to-end client and storage network bandwidth as capacity scales. Furthermore, because the individual storage partitions are themselves scalable in both capacity and performance, the architecture can provide a wide range of storage system characteristics – from “wide and thin” (high performance, low capacity) to “narrow and deep” (higher capacity, lower performance).

The ONTAP GX system employs the same physical hardware that ships in our mid-range and high-end systems – the FAS3050 and FAS6070 controllers, and our standard FC or SATA drive shelves. This approach leverages the resiliency of well tested storage building blocks.

Furthermore, we currently implement ONTAP GX systems as shown in figure 3.

Figure 3. ONTAP GX Initial Implementation

This configuration leverages NetApp’s traditional “cluster failover pair” architecture (i.e. HA clustering) to provide resiliency at the controller level, in addition to that provided by the WAFL/RAID-DP features implemented at the storage level. This includes failover of VIFs in a specified manner (based on VIF failover rules). A redundant networking approach is then implemented – taking two ports out of each controller into *each* of the Client and Grid fabrics, and wiring each into a separate network fault domain (switch or fault-isolated VLAN).

The result is a robust architecture that leverages all of NetApp’s tried-and-true resiliency features, and extends the scale-out architecture using redundant networking, leveraging the best of both HA and scale-out clustering approaches in an integrated fashion.

1. GX is hailed as a clustered storage solution. Is this characterization correct? If so, please define what you mean by the term “storage cluster” as there seem to be different definitions floating around the industry. Are we talking about a “failover” cluster? Are we talking about a cluster that enables storage scaling – aka “grid” or “utility” clustering — or something completely different?

As you note, “clustered” is an often ill-qualified term that leaves open for interpretation the sorts of issues you raise. I prefer to use the term “scale-out” when describing the use of parallel/distributed architectures to effect scalable performance. That having been said, in ONTAP GX, we use *BOTH* HA clustering approaches AND scale-out approaches as described above. For more thoughts on the “g-word”, you may want to refer to my blog, in particular HERE and HERE.

2. Following on to the first question: is there a difference between storage clustering as it pertains to file clustering versus block storage clustering? Is such a differentiation, in your view, correct or specious?

I’ll take your question here to be directed at approaches to “scaling out” file systems and block storage systems. I think there has been a difference to date – largely because of the focus of implementations, but I think we are starting to see some convergence in this space. I call ONTAP GX a switched storage architecture specifically because it addresses (again, architecturally) both file and block access protocols in front of a scalable backend.

In essence, existing block architectures already do some form of clustering (distributing storage across multiple controllers, both for HA and for performance), but are limited in the scale they’ll support (within a box). The scalability of an individually addressable block device has also been less critical to date, in my mind – because the most common applications (databases) have evolved over the years to make effective use of lots of independent block devices (e.g. Oracle ASM). These databases are “clustered storage systems” in their own right — in both a performance/scale-out and HA context when teamed with redundant fabrics and multi-pathing.

I think you will see a blurring of these lines over the next few years, especially as iSCSI continues to proliferate. The result, I think, will be something close to what GX will deliver.

3. GX is said to embody elements of the technology forwarded by Spinnaker Networks, which was acquired by Network Appliance a few years ago. What specifically about GX has been derived from Spinnaker? Is there a direct lift of Spinnaker or is Network Appliance speaking conceptually or metaphorically about Spinnaker when they claim that certain elements of GX are taken from Spinnaker?

Fundamentally, the distributed N-Blade architecture, VIF failover, volume routing, and SpinNP protocol come directly from the Spinnaker acquisition. The core NFSv3 and CIFS protocol stacks in ONTAP GX are also derived from the Spinnaker implementations. The NetApp contribution is the traditional storage container architecture (WAFL, raidgroups, aggregates, volumes) that also supports efficient snapshot and clone capabilities. The combined engineering teams have been working on new features, including load sharing mirrors, striped aggregates and volumes, and block protocol support.

4. I have always been curious about the inherent incompatibilities between Andrew File System (which I understood to be the core of the Spinnaker product) and Network Appliance’s Berkeley Fast File System-derived WAFL file system and how they were reconciled in the creation of this product. Can you cast some light on this question?

AFS itself (i.e. the code) is not at the core of the Spinnaker product. Rather, some of the architectural concepts of AFS (and other parallel/distributed file systems) influenced its design. But the team also improved on the original AFS architecture – for example, the original AFS system required a special client-side protocol stack, whereas the Spinnaker implementation did not. The AFS influence led us to develop a 2-stage filesystem as described above, where the client-accessible services were implemented in the N-Blade, and the actual protocol-independent filesystem was implemented in the D-Blade. In ONTAP GX (the integrated product), WAFL is used to implement the D-Blade.

As Dave Hitz indicated to you in a separate note, WAFL is not based on the Berkeley Fast File System, but rather a new filesystem implemented from scratch.

In the above referenced note, Dave Hitz wrote:

WAFL is not based on the Berkeley Fast File System. It is very different. We did use a lot of code from the Berkeley release, like the whole TCP/IP stack, lots of low-level boot code and drivers, chunks from a variety of administrative commands and daemons, but WAFL and RAID we implemented from scratch.

When you ask about AFS and WAFL, things get confusing.

The word “file system” is tricky, because it refers to two very different things. There are “disk file systems”, whose main job is to convert logical file-based requests into block requests to disk, and there are “network file systems” whose job is to transport logical file-based requests from a client to a server somewhere else.

AFS is a network filesystem, like NFS. WAFL is a disk file system. The original AFS team also developed a disk file system called “Episode”. In some ways, WAFL is more similar to Episode than to Berkeley FFS, so it’s not surprising that the Spinnaker folks would find that it had the features they needed to develop AFS-like functionality.

5. From the whitepaper about GX released at the time of the product announcement, there is on page 8 (actual, not PDF file pagination) the following text: “When an NFS request comes to a Data ONTAP GX node, if the requested data is local to the node that receives the request (red dotted line in the figure), that node serves the file. If the data is not local (purple dashed line in the figure), the node routes the request over the private switched cluster network to the node where the data resides. That node then forwards the requested data back to the requesting node to be forwarded to the client. The routing of the request is completely transparent to the client. Each Data ONTAP GX node is only one hop away from any other node in the cluster across a low-latency private cluster network.”

a. Am I correct in my reading of this document that there is some sort of extra hop to data – a look-up to a metadata table or data layout table of some sort — for each NFS write request directed to the cluster? If not, please explain the routing of data to the correct destination node from the perspective of both (1) a simple write operation and (2) the load balancing capability claimed by Network Appliance for this product.

Yes, there is an extra “hop” for data access operations. As mentioned above, the client request is resolved through an in-memory VLDB lookup (“volume routing”) to the node that “owns” that volume. A SpinNP request is then made to the requisite D-Blade. The Spinnaker architects have described the SpinNP protocol as an RPC version of the original NFS Vnode layer, augmented to deal with multi-protocol access and locking.

Load balancing is accomplished at a couple of levels. First, client sessions are distributed across N-Blades, and data across D-Blades, so that a statistical distribution of activity across [user, data] sets is inherently load balanced. Next, client sessions can be dynamically (non-disruptively) migrated across controllers (VIFs) to alleviate client-side load imbalances. And data can be dynamically (non-disruptively) migrated across D-Blades to alleviate storage-side load imbalances. These latter two activities are currently manually initiated operations; they will be part of an automated, policy-driven monitoring and management system in the near future.

There are additional means of distributing load across the system – especially in high concurrency, high throughput applications (e.g. technical computing).

  • Read-only replicas (we call them load sharing mirrors) are point-in-time, read-only copies of a volume that effectively increase the read “fanout” of the system. When a client read request comes in to a node that has a load sharing mirror of the requested volume, that data is served directly out of the same node’s D-Blade, rather than requiring the SpinNP “hop”.
  • Striped volumes (where a volume is striped across many controllers) cause files within that volume to be physically stored on the spindles behind multiple controllers, providing a means of distributing load for access to a single volume. Small files are “scattered” across the storage in each node, and large files are segmented and distributed across multiple nodes. Segment size is set at time of volume creation, and can vary from volume to volume. The result is concurrent access both to individual files, and to distinct segments of large files.
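To make the striping idea concrete, here is a hedged Python sketch of how a byte range in a large file could map onto nodes. The response above does not say how segments are placed, so the round-robin placement is purely an assumption for illustration, and `nodes_for_range` and the node names are hypothetical.

```python
def nodes_for_range(offset, length, segment_size, nodes):
    """List the node holding each segment touched by a byte range.

    Assumes fixed-size segments placed round-robin across `nodes`,
    which is an illustrative policy, not documented GX behavior.
    """
    if length <= 0:
        return []
    first = offset // segment_size
    last = (offset + length - 1) // segment_size
    return [nodes[i % len(nodes)] for i in range(first, last + 1)]


MIB = 1024 * 1024
nodes = ["node1", "node2", "node3"]

# A 3 MiB read with 1 MiB segments touches one segment on each node.
print(nodes_for_range(0, 3 * MIB, MIB, nodes))  # ['node1', 'node2', 'node3']
```

Under this assumed layout, a large sequential read fans out across several controllers at once, which is the effect the striped volume feature is meant to deliver.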

Dave Hitz sent me another email with the following additional input on Question 5a.

Wow. I don’t know about you, Jon, but I definitely learned some things in reading this response. Thank you Bruce!

There was one spot that it wasn’t clear to me whether Bruce answered exactly the question you were asking. Let me poke at that, and of course — since Bruce understands this stuff a lot better than me — I’ll rely on him to correct me if I screw up.

In question 5a you ask about “some sort of extra hop”, like “a look-up to a metadata table”.

If the data lives on the node that received the request, then there are no hops at all. If the data is on a different node, then of course there is one hop, to go get the data, but there is no extra hop above and beyond that first one. GX has a distributed database that pushes the “metadata table” to each node, so the “look-up to a metadata table” is local.

When Bruce said “there is an extra ‘hop’ for data access operations”, I believe that he was referring to the first, obvious hop that gets the data, not to any sort of extra hop required to find the data. That kind of extra hop would be required if you had an architecture with a centralized meta-data node that all other nodes needed to consult. In a distributed database architecture, you only need the one hop to get the data. As you observe, the centralized architecture has issues with both scaling and with failure modes.

b. If I am correct that some sort of extra hop is required to place data (and possibly to retrieve data) from this product, what does this mean in terms of (1) performance as additional nodes are added to the configuration and (2) the availability of the configuration and its vulnerability to single points of failure?

The implications are that you need a scalable interconnect as the Grid/Cluster fabric. Such technology is commonplace these days – in the form of scalable switch technology. We are currently using GbE and 10GbE, and are looking at the possible advantages both native IB and IPoIB/SDP implementations of SpinNP may provide.

As for SPOF, we work with our customers to ensure they deploy a fully redundant networking infrastructure for the grid/cluster fabric – including redundant switches or fault isolated VLANs.

Dave Hitz added the following:

In question 5b you ask about the impact of the extra hop on scalability as you add extra nodes.

Because this is a distributed database architecture, there is essentially no impact from adding extra nodes. You have at most one hop no matter how many nodes. You do have to distribute the database to more nodes, but it’s a small database that doesn’t get updated often, so impact is low.

One performance impact of scaling is that data is more likely to be on a different node than the one that received the request. Assuming that requests are completely random, you’ll go off-node half the time in a two node system, but 90% of the time in a 10 node system. (Of course, things often are not random, and with smart management you can move data to the nodes that get the most requests.)
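The figures quoted here follow from a simple ratio: if requests land uniformly at random and each node owns an equal share of the data, a request finds its data locally with probability 1/N and goes off-node with probability (N - 1)/N. A minimal Python check, with `off_node_fraction` as a hypothetical helper name:

```python
def off_node_fraction(nodes):
    """Share of uniformly random requests whose data lives on another node."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return (nodes - 1) / nodes


for n in (2, 10):
    print(f"{n} nodes: {off_node_fraction(n):.0%} of requests go off-node")
# 2 nodes: 50% of requests go off-node
# 10 nodes: 90% of requests go off-node
```

The fraction approaches 100% as the cluster grows, which is consistent with tuning against a workload where every request goes off-node.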

For performance tuning, the engineers generally arrange the tests so that 100% of the requests go off-node. (Of course, they also optimize same-node requests, but that’s the easier case.)

When doing the giant SPEC-SFS result, they estimated the required system size by testing two-node systems with 100% off-node traffic, and then doing simple math to figure out how big a system would be needed. They saw almost zero degradation as they scaled to the full-sized cluster. To me that linear scaling was probably the most impressive aspect of the whole exercise.

Of course, the more nodes you have the more bandwidth you need between them, so you obviously need to scale the network interconnect appropriately.

Okay Bruce, fire away if I’m all confused about this. :-)

c. What does Network Appliance see as the potential value of Parallel NFS extensions to NFSv4 as an adjunct, enhancement or improvement in IO hopping in the current GX architecture?

pNFS effectively moves the volume lookup operations into the client, which caches this information from the pNFS “metadata server”. This removes the need for the “hop” during file access. I’ve put “metadata server” in quotes because the pNFS protocol per se doesn’t indicate how this might be implemented. Many people have horror stories about resource-constrained/overloaded out-of-band metadata servers for SANs. pNFS anticipates the development of scalable metadata servers that speak that part of the pNFS protocol.

In our case, ONTAP GX is itself already a parallel metadata server (with the information distributed across volumes spread across the D-Blades), in addition to a parallel data server. GX therefore provides an excellent foundation for pNFS implementation.
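The idea of a client caching location information can be shown with a toy model. The Python sketch below is not the pNFS protocol and makes no claim about how GX would implement it: `MetadataServer`, `CachingClient`, and the dictionary-based data nodes are all hypothetical stand-ins, meant only to show the lookup happening once and later reads going straight to the node that holds the data.

```python
class MetadataServer:
    """Toy stand-in for whatever answers 'where does this volume live?'."""

    def __init__(self, locations):
        self.locations = locations  # volume -> node holding its data
        self.queries = 0            # counts how often clients had to ask

    def locate(self, volume):
        self.queries += 1
        return self.locations[volume]


class CachingClient:
    """Toy client that remembers volume locations after the first lookup."""

    def __init__(self, mds, data_nodes):
        self.mds = mds
        self.data_nodes = data_nodes  # node -> {volume: {path: data}}
        self.cache = {}

    def read(self, volume, path):
        if volume not in self.cache:
            self.cache[volume] = self.mds.locate(volume)  # asked once per volume
        node = self.cache[volume]
        return self.data_nodes[node][volume][path]        # direct to the data node


mds = MetadataServer({"vol_a": "node2"})
client = CachingClient(mds, {"node2": {"vol_a": {"/f": b"A", "/g": b"B"}}})
client.read("vol_a", "/f")
client.read("vol_a", "/g")
print(mds.queries)  # prints 1
```

In this model the location lookup is paid once per volume, and the server-side forwarding step disappears from the data path.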

6. Is hop count and hopping methodology important when considering a cluster storage solution? What other criteria should guide consumers to choose one storage clustering product or another?

It can be, just as client network latency can be important in delivering appropriate service levels to clients from any networked storage system. A well engineered and provisioned grid/cluster fabric is important to get the most out of ONTAP GX, just as client networking configurations and SAN configurations are key in their respective implementations.

7. Regarding your SPEC.org tests published in concert with the announcement:

a. I want to confirm that you claimed to have realized 1,032,461 ops/sec to demonstrate that the GX is capable of supporting the IOPS required in an HPC environment. Is this correct?

Yes, it’s true we have realized in excess of 1M SpecSFS ops. I would not necessarily characterize that as a demonstration that “GX is capable of supporting the IOPS required in an HPC environment”. As I’m sure you know, many HPC environments require significant sequential I/O performance for large reads/writes – either in addition to, or instead of high aggregate random access performance. The SpecSFS benchmark results clearly don’t address that type of workload. For large sequential I/O workloads, we are happy to have our customers benchmark GX with workloads that are representative of their processing.

Some “technical applications”, however, do benefit directly from this type of performance. Many EDA applications, and many large-scale software builds, both of which generally fall into the technical computing category, do benefit from the scalability of smaller read/writes and other NFS ops (getattr, lookup).

Additionally, you might be surprised at the number of “HPC apps” I’ve seen that use standard Linux/Unix buffered I/O (which does 4KB reads/writes under the hood). Lots of HPC processing “pipelines”, e.g., include extensive use of Unix shell scripts and Perl. All of these use standard buffered reads and writes – at 4KB apiece.

b. In the configuration tested, the cluster submitted comprised a huge disk configuration as well as a large computing front end, suggesting that given enough resources and budget GX can achieve the IOPS required in a high performance compute environment.

The Configuration tested comprised:

  • Drives: 2016 – 73GB 15K drives / FC
  • Disk Control: 2 per Node (48) – FC / Qlogic
  • Nodes: 24 Nodes / 4 Processors each = 96 Processors (Opteron)
  • Memory: 32GB Cache per node = (768GB)
  • Network: 24 Network Connections – Client – 1 per Node (GIGE Jumbo)
    48 Cluster Connections – Cluster – 2 per Node (GIGE Jumbo)
  • File System Config: 120 RAID-DP (Double Parity) groups of 16 disks each
  • Total/Raw: 72GB * 2016 = 145TB
  • Protection: Double Parity
  • Usable: 72TB
  • Drives per Node: 42
  • File System (Drives): 14
  • File Systems per Node: 3
  • Name Space: 1 Name Space – addressed as 1 FS mount point.
  • Total File Systems: 72 (3 X 24 nodes): As described by the launch articles / SPEC.org results, all file systems are administered separately from the GX Name Space. This results in the management of 72 file systems across 24 nodes separately from managing the Name Space.
  • Max File System Size: 16TB per volume.
  • File System/Protocol Support:
    TCP/IP = TCP results published / UDP Results – assume 10% increase
    CIFS = NO
    AFP = NO
    NFS = YES / WAFL

This should be an accurate depiction of the configuration as it is abstracted straight from the SPEC web site. If not, please clarify any errors.

I haven’t checked your (presumably cut-n-pasted) config against the Spec website, but if you took it straight from there, it is accurate. Spec includes a formal submission process coupled with peer review.

c. The SPEC.org configuration was set up as 24 nodes in one large cluster, with separate file systems joined by the GX Name Space, which I understand to be a Global Name Space that virtualizes the access to file systems. The GX system tested used the 6070A filer platform, which Network Appliance says has the ability to scale to 6 Petabytes in capacity — a theoretical limit, I believe, and a measure of pure disk space that has not been validated. The spec results show a DP platform with raw numbers @ 145TB’s. If these assumptions are accurate, then the following questions can logically follow:

(1) Is GX clustering creating a bifurcated system, where the Name Space and the file systems are disconnected from one another?

No. There are some semantics here that Spec is preserving in order to keep vendors (including NTAP) honest. Specifically, they characterize the volumes created in a GX system as individual “file systems” that are then aggregated under a single namespace. The reality of GX is that volumes are “parented into” the namespace upon creation. The result is a single mount point that provides access to all of the data across all of the volumes in the storage system.

(2) If (1) is true, doesn’t this imply that a new order of management capability will be required – to manage the Name Space, the File Systems of individual Filer nodes, as well as massive numbers of RAID-DP groups required to reach the maximum capacities touted by Network Appliance?

ONTAP GX does require the management of aggregates and volumes that are spread across the individual controllers. That capability is provided by our current ONTAP GX management suite. Additionally, we are in the process of developing more comprehensive tools that allow provisioning “templates” to be quickly applied to common provisioning tasks in a data-centric manner. These templates can be applied to quickly create common directory structures (e.g. for different users or different aspects of a project) that instantiate multiple volumes at once – greatly simplifying the task of provisioning storage in GX.

(3) As a practical matter, what do you see as the overall cost of such a platform in terms of hardware investment, software costs (including management software, assuming that there is cluster management software that is capable of managing this configuration), and soft costs including labor, electricity, etc.?

Not sure I can get too specific on that front with you. I can tell you that our GX clusters are fashioned from the same building blocks as are our standard filers (i.e. FAS3050 and FAS6070). And we do understand the economics of the HPC/technical computing market; we expect to be able to deliver competitively priced systems.

d. The SPEC test, of course, does not get into comparisons of performance between different products (even from the same manufacturer). However, your published performance data on the 6070 platform (SPECsfs97_R1.v3: NetApp GX 6070A 5/2006) provided a result of 1032461 Ops/Sec (Overall Response Time = 1.53 msec average) that simple math suggests would produce 43019 IOPS per node in a 24 node environment – that is, in one NOT running ONTAP GX or its Name Space. Your performance with ONTAP GX in the clustering test is a fraction of that number. So, is it correct to assume that you lose between 36 and 58% IOPS efficiency when you add the “cluster overhead” of ONTAP GX? If not, where are all the IOPS going?

So, given my answer to the previous question, you could “do the math” and find that the overhead is ~36% (not sure where you get 58%) on Spec ops. Keep in mind that Spec ops are skewed towards the “short transaction” end (especially for “dataless” ops like getattr) and will thus show a higher overhead than will large sequential ops. Our engineering target, which we are still striving for and expect to meet, continues to be performance parity between 7G and GX on a per-node basis, with a 15% overhead for “remote” I/O operations (i.e. SpinNP). Customers deploying GX find the additional functionality – global namespace, capacity and performance load balancing, load sharing mirrors, and striped volumes for enhanced single volume and single file performance – sufficiently compelling that the overhead tends not to be an issue. One can *always* get greater performance out of a purely partitioned (shared nothing) architecture – at the cost of application complexity and decreased manageability.
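For readers who want to redo the arithmetic, the Python sketch below reproduces the per-node figure from the published result and shows how an overhead percentage is derived from a baseline. The function names are hypothetical, and the baseline in the last line is a round illustrative number, not a measured 7G result, since no per-node 7G figure is quoted in this exchange.

```python
def per_node_ops(total_ops, nodes):
    """Average ops/sec contributed by each node."""
    return total_ops / nodes


def overhead(clustered_ops, baseline_ops):
    """Fraction of the baseline throughput lost in the clustered configuration."""
    return 1 - clustered_ops / baseline_ops


# Published figures: 1,032,461 ops/sec across 24 nodes.
print(round(per_node_ops(1032461, 24)))  # prints 43019

# Illustrative only: a node delivering 64 ops against a baseline of 100.
print(f"{overhead(64, 100):.0%}")        # prints 36%
```

The same `overhead` function applies to any pair of per-node results, so the comparison can be rerun against whichever standalone figure one considers the right baseline.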

e. I don’t see random IO performance being tabulated in these tests. Do you have any stats on GX cluster performance operating with a random IO workload that you can share? Does random IO significantly reduce/increase the performance of the GX cluster?

SpecSFS is, in fact, more of a “random access” benchmark than a “sequential large read/write” benchmark. If you mean by your question random 4K reads/writes in very large files, then I agree, we haven’t posted anything that represents that (nor has any other vendor of which I’m aware). Customers for whom such workloads are important typically have their own benchmarks, or run appropriate iozone parameterizations.

8. Besides support for CIFS, it seems that a lot of other functionality that is provided in non-clustered ONTAP is absent in ONTAP GX. I have a lengthy list in hand. Moreover, I have been told that GX is not recommended in cases where customers have need of CIFS, block services (iSCSI or FC), databases or business applications. Is all of this correct? If so, what does this mean to the consumer – what is the target application environment for this product? How is it different, say, from a solution intended to enable scalable capacity with high availability in a conventional Windows and Unix business?

GX is a multi-release development at NetApp. The initial version, ONTAP GX 10.0, released earlier this year, is targeted at what I call “production-oriented technical computing applications”. Those applications are typically NFS-centric, with some CIFS access required (for data pre-processing, post-processing, and visualization).

In fact, we do currently support scalable CIFS on GX. However, there are some CIFS functions that “enterprise” customers need (quotas, GPOs, folder redirection and synchronization, integrated anti-virus, …) that are not currently supported in GX CIFS today. As many of those customers are currently very well served with our standard ONTAP 7G offering, we wanted to be sure to indicate clearly to them that 7G is the appropriate platform for them – for now. As some of those additional functions are brought online in future GX releases, we expect enterprise CIFS customers with scalability needs to move to ONTAP GX.

As for iSCSI and FC, that’s an easy one. Those protocols are not yet supported on GX; they will be in future releases. Because many customers equate “blocks” with “database applications”, and furthermore, because some commercial databases can already leverage block devices served from multiple servers/controllers (e.g. Oracle ASM) or multiple mounts (Oracle/NFS), we do not think that GX currently provides sufficient additional scalability/manageability benefits to those environments. Over time, and as blocks protocols come online, we expect the single system image, dynamic load balancing, nondisruptive migration, and other GX features to attract database applications as well.

9. Please highlight the availability guarantees of the GX. What architecturally is being done to protect data placed on this platform?

The data itself is protected using the same mechanisms that are in place on NetApp’s worldwide installations: i.e. WAFL, RAID-DP, and snapshots. We leverage the same active-active controller failover mechanism used on all of our clustered filer deployments. Those standard ONTAP resiliency features are augmented with VIF failover and a redundant/fault tolerant network configuration, as described above. Finally, we support explicit replication of data (volume mirroring) across nodes and NDMP backup as data protection features.

10. If you are still with me, and can comment, I’d like to know whether you believe these questions to be at all relevant to an informed consumer purchase of GX? If not, what should be moving consumers to your product offering versus those of competitors?

Still here … hope you are too. Yes, I think these questions are all relevant. They speak to the fundamental GX architecture and its ability to deliver scalable, high performance storage for a variety of workloads, and to do so in a manner that is highly resilient and operationally tractable.

I’d add that operational scalability is one of my favorite topics — not only from a storage perspective, but from an application perspective. I.e., how do your storage systems help streamline your operations – whether application provisioning, management, or monitoring. I think NetApp’s contributions in this space with ONTAP 7G are second to none. Flexible Volumes, Snapshot and FlexClone technology, and a tremendous variety of replication and data protection products (SnapMirror, SnapVault, and SnapLock and LockVault for retention/compliance requirements), coupled with recent additions in security of data-at-rest (Decru), and in VTL, provide a strong portfolio of capabilities that our customers rely on to simplify their operations.

ONTAP GX is a new “engine”, if you will, for delivering those same features on a scale-out platform. Quite simply, we believe “scale-out” is the way of the future, and that this approach allows us to build more capable storage systems more economically and more quickly than would be afforded otherwise (i.e. scale-up).

I have to say, as you can see from the length of this post, that Moxon went out of his way to illuminate aspects of the GX offering that had me and many readers confused. Does this response completely resolve all questions? Probably not. In my telephone call with Bruce this morning, I gave him a few more questions suggested by his response. Those questions, he assured me, will also be handled by him personally.

For the record, here are the new questions. I will post the responses when received.

1. The first question refers to the balancing of requests to multiple Nblades in Figure 1

a) If Nblade 1 is processor bound or traffic bound, how are requests rerouted to Nblade2?

b) If the Fibre Channel connection between Nblade and storage shelf is unavailable, how do Nblade 2’s requests get answered by the storage on Nblade 1?

c) If Nblade 1 is unavailable because of traffic load, how does Nblade 2 communicate to storage that is now unavailable to Nblade 1?

2. Load balancers work by redirecting requests to available servers based on processor load, number of requests, number of users, etc. Usually, the least busy server gets the next request, which goes to a non-routable server behind the load balancer.

a) I understand how the Nblades are virtualizing their addresses with this layer of virtualization called the VLDB. But if the VLDB gets corrupted, how does the requesting server recognize where to send its request and where its response comes from?

Since it seems that the reads and writes into a file system have to go to a distinct file location in order for the file structure to recognize where the parts of the file are, can a server find information without the VLDB to manage name space and file space?

b) How does a client read or write information from a Fibre Channel shelf that is no longer attached to a functioning Nblade? Can the information once on a VLDB system be unvirtualized or is information that is once captured to the VLDB a “prisoner for life?”
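For readers unfamiliar with the generic behavior that question 2 starts from, a “least busy server gets the next request” policy can be sketched in a few lines. This illustrates a conventional load balancer only. It says nothing about how GX itself distributes requests, and the `Server` class and function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    active_requests: int = 0


def pick_least_busy(servers: list) -> Server:
    """Return the server with the fewest in-flight requests (first one wins ties)."""
    return min(servers, key=lambda s: s.active_requests)


def dispatch(servers: list) -> str:
    """Send one request to the least busy server and return that server's name."""
    target = pick_least_busy(servers)
    target.active_requests += 1
    return target.name


if __name__ == "__main__":
    pool = [Server("node-a", 2), Server("node-b", 0)]
    print([dispatch(pool) for _ in range(4)])
```

A real balancer would also weigh processor load and user counts, as the question notes. This sketch tracks only the number of outstanding requests.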

I’m sure that Moxon has better things to do than to respond to all these questions, but he told me today that he championed the idea of making a full response because these are exactly the kinds of questions he expects salespersons to get from customers. He is officially on my Christmas Card list.

5 comments

Richard October 21, 2006 at 7:00 am

As per Dave’s comment, the critical issue is “of course, the more nodes, the more bandwidth you need between them”, i.e. the bandwidth and the associated latency over the switch.

The other issue is the caching of “writes”, i.e. how does the cluster maintain write coherency across all of the D-blades, each with its own cache and private back end?

Is there a need to replicate writes to all of the D-blades? How does the locking work? Or is this always a write-through system?

John October 23, 2006 at 12:17 pm

Jon,

It’s nice to see NetApp being open about its GX architecture. This being said, the architectural deficiencies were clearly spun in a positive light with marketing hype.

To get to the bottom line of the hop issue, I believe NetApp should answer one fundamental question that will be difficult to spin:

How long does it take to read a large file if it is stored on one node, and then how long does it take to read the same file if it is striped across 24 nodes?

Based on my understanding of the architecture, the time it takes to read the large file from 24 nodes will be longer than reading it from one node.

Reasoning:

One Node: Request goes to N-Blade, entire file contents are read from its D-blade and returned to the client that requested it.

24 Nodes: Request goes to an N-Blade. An in-memory lookup is done to determine which nodes have the data for the requested file. The requests are “routed” (through the cluster fabric switch) to 24 D-Blades. The N-Blade waits for 24 D-Blades (including its own) to respond with its portion of the file, and then the N-blade aggregates the file chunks together and returns it to the client. This clearly takes longer than reading the file from a single node.

Another way to get at this issue is to have NetApp publish the latency of a read operation for a file on a single node, and the latency of the same read operation when the file spans multiple nodes.

Bottom line: Performance can decrease for certain workloads as nodes are added to the cluster.
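The reasoning above can be turned into a toy model to see which terms dominate. This is a sketch under stated assumptions, not a measurement and not NetApp's or anyone else's published figures: every parameter value is hypothetical, the D-blades are assumed to read their chunks fully in parallel, and delivery to the client is assumed to overlap with the back-end reads, so the slower of the two sets the pace.

```python
def single_node_read_s(file_bytes: float, disk_bw: float, client_link_bw: float) -> float:
    """Seconds to read a file held entirely on one node.

    The read is limited by the slower of the node's disk bandwidth and
    the client's network link (both in bytes/second).
    """
    return max(file_bytes / disk_bw, file_bytes / client_link_bw)


def striped_read_s(file_bytes: float, n_nodes: int, disk_bw: float,
                   client_link_bw: float, hop_latency_s: float) -> float:
    """Seconds to read the same file striped evenly across n_nodes.

    ASSUMPTIONS (hypothetical): one extra cluster-fabric hop of fixed
    latency, chunks fetched in parallel, and all data still funnelled
    through a single N-blade and client link.
    """
    backend = file_bytes / (n_nodes * disk_bw)    # parallel D-blade reads
    delivery = file_bytes / client_link_bw        # one client connection
    return hop_latency_s + max(backend, delivery)


if __name__ == "__main__":
    DISK_BW = 200e6      # hypothetical 200 MB/s per node
    LINK_BW = 1e9        # hypothetical 1 GB/s client link
    HOP = 200e-6         # hypothetical 200 microsecond fabric hop
    for size in (4096, 10e9):
        one = single_node_read_s(size, DISK_BW, LINK_BW)
        many = striped_read_s(size, 24, DISK_BW, LINK_BW, HOP)
        print(f"{size:>14,.0f} bytes: 1 node {one:.6f}s, 24 nodes {many:.6f}s")
```

With these made-up numbers, the extra hop makes a tiny read slower when striped, while a large read finishes sooner because the client link, not a single node's disks, becomes the limit. Different parameters can tip the result either way, which is why the published latency figures requested above would be more informative than any model.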

The following summary of GX’s deficiencies, quoted directly from the response, is somewhat startling and leaves me with the impression that GX is more of a prototype than a commercially viable product:

“ONTAP GX does require the management of aggregates and volumes that are spread across the individual controllers. That capability is provided by our current ONTAP GX management suite.”

“As for iSCSI and FC, that’s an easy one. Those protocols are not yet supported on GX”

“However, there are some CIFS functions that “enterprise” customers need (quotas, GPOs, folder redirection and synchronization, integrated anti-virus, …) that are not currently supported in GX CIFS today.”

“These latter two activities are currently manually initiated operations; they will be part of an automated, policy-driven monitoring and management system in the near future.”

“If you mean by your question random 4K reads/writes in very large files, then I agree, we haven’t posted anything that represents that (nor has any other vendor of which I’m aware). Customers for which such workloads are important typically have their own benchmarks, or run appropriate iozone parameterizations.”

“Over time, additional functionality, such as FlexClone, SnapVault, and interoperable SnapMirror (with non-GX NetApp systems) will also be supported.”

John

GridGuy (Bruce Moxon) October 25, 2006 at 1:53 pm

I’ve posted a more detailed response on my blog at

http://gridguy.net/?p=16

But in summary …

(1) I think you fundamentally miss the point of how technical computing applications use FILESYSTEMS (not just block devices) to provide AGGREGATE performance to many clients (hosts). See my blog entry for more details.

(2) As mentioned, ONTAP GX is initially focused on providing enhanced scalability and performance for technical computing applications. These include applications such as digital animation, bioinformatics, electronic design automation, seismic processing, etc. All of these applications are driven by scalable FILESYSTEM requirements – which ONTAP GX addresses, and which are not addressed directly by blocks-based storage (FC or iSCSI).

NetApp’s line of enterprise products, running ONTAP 7G in a scale-up (redundant multiprocessor controllers) architecture, already supports individual system configurations up to 500 TB with benchmark-validated performance (SpecSFS, TPC-C, etc.). Additionally, most database and mail applications already support multi-LUN (and multi-host) configurations (e.g. Oracle ASM). NetApp’s integrated data management suite already provides the tools to provision and manage multiple storage systems (“filers”) in support of a single large application with very high performance block storage requirements.

Finally, it’s important to understand that ONTAP GX is the FOUNDATION for MUCH larger multi-protocol storage systems – 10s of PBs and 10s of GB/s and beyond – leveraging the same hardware components, on-disk data format, and industry-leading suite of data management tools that have been successfully deployed in ONTAP 7G systems in the enterprise.

b

GridGuy (Bruce Moxon) October 25, 2006 at 6:26 pm

Richard,

In response to your question about caching …

Caching in the GX architecture is done at the D-blade, and directly leverages the caching approach common to all NetApp filers. The D-blade cache is co-resident with the disk that it caches. I.e. it is a PARTITIONED architecture where volumes (or volume segments for striped volumes) are “owned” by one and only one D-blade – and that D-blade caches only the data on disks that it “owns”.

During writes, the write hits the D-blade, is written to memory and NVRAM, and then acknowledged back to the client through the N-blade. This is the same way writes work on standard filers, except that here, the ACK is sent back through the N-blade. Once acknowledged, writes are NVRAM-backed until they are physically written to disk, ensuring that all ACK’ed writes get to disk when the node comes back up. (Note that, if the storage node were to die after the D-blade write, but before the N-blade could ACK the write to the client, it is no different than a client dying or being disconnected from a standard filer after the write, but before the ACK was received).

As for concurrent access, any client reading the same data block just written by another client can safely be given the in-memory (cached) copy of that block by the D-blade, as the NVRAM guarantees that block will be written to disk.
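The write path and ownership rule described above can be modeled in a few lines. This is a simplified sketch for illustration, not ONTAP code: the class and method names are invented, plain dictionaries stand in for memory, NVRAM and disk, and a plain dictionary of volume owners stands in for whatever lookup the cluster actually uses.

```python
class DBlade:
    """Owns a set of volumes and caches only blocks on its own disks."""

    def __init__(self, name: str):
        self.name = name
        self.cache = {}   # volatile in-memory copies
        self.nvram = {}   # acknowledged writes not yet on disk
        self.disk = {}    # persistent storage

    def write(self, key, data) -> str:
        # Write lands in memory and NVRAM, and only then is it acknowledged.
        self.cache[key] = data
        self.nvram[key] = data
        return "ACK"

    def read(self, key):
        # A block just written can be served from cache before it reaches disk.
        if key not in self.cache:
            self.cache[key] = self.disk[key]
        return self.cache[key]

    def flush(self) -> None:
        # Move acknowledged writes from NVRAM to disk.
        self.disk.update(self.nvram)
        self.nvram.clear()

    def crash_and_recover(self) -> None:
        # Memory is lost; NVRAM survives and is replayed to disk on restart.
        self.cache.clear()
        self.flush()


class NBlade:
    """Routes each request to the single D-blade that owns the volume."""

    def __init__(self, owners: dict):
        self.owners = owners   # volume name -> owning DBlade (exactly one)

    def write(self, volume: str, block: int, data) -> str:
        return self.owners[volume].write((volume, block), data)   # ACK relayed to client

    def read(self, volume: str, block: int):
        return self.owners[volume].read((volume, block))


if __name__ == "__main__":
    d1, d2 = DBlade("d1"), DBlade("d2")
    n = NBlade({"vol_a": d1, "vol_b": d2})
    print(n.write("vol_a", 0, b"hello"))   # ACK
    print(n.read("vol_a", 0))              # served from d1's cache
    d1.crash_and_recover()
    print(n.read("vol_a", 0))              # now served from disk
```

Because each volume has exactly one owner in this model, there is only ever one cached copy of a block, so no cross-node cache coherency traffic is needed.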

Administrator November 6, 2006 at 1:43 pm

Bruce, I finally got around to reading the more extensive response to the issue posed by John above posted on your blog. However, in fairness, I can’t see where you answered the direct question that was put by the reader:

“How long does it take to read a large file if it is stored on one node, and then how long does it take to read the same file if it is striped across 24 nodes?

Based on my understanding of the architecture, the time it takes to read the large file from 24 nodes will be longer than reading it from one node.”

I think I understand the distinction you are drawing between clustering at the hardware level to support block ops and clustering at the file system level, but I am failing to see your direct response to the man’s question anywhere in your explanation. Forgive me if I am being a Flashing 12, but if you could answer the question directly, it would really help me out.

Also, I never got your responses to the follow-up questions posed above:

1. The first question refers to the balancing of requests to multiple Nblades in Figure 1

a) If Nblade 1 is processor bound or traffic bound, how are requests rerouted to Nblade2?

b) If the Fibre Channel connection between Nblade and storage shelf is unavailable, how do Nblade 2’s requests get answered by the storage on Nblade 1?

c) If Nblade 1 is unavailable because of traffic load, how does Nblade 2 communicate to storage that is now unavailable to Nblade 1?

2. Load balancers work by redirecting requests to available servers based on processor load, number of requests, number of users, etc. Usually, the least busy server gets the next request, which goes to a non-routable server behind the load balancer.

a) I understand how the Nblades are virtualizing their addresses with this layer of virtualization called the VLDB. But if the VLDB gets corrupted, how does the requesting server recognize where to send its request and where its response comes from?

Since it seems that the reads and writes into a file system have to go to a distinct file location in order for the file structure to recognize where the parts of the file are, can a server find information without the VLDB to manage name space and file space?

b) How does a client read or write information from a Fibre Channel shelf that is no longer attached to a functioning Nblade? Can the information once on a VLDB system be unvirtualized or is information that is once captured to the VLDB a “prisoner for life?”

I don’t want to be a pest and I am sure you have more important things to do, but I would really like to get your answers to these last few questions.

Thanks in advance.

Jon
