Just when you thought things were wrapping up on the IP Storage Working Group reflector at IETF (where iSCSI was hatched), a cat fight broke out while I was on the road. Subject: Fibre Channel over Ethernet, being developed over at ANSI T-11. Here is the thread, which begins with this note from a guy who I regard as a king in the storage protocol world, Julian Satran at IBM:
The trade press is lately full of comments about the latest and greatest reincarnation of Fibre Channel over Ethernet. It made me try to summarize all the long and hot debates that preceded the advent of iSCSI.
Although FCoE proponents make it look like no debate preceded iSCSI, that was not so – FCoE was considered even then and was dropped as a dumb idea.
Here is a summary (as far as I can remember) of the main arguments. They are not bad arguments even in retrospect, and technically FCoE doesn’t look better than it did then.
Feel free to use this material in any form. I expect this group to seriously expand my arguments and make them public – in personal or collective form.
And do not forget – it is a technical dispute – although we all must have some doubts about the way it is pursued.
What a piece of nostalgia 🙂
Around 1997, when a team at IBM Research (Haifa and Almaden) started looking at connecting storage to servers using the “regular network” (the ubiquitous LAN), we considered many alternatives (another team even had a look at ATM – still a computer network candidate at the time). I won’t take you through all of our rationale (and we went over some of it again at the end of 1999 with a team from Cisco before we convened the first IETF BOF in 2000 at Adelaide that resulted in iSCSI and all the rest), but the reasons we chose to drop Fibre Channel over raw Ethernet were multiple:
Fibre Channel Protocol (SCSI over the Fibre Channel link) is “mildly” effective because:
- it implements endpoints in a dedicated engine (Offload)
- it has no transport layer (recovery is done at the application layer under the assumption that the error rate will be very low)
- the network is limited in physical span and logical span (number of switches)
- flow-control/congestion control is achieved with a mechanism adequate for a limited-span network (credits; see the sketch after this list).
- The packet loss rate is almost nil and that allows FCP to avoid using a transport (end-to-end) layer
- FCP switches are simple (addresses are local and the memory requirements can be limited through the credit mechanism)
- FCP endpoints are inherently costlier than simple NICs – the cost argument (initiators are more expensive)
- The credit mechanism is highly unstable for larger networks (check switch vendors’ planning docs for the network diameter limits) – the scaling argument
- The assumption of low losses due to errors might radically change when moving from 1 to 10 Gb/s – the scaling argument
- Ethernet has no credit mechanism and any mechanism with a similar effect increases the end point cost.
- Building a transport layer in the protocol stack has always been the preferred choice of the networking community – the community argument
- The “performance penalty” of a complete protocol stack has always been overstated (and overrated). Advances in protocol stack implementation and finer tuning of the congestion control mechanisms make conventional TCP/IP perform well even at 10 Gb/s and beyond.
- Moreover, the multicore processors that have become dominant on the computing scene have enough compute cycles available to make any “offloading” possible as a mere code-restructuring exercise (see the stack reports from Intel, IBM etc.)
- Building on a complete stack makes available a wealth of operational and management mechanisms built over the years by the networking community (routing, provisioning, security, service location etc.) – the community argument
- Higher level storage access over an IP network is widely available and having both block and file served over the same connection with the same support and management structure is compelling – the community argument
- Highly efficient networks are easy to build over IP with optimal (shortest-path) routing, while Layer 2 networks use bridging and are limited by the logical tree structure that bridges must follow. The effort to combine routers and bridges (rbridges) promises to change that, but it will take some time to finalize (and we don’t know exactly how it will operate). Until then the scale of Layer 2 networks is going to be seriously limited – the scaling argument
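To make the credit argument concrete, here is a minimal Python sketch of credit-based (buffer-to-buffer) flow control. The numbers are made up, and real FC negotiates credits at fabric login, but the point is that a sender may only put a frame on the wire while it holds a credit, and credits come back only as the receiver drains its buffers, so frames are never dropped for lack of buffer space and no end-to-end transport layer is needed.

```python
# Minimal sketch of FC-style buffer-to-buffer credit flow control.
# All numbers are illustrative; real FC negotiates credits at login.

class CreditLink:
    def __init__(self, bb_credits):
        self.credits = bb_credits        # credits currently held by the sender
        self.rx_queue = []               # frames sitting in receiver buffers

    def send(self, frame):
        if self.credits == 0:
            return False                 # sender must wait -- no frame is ever dropped
        self.credits -= 1                # one credit is consumed per frame on the wire
        self.rx_queue.append(frame)
        return True

    def receiver_drain(self):
        if self.rx_queue:
            self.rx_queue.pop(0)         # receiver frees a buffer...
            self.credits += 1            # ...and returns a credit (R_RDY)


link = CreditLink(bb_credits=4)
sent = sum(link.send(f"frame-{i}") for i in range(10))
print(f"frames accepted before stalling: {sent}")   # 4 -- the rest must wait for credits

link.receiver_drain()                    # one buffer drains, one credit comes back
print(link.send("frame-retry"))          # True -- transmission resumes, nothing was dropped
```

Ethernet, by contrast, has no such per-hop credit loop; anything with an equivalent effect (as the list above notes) has to be added at the endpoints, at extra cost.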
As a side argument – a performance comparison made in 1998 showed SCSI over TCP (a predecessor of the later iSCSI) to perform better than FCP at 1 Gb/s for block sizes typical of OLTP (4-8KB). That was what convinced us to take the path that led to iSCSI – and we used plain-vanilla x86 servers with plain-vanilla NICs and Linux (with similar measurements conducted on Windows).
The networking and storage community acknowledged those arguments and developed iSCSI and the companion protocols for service discovery, boot etc.
The community also acknowledged the need to support existing infrastructure and extend it in a reasonable fashion, and developed two protocols: iFCP (to let hosts with FCP drivers and IP connections reach storage through a simple conversion from FCP to TCP packets) and FCIP (to extend the reach of FCP through IP by connecting FCP islands over TCP links). Both have been implemented and their foundation is solid.
The current attempt at developing a “new-age” FCP over an Ethernet link goes against most of the arguments that gave us iSCSI etc.
It ignores networking layering practice, builds an application protocol directly above a link layer and thus limits scaling, mandates elements at the link layer and application layer that make applications more expensive, and leaves aside the whole “ecosystem” that accompanies TCP/IP (and not Ethernet).
In a related effort (and at one point also while developing iSCSI) we also considered moving away from SCSI (as some non-standardized, but popular in some circles, software did – e.g., NBP) but decided against it.
SCSI is a mature and well understood access architecture for block storage and is implemented by many device vendors. Moving away from it would not have been justified at the time.
Nice recap of the thinking behind iSCSI. It prompted this response from Zack Best…
The real debate here is between two types of networks. The first is reliable at the link level and does not drop packets under congestion. The second is running a reliable transport protocol (i.e. TCP) over an unreliable link level network.
I agree with the scaling argument. For sufficiently large networks, a reliable link level doesn’t work well because network component failures or chronically congested links are not handled well. For sufficiently small networks, a reliable link level has some significant advantages in simplicity, low hardware cost, performance, and worst-case latency.
My personal view is that the vast majority of enterprise storage networks fall in the “sufficiently small” category. This view has to some extent been vindicated by the continuing success of Fibre Channel in this space and the inability of iSCSI to displace FC in any significant way for enterprise storage. Of course, this may or may not change in the future.
Whether FC is simpler than iSCSI depends largely on your definition of simplicity. If one defines simplicity/complexity as the number of gates or lines of code to reduce the protocol to hardware or firmware, then my experience is that iSCSI is 2X to 3X the complexity of FC. This has implications in cost and reliability.
Particularly problematic with iSCSI is the unpredictability of its performance. Performance is great with no packet drops. However, even a small amount of congestion can cause a sudden, large drop in performance. This can be difficult to predict, as a network that is almost but not quite congested can run great, while a small incremental change of any sort can cause performance to become suddenly unacceptable.
For FC, or other protocol using link level flow control, the reduction in performance is much more graceful and incremental when the level of congestion is small and intermittent.
A second major problem with iSCSI is the unbounded nature of worst-case latency. When a storage network fails, it is desirable to detect the failure in a fraction of a second and transition to a backup network. TCP, when implemented to the standards, can take many seconds or minutes to determine that a network has failed and close the connection. RFC 2988, for instance, requires that the minimum retransmission timeout be one second. This means a single dropped packet may add one second to the latency of outstanding commands, which is a huge amount of time on a 10G link. No doubt this could be mitigated by drastically reducing the timeouts within TCP, but the market seems to be surprisingly resistant to tampering with accepted standards here.
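As a rough illustration of that one-second floor, here is a sketch of the RFC 2988 retransmission-timer calculation. The smoothing constants and the one-second minimum are taken from the RFC; the RTT samples are made up. Even with a round-trip time of a tenth of a millisecond, typical of a SAN, the computed RTO is clamped to a full second, so a single lost packet can stall outstanding commands for that long.

```python
# Sketch of the RFC 2988 retransmission timer, showing why the 1-second
# floor dominates on a low-latency 10G storage link. RTT samples are illustrative.

ALPHA, BETA, K, G = 1/8, 1/4, 4, 0.001   # smoothing gains, variance multiplier, clock granularity (s)
MIN_RTO = 1.0                             # RFC 2988: computed RTO SHOULD be rounded up to 1 second

def rto_after_samples(rtt_samples):
    srtt, rttvar = None, None
    for r in rtt_samples:
        if srtt is None:                  # first measurement seeds the estimators
            srtt, rttvar = r, r / 2
        else:                             # subsequent measurements are smoothed in
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
            srtt = (1 - ALPHA) * srtt + ALPHA * r
    return max(MIN_RTO, srtt + max(G, K * rttvar))

# A SAN-like RTT of ~100 microseconds still yields a one-second timeout:
print(rto_after_samples([0.0001] * 20))   # -> 1.0
```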
Overall, the FC and FCP protocols have a lot in common with the Intel x86 instruction set architecture. They are overly complex, and rather poorly designed by modern standards. But they are good enough, there is a huge amount of value-add that has been built on top of them, and therefore there is little incentive to change.
FCoE is an interesting idea because it preserves 90% of the existing value add of FC, unifies the physical link with Ethernet, and uses the reliable link method of packet delivery.
There are two significant possibilities for iSCSI to displace FC (or FCoE) in enterprise storage networks. First is if the networks start to scale to large enough size that FC can’t be made sufficiently reliable, and second if CPU compute cycles become sufficiently cheap that the iSCSI protocol can be run in host software with no negative performance impact.
Barring either of these, it seems that iSCSI will have an uphill battle, and FCoE may have a place.
Excellent comments. My take (if not obvious from the previous text) is that data centers will be very large, and that compute power (as evidenced by multicore) and advances in stack implementation are bound to substantially improve the performance of protocol stacks (see Intel and our work) and Layer 3 switching.
It is also important to point out that Ethernet has substantial latencies if only bridging is used, and replacement technologies (such as Rbridges or others) may take some time to appear.
NAB from linux-iscsi.org responds:
A quick comment regarding the large abundance of computing resources available for initiator-side software IP storage services. Also, Julo, many thanks for posting this great thread. 🙂
As the progress of the DDP TWG continues and 2nd-generation hardware iWARP engines start to come online, the benefit of a hybrid software implementation (with host OS network stack modifications in kernel, above TCP and SCTP) starts to pose a question.
What real savings can hybrid iSER nodes achieve using software DDP? And what changes are required to make high-performance software DDP a reality?
As osc-iwarp has found out, there is significant CPU overhead associated with sockets and software VERBS, but I think this can be minimized with the right set of changes. Those changes mean moving away from receive-side sockets for a software iSER mode. They will start to become attractive for new product designs, as this will allow RNIC hardware engines to scale further using a more sane method, or a less painful one (depending on who you ask; OFA uses hybrid IB-VERBs), than traditional TOEs with specialty engines. Really taking advantage of what the DDP and iWARP metadata tells us about the framed network transport can help in RDMA WRITE scenarios, because the software RNIC would already have STagged memory ready to go in the iSER case.
Especially when it comes to the API for the iSER stack, this means having a single codebase with vendors writing hardware drivers instead of re-inventing the wheel with sockets. I believe the smart software RNICs of the future will direct RDMA traffic directly into host OS SCSI memory buffers and, like today, use something similar to sendpage() for TX.
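To illustrate the direct-data-placement idea NAB is describing, here is a purely conceptual sketch; the names and structures below are invented for the example and are not the iWARP/verbs API. The target pre-registers a SCSI buffer and advertises a steering tag (STag) for it, and an incoming RDMA WRITE is placed straight into that buffer at the tagged offset, rather than being pulled out of a socket receive queue and copied.

```python
# Conceptual sketch of tagged direct data placement (NOT the real verbs/iWARP API):
# the target advertises an STag for a pre-registered SCSI buffer, and incoming
# RDMA WRITE segments land directly in that buffer at the tagged offset,
# bypassing any intermediate socket receive queue.

registered_buffers = {}                        # STag -> pre-registered buffer

def register_scsi_buffer(stag, length):
    registered_buffers[stag] = bytearray(length)
    return stag                                # advertised to the initiator in iSER headers

def place_rdma_write(stag, offset, payload):
    buf = registered_buffers[stag]             # placement is driven by metadata in the frame,
    buf[offset:offset + len(payload)] = payload   # so there is no per-byte copy out of socket buffers

stag = register_scsi_buffer(0x1234, 8192)
place_rdma_write(stag, 0, b"block data from the wire")
print(bytes(registered_buffers[stag][:24]))
```

The saving, in this picture, is the elimination of the receive-side copy; whether a software RNIC can realize that cheaply is exactly the open question in the post above.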
Multi-core microprocessor designs with large, intelligent shared caches, and with CPU cache-coherency and I/O interconnects that in the ’90s were only available in the Alpha EV67 and the highest of high-end shared-memory supercomputers and clusters, are now starting to become the norm.
Pushing software iSER to the next level and beyond is surely not going to happen with a 30-year-old API (sockets). Also, for the data center story with a traditional tiered SAN architecture and a software case, the hybrid iWARP software stack on the initiator will not get a whole lot of interest until it can show improved performance and overhead that is acceptable relative to traditional iSCSI today. For 3rd-generation IP storage stacks, typical multi-port 1G workloads are what will really drive interest in areas where putting in a hardware RNIC will not be cost-feasible for some time.
But just as with traditional iSCSI, we can also scale software iSER down towards platforms with more modest computing resources on low-power, wireless devices. Even in the type of mobile devices that IP storage services have been prototyped on today, the ability to scale server-side hardware RNICs more efficiently is not software iSER’s only benefit. On a side note, I think about the transparency that connection recovery in traditional iSCSI and iSER brings to inter-nexus multiplexing, as well as end-user requirements for configuration and management scenarios. Using an active-active recovery mechanism that is as close to completely transparent as possible (which ERL=2 is, IMHO) is, I think, what mobile IP storage services users need to be demanding from their transports.
Thanks for listening!
Great comments. You are all certainly aware that sockets are also undergoing transformation (asynchronous sockets), but even with synchronous sockets, and with some care not to break existing applications that use them, a restructuring of the stack may enable (as shown by the Intel and IBM-Haifa work) great increases in performance.
Software RDMA for the new class of multicore engines is definitely an interesting proposition (on highly multithreaded engines it should come with no cost associated with it – or almost no cost).
I wish I knew more about the decrease in latencies in the switch fabric (it would be interesting if somebody could comment) as large Layer-2 fabrics have some inherent latency issues.
FCoE is asking us to forget all this, go back and pay the hardware price for several more years, and ignore IP-land – and nothing that I heard has convinced me that we should do so.
Then, comes this post from Silvano Gai, who does not seem at all happy that this discussion is taking place at IPS rather than at ANSI T-11 “where it belongs” —
Quoting: FCoE is asking us to forget all this, go back and pay the hardware price for several more years, and ignore IP-land – and nothing that I heard has convinced me that we should do so.
FCoE is not asking you (the ips WG) anything.
FCoE is a proposed item for the FC-BB-5 WG of T11. If you have concern that T11 is making a mistake, I suggest you move this discussion to the T11.3 reflector.
The FC-BB WG will meet the first time to discuss FCoE in Bloomington, MN Wednesday June 6th, 2007.
IMHO, it is a bit premature to discuss the limitations of a technology that is not yet public or defined.
FC originated at IBM, as one of two practical responses to the big fat blue parallel SCSI cable everyone kept tripping over when they walked behind the rack. Together with SSA, IBM developed Fibre Channel as a serialization of SCSI to enable the wire to become thinner. The guys involved were interviewed for my last book on storage and told me, point blank, that they had never intended FC to become a network protocol or to be used to interconnect servers and storage in any sort of network. That FC SANs are called “storage area networks” at all is an oxymoron.
Zack Best is right, but for the wrong reasons. He noted that “The real debate here is between two types of networks.” Very true. Neither FC SANs nor iSCSI SANs are technically networks at all. From what I can glean (and I realize I will be flamed for this), FC is a channel protocol, not a network protocol. iSCSI is an application that happens to use a TCP/IP network, but is not in truth its own network. It is simply another application running over TCP/IP. Am I wrong?
As for NAB’s comments, I believe that major improvements in iSCSI performance will be made with the addition of iSER/iWARP acceleration — delivering performance that far outstrips that of FC. How soon in the current economy? Who knows.
As for Gai’s remarks, I find myself remembering that often repeated quote made by Bender the Robot from the soon-to-be-renewed TV series, Futurama: “Bite my shiny metal ass.”
The T-11 committee at ANSI, that marvelously vendor manipulated standards body, is what gave us the crappy FC standards we have today — standards that can be implemented to the letter by different vendors in their switches with absolute certainty that competitor switches will not work together.
When is a standard not a standard? When it doesn’t enable interoperability between standards-based products, for one thing. If FC standards are so much more mature and have such a broad ecosystem of cooperating vendors, tell me why there are still interoperability plug fests for FC at the University of New Hampshire — a full decade after FC “standards” were first released by ANSI T-11? Shouldn’t we at least be beyond the point of plugging stuff together and crossing our fingers that everyone’s blinky lights will come on?
The only argument I have heard for continuing any development in the FC protocols at all, whether that development is done at T-11 or at IETF, is to provide the means to wean all the crack addicts in the Global 2000 off of FC fabrics altogether, and as soon as humanly possible. FC SAN is simply the most expensive way to host data that was ever invented. No surprise that it came to market at a time when everyone was suspending disbelief and investing in dotcoms.
Who needs FC anyway? I mean, think about it. Windows apps don’t need it. Apple apps don’t need it. (Despite products like My First SAN and SAN in a Box, SMBs don’t need it.) So does that make it an enterprise play? I wonder. Most Oracle and SAP apps don’t need it. In fact, if you show me applications that really do need the capture feeds and speeds of FC, I will likely suggest a suitable and intelligent replacement: the mainframe. At least in a mainframe world, you have management — you aren’t exposed to the machinations of mavens of overpriced storage arrays with their deliberate and damnable efforts to obfuscate common management and to lock consumers into a terrible downward spiral of cost.
There, I said it. I feel much better now.