SGI Scales Up HANA On UV NUMA Systems
In-memory processing is the hot new thing in the enterprise datacenter and SGI, with decades of experience building shared memory supercomputers, wants to leverage that expertise and catch the in-memory wave. For the past two years, the company has been designing its next-generation NUMAlink interconnect and the UV Gen3 systems that will make use of it, and is now previewing the "HANA Box" implementation of that future machine, tuned up specifically to create scale-up versions of SAP's HANA in-memory database and application platform.
SGI first revealed its plans to create a HANA-specific variant of its "UltraViolet" UV 2000 systems back in January and Bob Braham, chief marketing officer at the company, told EnterpriseTech that expanding into in-memory transaction processing and analytics on the UV 2000s would increase the total addressable market for the systems by at least an order of magnitude.
The key differentiation that SGI wants to bring to bear on SAP HANA workloads in particular is the high bandwidth and low latency of the NUMAlink interconnect, which is used to glue either Xeon E5 or Xeon E7 processors from Intel into a shared memory system spanning many terabytes of main memory. With the NUMAlink 5 interconnect used in the UV 1000 (Gen1) systems back in 2010, the interconnect linked into two-socket blade servers based on Intel's Xeon 7500 and E7 processors, implementing an 8x8 (paired node) 2D torus that delivered up to 16 TB of shared memory across a total of 256 sockets. With the UV 2000 (Gen2) machines, SGI switched to lower-cost Xeon E5-4600 processors from Intel across 256 sockets, delivering up to 2,048 cores, 4,096 threads, and 64 TB of main memory across four racks using the NUMAlink 6 interconnect. The nodes were also based on a single-socket design; two of these snap together using the NUMAlink 6 hub chip. NUMAlink 6 had about 2.5 times the bandwidth of the NUMAlink 5 interconnect. In theory, the system could have held 128 TB of shared memory when it launched in June 2012, but the Xeon E5 and E7 processors top out at 46 bits of physical memory addressing, which works out to only 64 TB. Until Intel expands this by a few more bits – perhaps during the upcoming "Haswell" Xeon generations due later this year – 64 TB is the largest memory any Xeon chip can address. SGI is already at the maximum on that front.
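The 64 TB ceiling follows directly from the address width, and the arithmetic can be sketched in a few lines of Python. This is an illustrative calculation, not anything SGI or Intel published: the function name is made up for the example, and it assumes "TB" means binary terabytes (2^40 bytes). The 47-bit case is shown only because one extra address bit is what a 128 TB machine would need.

```python
# Illustrative arithmetic only; assumes 1 TB = 2**40 bytes.
BYTES_PER_TB = 2 ** 40

def max_addressable_tb(address_bits: int) -> int:
    """Largest memory, in TB, reachable with the given physical address width."""
    return 2 ** address_bits // BYTES_PER_TB

# 46 bits is the Xeon E5/E7 limit cited in the article.
for bits in (46, 47):
    print(f"{bits}-bit physical addressing: {max_addressable_tb(bits)} TB")

# Per-socket figures for the full UV 2000 configuration quoted above.
sockets = 256
memory_gb_per_socket = 64 * 1024 // sockets   # 64 TB spread over 256 sockets
cores_per_socket = 2048 // sockets            # 2,048 cores over 256 sockets
print(f"{memory_gb_per_socket} GB and {cores_per_socket} cores per socket")
```

At the quoted maximums, each socket in a full UV 2000 carries 256 GB of memory and eight cores.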
But for in-memory processing, the speed of the memory access is as important as the amount you can cram into a box. And that is why, as Eng Lim Goh, SGI's chief technology officer, explained to EnterpriseTech back in March, the HANA box will be the first machine to deploy the NUMAlink 7 interconnect. The idea, Goh said, is to get more uniformity between accesses to local memory, which is attached to the CPU where a given piece of work is being processed, and accesses to remote memory on the other side of the system.
SGI is not going to reveal all of the feeds and speeds of the NUMAlink 7 interconnect but Bill Dunmire, senior director of product marketing at SGI, said that the new interconnect would have about 20 percent lower latency than its predecessor. Here is the scale of memory access speeds, just to give you a feel for it.
To pull data from memory attached to a Xeon processor socket takes about 100 nanoseconds, more or less, says Dunmire. In a NUMA shared memory system with two or four sockets, it takes less than 200 nanoseconds for a CPU to access memory on a remote processor on the motherboard through QuickPath Interconnect (QPI) links. With NUMAlink 7, any memory in the complex can be accessed in under 500 nanoseconds, and in a single hop over the interconnect, thanks to the way the topology is designed. If you have a 10 Gb/sec switched network linking nodes in a cluster, the latency across the hardware and software stack is somewhere between 2 and 5 microseconds, and it only gets worse with traffic loading on the network as the cluster gets busy. (NUMAlink 7 will be capable of pushing 14 GT/sec, just for another tidbit that SGI dropped about the future interconnect.)
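To put those figures on a common scale, here is a small sketch that expresses each tier as a multiple of local memory latency. The numbers are the ones Dunmire quoted; the QPI and NUMAlink 7 values are upper bounds rather than measurements, and the tier labels and function name are invented for the example.

```python
# Latencies in nanoseconds, as quoted in the article. QPI and NUMAlink 7
# are upper bounds; the cluster figure is the rough 2 to 5 microsecond range.
LOCAL_NS = 100

tiers = [
    ("local memory on the socket", 100, 100),
    ("remote socket over QPI (upper bound)", 200, 200),
    ("any memory over NUMAlink 7 (upper bound)", 500, 500),
    ("10 Gb/sec switched cluster", 2_000, 5_000),
]

def slowdown(latency_ns: float, baseline_ns: float = LOCAL_NS) -> float:
    """How many times slower than a local memory access."""
    return latency_ns / baseline_ns

for name, low_ns, high_ns in tiers:
    if low_ns == high_ns:
        print(f"{name}: {low_ns} ns ({slowdown(low_ns):.0f}x local)")
    else:
        print(f"{name}: {low_ns}-{high_ns} ns "
              f"({slowdown(low_ns):.0f}x-{slowdown(high_ns):.0f}x local)")
```

On these numbers, the worst-case NUMAlink 7 access is about five times slower than local memory, while an unloaded 10 Gb/sec cluster is 20 to 50 times slower.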
While the current UV 2000 with NUMAlink 6 interconnect and the future UV line (presumably to be called the UV 3000) with the NUMAlink 7 interconnect are aimed at computationally intense workloads, the SAP HANA appliance SGI is creating, which will be sold as the UV 300H, is aimed at memory-intensive applications and is tuned as such.
The UV 300H will initially be delivered with eight sockets and 6 TB of main memory. The blades in the system have four sockets each, and the Xeon E7 v2 was chosen this time as the processor for the machine because it has 24 memory slots per socket, twice as many as the Xeon E5-2600 v2 (for two-socket machines) and the Xeon E5-4600 v2 (for four-socket machines). To get the maximum mix of clock speed, L3 cache memory, and QPI speeds, SGI is employing the Xeon E7-8890 v2 in the UV 300H system. This chip has fifteen cores running at 2.8 GHz, plus 37.5 MB of L3 cache and QPI links that run at 8 GT/sec. (These are the same processors that Hewlett-Packard is using in its "Kraken" HANA-tuned ProLiant Superdome machine, announced this week at SAP's SAPPHIRE event in Orlando.) The UV 300H appliance for HANA will be augmented with a pair of NetApp E2700 RAID disk arrays, which will store HANA data and log files persistently to protect against power failures.
Braham tells EnterpriseTech that the initial eight-socket version of the UV 300H will ship in October and that the company has lined up a bunch of beta customers to put it through its paces. Two fatter configurations – one with 16 sockets and 12 TB of shared memory and another that doubles it up again to 32 sockets and 24 TB of shared memory – are due to ship in either late 2014 or early 2015. Pricing has not been set for the machines as of yet, but it is reasonable to expect that SGI will be able to command some sort of premium for that scalability. If SAP allows for it, the company could even double the scalability up one more time to 64 sockets and 48 TB. If need be, the same balance of CPUs and memory could, in theory, be pushed even further to the 64 TB addressing limit, but the resulting machine would have a weird number of sockets (roughly 85). Computers do not like numbers that are not powers of 2.
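The scaling math behind these configurations is simple enough to check. The sketch below derives the per-socket memory from the initial eight-socket, 6 TB machine and projects it forward; it assumes binary units (1 TB = 1,024 GB) and a constant memory-to-socket ratio, and the helper names are hypothetical.

```python
# Illustrative arithmetic; assumes 1 TB = 1024 GB and a fixed memory/socket ratio.
GB_PER_TB = 1024
BASE_SOCKETS = 8
BASE_MEMORY_TB = 6

gb_per_socket = BASE_MEMORY_TB * GB_PER_TB // BASE_SOCKETS

def memory_tb(sockets: int) -> float:
    """Shared memory, in TB, at the base machine's memory-per-socket ratio."""
    return sockets * gb_per_socket / GB_PER_TB

def is_power_of_two(n: int) -> bool:
    return n > 0 and n & (n - 1) == 0

print(f"{gb_per_socket} GB per socket")
for sockets in (8, 16, 32, 64):
    print(f"{sockets} sockets -> {memory_tb(sockets):.0f} TB")

# Sockets needed to reach the 64 TB addressing ceiling at the same ratio.
sockets_for_64tb = 64 * GB_PER_TB / gb_per_socket
print(f"64 TB needs about {sockets_for_64tb:.1f} sockets; "
      f"85 is a power of two: {is_power_of_two(85)}")
```

Each socket carries 768 GB, which is consistent with filling the 24 memory slots per socket on the Xeon E7 v2 with 32 GB DIMMs, and reaching 64 TB at that ratio takes a little over 85 sockets.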
Aside from the benefits from lower latency across the NUMAlink interconnect, shifting from clusters to shared memory systems provides a bunch of other benefits to enterprises.
For one thing, you can mix and match data warehousing, analytics, and transactional systems all on the same platform – something you cannot do with clusters. SAP itself runs its data warehouse on clusters but its transaction systems run on big memory machines. (They happen to be a lot smaller than the HP Kraken or SGI UV 300H machines, as EnterpriseTech has previously reported.) The cache coherency in the shared memory system, which keeps both cache and main memories in sync across the tightly coupled nodes, is done in hardware and is very fast. But on a cluster this has to be done across a switched network and in software, and that also adds latency. This is why SAP does not recommend running its Business Suite atop the HANA database on cluster configurations.
Braham says that when SGI started working with SAP on fat HANA systems, the idea was to get some leverage with a server consolidation play, touting the management and performance benefits of a shared memory system over traditional Ethernet or InfiniBand clusters. This is much the same playbook that SGI uses in its traditional supercomputing market. For HANA in particular, SGI reckoned that the UV 300H might be appropriate for the 5 to 10 percent of the largest HANA shops that needed more memory than a four-way or eight-way Xeon E7 server could deliver.
But after developing the machine, SGI has realized that the UV 300H is appropriate not just for transactional systems that have run out of gas and do poorly across clusters, but also as a system that can put all data – transactional and analytical – on the same box without adversely affecting the performance of transactional systems. Complex joins of database tables, which bring clusters to their knees, are done like greased lightning on a shared memory system, says Dunmire. The fact that complex SQL queries will dim the lights on big iron systems designed for transaction processing is why data warehouses were carved out in the first place two decades ago. In-memory is going to perhaps reverse that trend, or at the very least diminish the role of the data warehouse to deep, time series analysis.
The UV 300H system from SGI runs SUSE Linux Enterprise Server 11 SP3, just as all other SAP HANA appliances have up to this week. Red Hat has just partnered with SAP to get a modified variant of RHEL 6.5 underneath HANA, which has been tweaked with special library tunings and the XFS file system. Braham says SGI is looking at supporting RHEL on its HANA machines but that it is probably going to happen with the RHEL 7 release coming later this year, not the RHEL 6.5 update that Red Hat and SAP have just launched to get started. SGI is making no commitments at this point and will be, as all IT suppliers are, led by customer demand. The UV 300H runs HANA SP7, as all new appliances do.
SGI is not going to preannounce products but did hint to EnterpriseTech to expect a future UV line to offer a mix of Xeon E5 and Xeon E7 processors for computationally intensive work and machines like the UV 300H tuned up specifically for other tier-one enterprise applications. Oracle in-memory databases come immediately to mind, particularly with Oracle making a big in-memory announcement next week. But it stands to reason that the computational variants of the so-called UV 3000s will be a hot topic at the SC14 supercomputing conference in New Orleans.