The Magellan Project: Building High-Performance Clouds

September 10, 2012

Most enterprises are aware of how cloud computing helps cut IT costs, deploy systems more rapidly, and increase agility. But what about scientific efforts? Can clouds – built on widely available platforms and commodity hardware – help researchers do serious science?

The Magellan system, funded by the U.S. Department of Energy (DOE), is designed to run scientific applications. The system began as part of an evaluation effort to determine whether cloud computing was useful for technical computing workloads. The evaluation proved successful, and today the system is being used to push the limits of technical cloud computing.

The initial phase of the Magellan research project was a collaborative effort, funded by the American Recovery and Reinvestment Act, between Argonne National Laboratory and Lawrence Berkeley National Laboratory (LBL). At Argonne, a large-scale system was constructed to be operated as a private cloud, with the aim of evaluating the usefulness of this approach to scientists. At the National Energy Research Scientific Computing Center at LBL, a smaller system was built to assess a variety of scientific workloads and analyze their suitability for a cloud environment. The combined Magellan resources were made available to about 3,000 scientists and researchers.

The Argonne team began the project by running an early, open source implementation of the Amazon EC2 application programming interfaces. The goal was to assess the maturity of the system software stack, its applicability to scientific applications, and the effects that the cloud model had on scientific users.

“Things were fine at small scale; but once we hit about 100 nodes, the system had substantial scalability and stability problems,” said Narayan Desai, technical lead of the efforts at Argonne. “We expected to see a performance penalty because of virtualization; however, the penalty was considerably smaller than we expected for many scientific applications. Since our goal scale was about 700 nodes, this approach just wasn’t going to work.”

The team began looking at a number of other cloud computing options. One that seemed particularly promising was OpenStack, a large-scale, open source cloud computing initiative. OpenStack was founded to drive community-established industry standards, end cloud lock-in, and accelerate the adoption of cloud technologies by service providers and enterprises. As a cloud operating system, it automatically manages pools of compute, storage, and networking resources at scale and is supported by an ecosystem of technology providers.

“We found OpenStack straightforward to build and deploy, and it scaled much better than our previous platform,” Desai said.

Today, the Magellan cloud consists of about 750 nodes, including 500 compute nodes, 200 storage nodes, a number of big memory (1-terabyte) nodes, and 12 management nodes.

Building a Flexible Environment for Computational Science

The team at Argonne also found that the OpenStack cloud created an ideal platform for prototyping and testing large-scale scientific applications that have not been tailored to the traditional high-performance computing (HPC) environment.

“One of the most surprising findings from the evaluation project was the benefit scientific users derive from direct access to computational resources for their applications,” Desai said. “The flexibility of OpenStack has made a class of users – those who do a lot more prototyping and development on a regular basis – extremely productive. Moreover, this benefit vastly outweighs the performance penalties for many application types, particularly in loosely coupled applications. Both of these conclusions were unexpected.”

Magellan Going Forward

Motivated by the success of the evaluation project, Argonne has been transitioning the system from a testbed into a production-grade system and has been training more researchers in how to best exploit its capabilities.

Another major focus is closing the performance gap between traditional HPC platforms and the OpenStack platform. “While we expect that virtualization will never be free, we need to gain a better understanding of the performance trade-offs, as well as techniques for tuning performance in virtual machines,” said Desai.

To this end, the Argonne team has been building a performance-optimized cloud. Network and storage performance are often cited as key challenges for cloud systems. Since scientific workloads often depend on movement of large data sets from site to site for analysis and visualization, bottlenecks here can have a substantial impact on the progress of science teams.

The team’s first priority was to assess network performance. Using a development deployment of OpenStack, the team demonstrated near-saturation of a wide-area 100-gigabit Ethernet link.

“We were able to achieve 99 gigabits of traffic flowing from 10 virtual machine instances at Argonne to LBL across ESnet, the DOE research network,” said Desai. “We had expected to need many more instances running across 20-30 nodes, but the fact that our network interfaces were the limiting factor was excellent; that demonstrates the low overhead virtualization can have, while leaving room for improved node performance. Even more important, all of our network tuning could be accommodated without any modifications to OpenStack.”
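The figures in the quote above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the function names are illustrative only, and the even split across instances is an assumption, since the article reports only the aggregate rate and the instance count.

```python
def per_instance_gbps(total_gbps: float, instances: int) -> float:
    """Average rate per instance, assuming traffic is split evenly (an assumption)."""
    return total_gbps / instances


def link_utilization(total_gbps: float, link_gbps: float) -> float:
    """Fraction of the link's nominal capacity that the aggregate traffic fills."""
    return total_gbps / link_gbps


# Figures quoted in the article: 99 gigabits per second from 10 instances
# over a 100-gigabit link.
rate = per_instance_gbps(99.0, 10)
utilization = link_utilization(99.0, 100.0)

print(f"{rate:.1f} Gb/s per instance, {utilization:.0%} of the link")
```

On these numbers each instance averages just under 10 gigabits per second, which fits Desai's remark that the network interfaces, not virtualization, were the limiting factor. The article does not state the interface speed, so that reading is an inference rather than a reported fact.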

The team has also been working on improving storage performance. Team members have built a custom storage solution that delivers more than 2 gigabytes per second per server.

“In aggregate, our storage servers should be able to provide enough bandwidth (12.5 gigabytes per second) to stream data across the 100-gigabit Ethernet link at line rate. Soon, it will be feasible for researchers to dynamically provision cloud resources to move data cross country at tens of gigabytes a second,” Desai said.
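The 12.5-gigabyte figure follows directly from the link speed: 100 gigabits per second divided by 8 bits per byte. The sketch below works through that conversion and estimates how many storage servers would be needed to fill the link. It assumes exactly 2 gigabytes per second per server (the article says "more than 2"), decimal units, and no protocol overhead; the function names are illustrative only.

```python
import math

BITS_PER_BYTE = 8


def gbits_to_gbytes(gbits_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (decimal units)."""
    return gbits_per_s / BITS_PER_BYTE


def servers_needed(link_gbits_per_s: float, per_server_gbytes_per_s: float) -> int:
    """Smallest whole number of servers whose combined rate reaches line rate."""
    line_rate = gbits_to_gbytes(link_gbits_per_s)
    return math.ceil(line_rate / per_server_gbytes_per_s)


line_rate = gbits_to_gbytes(100.0)   # 12.5 gigabytes per second
servers = servers_needed(100.0, 2.0)  # assumes exactly 2 GB/s per server

print(f"Line rate: {line_rate} GB/s; servers needed at 2 GB/s each: {servers}")
```

Because each server delivers more than 2 gigabytes per second, the server count computed here is an upper bound under these assumptions; real transfers would also have to account for protocol and file-system overhead.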

Major Applications and Future Plans

Magellan is driven by its applications, and many of its early users come from bioinformatics. One major user of the system is the DOE Systems Biology Knowledge Base (KBase), a collaborative effort to build predictive models of microbes, microbial communities, plants, and their interactions. KBase researchers use Magellan to build data-intensive computational methods and services. Another project using Magellan is the MG-RAST metagenomic annotation system, which assesses microbial communities’ composition and metabolic function.

Desai’s team plans to generalize the system to other application domains over the next year, including cosmology and materials science. The team also will continue enhancing network and storage performance in order to meet the needs of additional compute-heavy and data-intensive DOE applications and enable new scientific understanding in a variety of disciplines.

“The cloud computing effort has been a big success,” said Desai. “A lot of researchers have gotten new science done on the OpenStack cloud, and we’re going to keep learning how we can push this cloud further.”