When Purdue University was established back in the 1860s, its founders – including John Purdue – could not in their wildest dreams have imagined how their little agricultural college would evolve over the next century and a half.
If you could somehow transport the original six instructors and 39 students to today’s campus – which is the size of a large town and supports nearly 40,000 students – they would be astonished. For example, what would they make of the sensor-equipped drones that the agricultural and mechanical engineering colleges use to collect massive amounts of data from acres of farmland?
Once they got over their initial culture shock, these visitors from an earlier time might regard the deluge of big data being processed on one of the nation’s largest campus cyberinfrastructures as nothing short of miraculous.
Today, Purdue’s IT organization supports the needs of up to 1,000 researchers working on several hundred concurrent research projects. The university has a lot invested in HPC horsepower – three of its systems are currently listed on the Top500. Purdue supports the nation’s largest academic distributed computing grid and the largest collection of science and medical online hubs. Constantly growing big data volumes are generated by the university’s top research areas, including computational nanotechnology, aeronautical and astronautical engineering, mechanical engineering, genomics and structural biology. Even the university’s College of Liberal Arts and Department of Sociology are making big data demands on the IT infrastructure. As part of the solution, Purdue has deployed the Data Depot, a high-capacity data storage service supporting university researchers across all fields of study.
IT has its work cut out for it. Says Mike Shuey, research infrastructure architect at Purdue, “The challenge of managing varied research needs is accommodating very large parallel I/O jobs and millions of small, random read requests without imposing performance penalties on anyone.” Meeting that challenge required a single large-scale, centralized data repository that could be accessed from multiple HPC systems.
Purdue has deployed a pair of DDN SFA 12KX storage systems with 6.4 petabytes of raw capacity for the university’s GPFS parallel file system. To ensure reliable, fast access to the Data Depot, IT has also deployed DDN SFX software, which extends the storage cache with solid-state memory. This ensures that the system loads the right data into flash storage at the right time to maximize cache hit rates and provide fast response.
“What was most appealing about DDN’s SFX technology is that it removes much of the work from the underlying storage,” comments Shuey. “Since millions of read requests can be served from cache, we can meet all research needs without impacting overall storage performance.”
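The cache behavior Shuey describes – millions of small read requests served from a fast flash tier instead of hitting the underlying disks – can be illustrated with a toy model. The sketch below is a generic LRU read cache, not a representation of DDN’s actual SFX internals; all names (`ReadCache`, `backing_store`) are illustrative. With a skewed workload, where a handful of hot small files dominate the read stream, nearly every request lands in the fast tier:

```python
from collections import OrderedDict

class ReadCache:
    """Toy LRU read cache: hot blocks are served from a fast tier,
    misses fall through to the (slow) backing store."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing = backing_store   # dict-like: block id -> data
        self.cache = OrderedDict()     # fast tier, LRU order
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(block_id)   # mark as recently used
            return self.cache[block_id]
        self.misses += 1
        data = self.backing[block_id]          # slow-path read from disk
        self.cache[block_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return data

# Skewed workload: three "hot" small files dominate 300 of 301 reads.
store = {i: f"data-{i}" for i in range(1000)}
cache = ReadCache(capacity=8, backing_store=store)
for _ in range(100):
    for hot in (1, 2, 3):
        cache.read(hot)
cache.read(999)  # one cold, one-off read

hit_rate = cache.hits / (cache.hits + cache.misses)
print(f"hit rate: {hit_rate:.1%}")  # → hit rate: 98.7%
```

The point of the toy: once the hot working set fits in the fast tier, only the first touch of each block (and the occasional cold read) reaches the backing storage, which is why a cache layer can absorb small random reads without degrading streaming performance on the disks behind it.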
Big Data, Big Benefits
Shuey says that DDN’s SFA storage with SFX technology has delivered a 900% improvement in read performance at low cost, enabling users to access millions of small files on dedicated solid-state modules while simultaneously streaming very large data files. “Simple data queries that used to take two minutes now take two seconds,” he says.
In fact, the IT department was so pleased with DDN’s storage solutions that it made a follow-on purchase of the EXAScaler storage appliance featuring the Lustre parallel file system with SFX. Explains Shuey, “We procured another compute cluster and needed more local storage for researchers. DDN’s EXAScaler Lustre storage appliance is a good fit here as well, since it will help us push the boundaries of next-generation HPC applications.”
And what about our visitors from the 19th century? Would seeing their little school transformed into a 21st-century HPC powerhouse be their most impressive memory? Or would it be the cardboard container of hot Frappuccino latte from the street-corner coffee shop? We’ll never know.