At DataFrameworks, we are excited to be part of DDN’s Best Practices for Big Data in Life Sciences Research (BPLS) workshop in Boston on May 23. Over the years, we have observed many parallels between life science file-based workflows and media file-based workflows. Lessons learned, best practices, and efficiency gains achieved in media workflows can often be applied to life science data pipelines, and vice versa. I have personally witnessed cross-pollination in problem solving between these two workflows. For example, the term “pipeline,” which refers to the processes and infrastructure that deliver file-based workflows, has been used in media workflows for at least 20 years, and in the last few years it has become increasingly common in life science workflows as well.

Other similarities include:

  • Instruments of increasing quality that accelerate the generation of data
  • A high percentage of both image-based and sequence-based file types
  • Large numbers of computer-generated data files, produced in a process that involves human “what if” interpretation and analysis
  • Service-oriented workflows where companies perform services on data that they don’t actually own (in life sciences, this might be sequencing genomes while in media this might be adding visual effects to digital camera images)
  • The need to provide audit trails and security when one company’s data is in the custody of another organization
  • Data inefficiencies that reduce the funding available for research or for creating better images
  • The raw number of files in both file-based workflows creates challenges where manual processes break down and are simply not scalable

Some of the notable differences include:

  • File and directory count profiles (Media and Entertainment has 87,411,831 files/PB and 1,749,882 directories/PB while Life Sciences has 527,997,883 files/PB and 14,765,749 directories/PB)
  • Low profit margins associated with some of the image services (visual effects, color correction, digital film restoration, etc.) have forced media workflows to optimize the efficiency and automation associated with their file-based production lines, simply to stay in business
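To make the first difference concrete, the per-petabyte counts quoted above can be turned into a few derived figures. The following is a minimal sketch, not a DataFrameworks tool: the dictionary layout and function names are hypothetical, and the mean file size assumes a decimal petabyte (10^15 bytes), which the figures above do not specify.

```python
# Per-petabyte file and directory counts, as quoted in the list above.
PROFILES = {
    "media": {"files": 87_411_831, "dirs": 1_749_882},
    "life_sciences": {"files": 527_997_883, "dirs": 14_765_749},
}

# Assumption: 1 PB = 10**15 bytes (decimal petabyte).
BYTES_PER_PB = 10**15


def files_per_directory(name: str) -> float:
    """Average number of files per directory for one profile."""
    profile = PROFILES[name]
    return profile["files"] / profile["dirs"]


def mean_file_size_mb(name: str) -> float:
    """Mean file size in megabytes (10**6 bytes), under the PB assumption."""
    return BYTES_PER_PB / PROFILES[name]["files"] / 1e6


def life_to_media_ratio(metric: str) -> float:
    """How many times larger the life sciences count is than the media count."""
    return PROFILES["life_sciences"][metric] / PROFILES["media"][metric]


if __name__ == "__main__":
    for name in PROFILES:
        print(
            f"{name}: {files_per_directory(name):.1f} files/dir, "
            f"mean file size {mean_file_size_mb(name):.2f} MB"
        )
    print(f"files/PB ratio (life sciences : media): {life_to_media_ratio('files'):.2f}")
    print(f"dirs/PB ratio (life sciences : media): {life_to_media_ratio('dirs'):.2f}")
```

These are averages only. They say nothing about how file sizes are distributed within either workflow, but they do show that a petabyte of life sciences data carries several times more files and directories than a petabyte of media data.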

“Necessity is the mother of invention” is a cliché that means, roughly, that need is the primary driving force behind most new inventions. To that end, the data management and storage challenges of life sciences, media, and HPC have a huge impact on how DDN® and DataFrameworks are architecting new products to handle the scale of both life science and media file-based workflows.

DDN and DataFrameworks have teamed up to solve some of the tough challenges associated with the large scale of life sciences file-based workflows. We’re glad to be at DDN’s BPLS event here in Boston and look forward to learning about additional challenges as well as sharing our knowledge of the latest trends in data storage in the life sciences industry.

  • Paul Honrud
  • Founder of DataFrameworks
  • Date: May 19, 2017