
Revolutionizing Biomedical GenAI with Hyperscale RAG: DDN Infinia, Milvus and the Full PubMed Dataset

By John Lucas, DDN Field CTO

Have you ever asked a Large Language Model (LLM) where it pulls its information from, only to receive a vague “it comes from various places”? LLMs excel at generating insightful responses but often struggle to pinpoint exact sources.

Here’s why:

  • Training Data Aggregation: LLMs are trained on massive datasets from the web, books, and journals, often stripped of metadata like URLs, identifiers, or publication dates. The data becomes a knowledge stew, losing original source details.
  • Knowledge Compression: They encode info into neural network patterns, not a citation database. Pinpointing a specific article? That’s like finding a needle in a digital haystack.
  • Dynamic Content: Web pages and journals evolve, and LLMs can’t always verify a source’s state in real-time.
  • Design Focus: LLMs prioritize coherent, relevant responses over academic citations. Precise sourcing needs extra systems like APIs or databases, which aren’t always integrated.
  • Hallucination Risks: Without direct source access, LLMs may invent plausible but false references, relying on patterns instead of verified data.

Wouldn’t it be game-changing if you could ask an LLM exactly where its info came from? Imagine querying a top-tier LLM and getting precise details: for a journal article, the journal, volume, page, and publication date; for a web source, the URL and timestamp. With the full PubMed dataset—over 36 million biomedical articles—and a hyperscale RAG pipeline powered by DDN Infinia and Milvus, this vision is not just possible—it’s within reach! Let’s explore how to connect Milvus to DDN Infinia’s S3 API and why this RAG approach is indispensable for minimizing hallucinations and ensuring full traceability in GenAI.

About PubMed

PubMed is a free, publicly accessible online database maintained by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine (NLM). It serves as a comprehensive resource for biomedical and life sciences literature, providing access to millions of peer-reviewed articles, abstracts, and citations from journals, books, and other scientific publications.

Primarily focused on medical and health-related research, PubMed includes studies in fields like medicine, biology, pharmacology, and related disciplines. Researchers, healthcare professionals, and students use PubMed to find reliable, citable sources for evidence-based practice, academic research, and clinical decision-making.

Why Attributable RAG With DDN Infinia Is Indispensable

Hallucinations—those pesky, made-up facts LLMs sometimes spit out—are a major hurdle in GenAI, especially in high-stakes fields like biomedicine. A RAG pipeline with the full PubMed dataset, backed by Milvus and DDN Infinia, tackles this head-on:

Minimizing Hallucinations:

  • Grounded Responses: RAG retrieves relevant text chunks from PubMed’s 36 million articles (~50–70 million chunks) before feeding them to an LLM. This ensures answers are grounded in verified data rather than generative patterns alone. For example, asking, “What’s the impact of AI on healthcare?” pulls exact PubMed abstracts, reducing the risk of hallucination.
  • Contextual Precision: By using Milvus’s similarity search, RAG delivers the most relevant chunks, ensuring the LLM has accurate context. PubMed’s focused biomedical content avoids the noise of broader corpora like Common Crawl.
  • No Overgeneralization: Unlike LLMs’ generalized knowledge, RAG references specific documents, preventing the model from “guessing” based on compressed patterns.

Full Traceability:

  • Metadata-Rich Sources: PubMed’s metadata (PMID, DOI, journal, publication date) enables precise source attribution. Users can trace an answer to a specific article, volume, or date, which is critical for trust in fields like medicine. For instance, a response can cite “Journal of Medical AI, Vol. 23, 2024, DOI: 10.1000/xyz” (see the metadata sketch after this list).
  • Immutable Storage: DDN Infinia’s S3-compatible storage, with its petabyte-to-exabyte scale, securely stores PubMed’s ~300–525 GB of vectors, indices, and metadata, ensuring persistent access to source data. Milvus’s JSON field support preserves detailed metadata for every chunk.
  • Auditability: Researchers or clinicians can verify answers by checking cited sources, fostering trust and accountability. This is vital for regulatory compliance or peer review in biomedical applications.
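
As a concrete illustration, here is the kind of per-chunk metadata record that could live in Milvus’s JSON field. The field names match the query code later in this post; the values are made up:

# Hypothetical per-chunk metadata record stored alongside each vector.
# Field names mirror those used in the query code below; values are illustrative.
chunk_metadata = {
    "pmid": "38123456",
    "doi": "10.1000/xyz",
    "journal": "Journal of Medical AI",
    "publication_date": "2024-03-15"
}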

Scalability and Performance:

  • Hyperscale Capacity: PubMed’s ~50–70 million chunks are a breeze for DDN Infinia’s exabyte-scale storage and Milvus’s trillion-vector capacity. This setup could hypothetically scale to 100x larger datasets, like web archives or multi-domain corpora.
  • Blazing Speed: Milvus delivers <100 ms query latency, and Infinia’s TB/s throughput supports hundreds of thousands of queries per second (QPS), ensuring real-time RAG responses for busy research environments.
  • Dynamic Updates: Infinia’s multi-TB/s write speed and Milvus’s streaming ingestion handle PubMed’s ~1 million new articles annually, keeping the system current without missing a beat (a small update sketch follows this list).
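
As a rough sketch of how such incremental updates might be pulled in, newly indexed PMIDs can be fetched with an Entrez date filter and fed through the same chunk/embed/insert path as the initial load. The date range, retmax value, and downstream helpers are assumptions for illustration:

from Bio import Entrez

# Hypothetical incremental-update sketch: fetch PMIDs indexed in a recent window
# and feed them through the same chunk/embed/insert path as the initial load.
Entrez.email = "your.email@example.com"
handle = Entrez.esearch(db="pubmed", term="all[filter]", datetype="edat",
                        mindate="2025/06/01", maxdate="2025/06/17", retmax=100000)
new_pmids = Entrez.read(handle)["IdList"]
# ...chunk, embed, and collection.insert() exactly as in the batch loop shown later...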

This RAG pipeline transforms GenAI from a “trust me” black box into a transparent, reliable tool, indispensable for applications where accuracy and traceability are non-negotiable.

The Setup: Milvus Meets DDN Infinia’s S3 API

We’re building a RAG pipeline with the full PubMed dataset, processing roughly 300–525 GB of vectors, indices, and metadata. Milvus stores vectors and metadata, while DDN Infinia’s S3 API handles persistent index files and logs. Here’s the flow:

  1. Prep PubMed: Fetch all 36 million abstracts via the NCBI Entrez API or XML dumps, chunking them into ~150-word segments with metadata (PMID, DOI, journal, date); see the chunking sketch after this list.
  2. Vectorize: Convert chunks into 768-dimensional vectors using Sentence-BERT.
  3. Store via S3: Save vectors in Milvus, with indices and logs in DDN Infinia’s S3 buckets.
  4. Query and Cite: Retrieve chunks with metadata and feed to an LLM, delivering answers with precise sources.
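
To make step 1 concrete, here is a minimal chunking sketch. It is illustrative only; chunk_abstract is a helper introduced here, not part of the production pipeline:

# Minimal chunking sketch: split an abstract into ~150-word segments and
# attach the PubMed metadata (PMID, DOI, journal, date) to each chunk.
def chunk_abstract(abstract, metadata, words_per_chunk=150):
    words = abstract.split()
    chunks = []
    for start in range(0, len(words), words_per_chunk):
        chunks.append({
            "text": " ".join(words[start:start + words_per_chunk]),
            "metadata": metadata
        })
    return chunks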

Prerequisites

  • DDN Infinia cluster with S3 object storage (see DDN docs).
  • Milvus deployed on Kubernetes (on DDN Infinia).
  • Python 3.8+, libraries: pymilvus, sentence-transformers, biopython.
  • PubMed access via NCBI Entrez API or XML/JSON dumps.
  • S3 credentials for DDN Infinia (endpoint, access key, secret key, bucket name).
  • GPU cluster for embedding ~50–70 million chunks.

Key Code Snippets

Below are the critical snippets for connecting Milvus to DDN Infinia’s S3 API and building the RAG pipeline.

Configure Milvus for DDN Infinia S3 Storage

Set up Milvus to use Infinia’s S3 endpoint in milvus_cluster.yaml (a Kubernetes Secret plus a milvus-operator Milvus resource; “...” marks fields elided for brevity):

apiVersion: v1
kind: Secret
metadata:
  name: infinia-s3-secret
type: Opaque
...
---
apiVersion: milvus.io/v1beta1
kind: Milvus
...
spec:
  config:
    minio:
      bucketName: "rag-vectors"
      rootPath: milvus/rag
      useSSL: true
  dependencies:
    storage:
      external: true
      type: S3
      endpoint: s3.infinia.ddn.com:443
      secretRef: "infinia-s3-secret"

Apply it:

kubectl apply -f milvus_cluster.yaml
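
Before pointing Milvus at the bucket, it can be worth sanity-checking the Infinia S3 credentials directly. A minimal check with boto3, where the endpoint, keys, and bucket name are placeholders for your Infinia values:

import boto3

# Quick connectivity check against the Infinia S3 endpoint; replace the
# placeholder endpoint, keys, and bucket with your deployment's values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.infinia.ddn.com:443",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY"
)
s3.head_bucket(Bucket="rag-vectors")  # raises an error if the bucket is unreachable
print("Infinia S3 bucket 'rag-vectors' is reachable")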

Create Milvus Collection and Insert PubMed Data

Set up the collection and process PubMed abstracts in batches:

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType, utility
from Bio import Entrez
from sentence_transformers import SentenceTransformer
import json, logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Connect to Milvus
connections.connect(host="localhost", port="19530")

# Create Collection
collection_name = "pubmed_rag"
if utility.has_collection(collection_name):
    utility.drop_collection(collection_name)

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="metadata", dtype=DataType.JSON)
]
schema = CollectionSchema(fields, description="Full PubMed RAG")
collection = Collection(collection_name, schema)

# Process PubMed (for ~36M abstracts)
Entrez.email = "your.email@example.com"
batch_size = 10000
model = SentenceTransformer("all-mpnet-base-v2")  # 768-dimensional Sentence-BERT model, matching dim=768 above
pmids = Entrez.read(Entrez.esearch(db="pubmed", term="all[filter]", retmax=36000000))["IdList"]
for i in range(0, len(pmids), batch_size):
    batch_pmids = pmids[i:i+batch_size]
    records = Entrez.read(Entrez.efetch(db="pubmed", id=batch_pmids, retmode="xml"))
    ...

# Create Index
collection.create_index(field_name="vector", index_params={"metric_type": "L2", "index_type": "HNSW", "params": {"M": 16, "efConstruction": 200}})
collection.load()
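
The per-batch body is elided above. Purely as an illustrative sketch of what it could look like (extract_abstract_and_metadata is a hypothetical helper, and chunk_abstract is the chunking sketch from earlier; neither is from the original pipeline):

# Illustrative sketch of the elided per-batch body, not the exact production code.
for i in range(0, len(pmids), batch_size):
    batch_pmids = pmids[i:i+batch_size]
    records = Entrez.read(Entrez.efetch(db="pubmed", id=batch_pmids, retmode="xml"))
    texts, metadatas = [], []
    for record in records["PubmedArticle"]:
        # extract_abstract_and_metadata: hypothetical helper that pulls the abstract
        # plus PMID, DOI, journal, and publication date from the Entrez XML record
        abstract, metadata = extract_abstract_and_metadata(record)
        for chunk in chunk_abstract(abstract, metadata):
            texts.append(chunk["text"])
            metadatas.append(json.dumps(chunk["metadata"]))  # JSON string, as the query code expects
    vectors = model.encode(texts).tolist()                   # 768-dim embeddings
    collection.insert([vectors, texts, metadatas])           # column order matches the schema
    logger.info(f"Inserted {len(texts)} chunks from batch starting at offset {i}")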

Query and Cite Sources

Retrieve relevant chunks with metadata for RAG:

import json, textwrap
from transformers import pipeline

# model and collection come from the setup script above
query = "What's the impact of AI on healthcare?"
query_embedding = model.encode([query])[0]
results = collection.search(
    data=[query_embedding.tolist()],
    anns_field="vector",
    param={"metric_type": "L2", "params": {"ef": 200}},
    limit=5,
    output_fields=["text", "metadata"]
)

# Generate Response with LLM
llm = pipeline("text-generation", model="gpt2")  # Replace with production LLM
context = ""
sources = []
for i, hit in enumerate(results[0]):
    text = hit.entity.get("text")
    meta = json.loads(hit.entity.get("metadata"))
    context += f"[{i+1}] {text}\n"
    sources.append(f"Source [{i+1}]: PMID {meta['pmid']}, {meta['journal']}, {meta['publication_date']}, DOI: {meta['doi']}")

prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
response = llm(prompt, max_length=200, num_return_sequences=1)[0]["generated_text"]

print("RAG Response:")
print(textwrap.fill(response, width=80))
print("\nSources:")
for source in sources:
    print(source)

How It Works

  • S3 Integration: Milvus stores index files and logs in DDN Infinia’s rag-vectors S3 bucket, leveraging NVIDIA BlueField-3 DPUs for low-latency RDMA access.
  • Source Attribution: PubMed’s metadata (PMID, DOI, journal, date) in Milvus’s JSON field ensures traceability, e.g., citing “New England Journal of Medicine, 2024, DOI: 10.1056/NEJM123.”
  • Scalability: PubMed’s ~300–525 GB of vectors, indices, and metadata fit comfortably in a single DDN Infinia node, with Milvus scaling to trillions of vectors for future growth.
  • Performance: Milvus’s <100 ms query latency and Infinia’s TB/s throughput support hundreds of thousands of QPS, ideal for real-time biomedical RAG.

The Hyperscale Vision

The full PubMed dataset powers a RAG pipeline that minimizes hallucinations and ensures traceability, critical for biomedical research. With DDN Infinia’s S3 API and Milvus’s distributed architecture, you could realistically scale to much larger corpora while maintaining precise source attribution. This is a game-changer for GenAI in high-stakes domains!

Get Started

Set up your DDN Infinia cluster, configure Milvus with the S3 API, and access PubMed via NCBI Entrez API or XML dumps. The biomedical RAG revolution awaits!

