A tale of GPU memory limits, CUDA version mismatches, and the torchrun command that saved the day.
"It's just containerized ML. How hard can it be?"
I said that in December 2024 when I started porting NVIDIA's Financial Fraud Detection training pipeline to Kubeflow on EKS. The original ran on SageMaker, and moving to Kubeflow seemed like a weekend project. Six weeks, countless OOM kills, and one very expensive GPU bill later, the pipeline finally worked.
This is the story behind PR #14 in the aws-samples/sample-financial-fraud-detection-with-nvidia repository, where a "simple port" turned into a deep dive on GPU infrastructure, CUDA compatibility, and why you should never underestimate memory limits.
The goal was simple: process 24 million credit card transactions from IBM's TabFormer dataset, train a Graph Neural Network combined with XGBoost, and output a model for Triton Inference Server. The reality was anything but.
CUDA versions will haunt your dreams
The first real problem appeared as a one-line error message that tells you absolutely nothing:
Segmentation fault (core dumped)
This happened the moment cuDF tried to do anything. No stack trace, no helpful error, just silence and a dead container. The GPU nodes had CUDA 13.0 drivers, but I was using a RAPIDS image built for CUDA 12.5. These version mismatches don't throw nice errors. They segfault.
The fix was embarrassingly simple once I understood the problem:
# Wrong - silent death
RAPIDS_IMAGE = "rapidsai/base:24.12-cuda12.5-py3.12"
# Right - actually works
RAPIDS_IMAGE = "rapidsai/base:25.12-cuda13-py3.12"
I now check node CUDA versions before doing anything else. kubectl get nodes -l nvidia.com/gpu=true -o jsonpath='{...nvidia.com/cuda.runtime-version}' is burned into my muscle memory.
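That pre-flight check can also be scripted. What follows is a minimal sketch, not code from the repository: the helper names are hypothetical, it assumes the CUDA version appears in the image tag as "cudaNN", and it treats "same major version" as the compatibility rule, which is a simplification drawn from the mismatch described above. The node version string is assumed to come from the kubectl query.

```python
import re


def cuda_major_from_image(image: str) -> int:
    """Extract the CUDA major version from a tag such as '...-cuda12.5-py3.12'."""
    match = re.search(r"cuda(\d+)", image)
    if match is None:
        raise ValueError(f"no CUDA version found in image tag: {image!r}")
    return int(match.group(1))


def cuda_major_from_version(version: str) -> int:
    """Extract the major version from a string such as '13.0'."""
    return int(version.split(".")[0])


def image_matches_node(image: str, node_cuda_version: str) -> bool:
    """True when the image and the node report the same CUDA major version."""
    return cuda_major_from_image(image) == cuda_major_from_version(node_cuda_version)


# The two images from this post, checked against a node reporting CUDA 13.0.
old_image_ok = image_matches_node("rapidsai/base:24.12-cuda12.5-py3.12", "13.0")
new_image_ok = image_matches_node("rapidsai/base:25.12-cuda13-py3.12", "13.0")
```

Failing fast on a mismatch at pipeline-compile time is a lot cheaper than finding out from a segfault on a GPU node.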
The OOM saga
This was my longest battle. I thought I understood GPU memory. I did not.
Attempt one used a g4dn.xlarge: 16GB GPU memory, 16GB system RAM, $0.53/hr. The node OOM'd so hard the kubelet died. Not the pod, the entire node.
Attempt two used a g4dn.2xlarge: same 16GB T4 GPU, but 32GB system RAM. Still OOM. The T4 just doesn't have enough VRAM for 24 million rows in cuDF.
Attempt three used a g5.2xlarge: A10G GPU with 24GB VRAM, 32GB system RAM, $1.21/hr. OOM killed again, but this time system RAM was the bottleneck, not GPU memory.
Attempt four used a g6e.2xlarge: L40S GPU with 48GB VRAM, 64GB system RAM, $1.86/hr. Still OOM.
At this point I wanted to throw my laptop into the ocean.
By this point the instance was no longer the problem. The culprit was one line buried in my pipeline definition: set_memory_limit("28Gi"). Even with 64GB of RAM on the node, Kubernetes obediently killed my pod the moment it touched 28GiB. The fix was adjusting the limit to match reality:
preprocess_task.set_memory_request("16Gi").set_memory_limit("50Gi")
Two days of debugging. One line of code.
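A small sanity check would have caught this much earlier. The sketch below is illustrative only: parse_quantity and exceeds_limit are hypothetical helpers, the parser handles only a subset of Kubernetes quantity suffixes, and the 30Gi peak usage is an assumed figure, not a measurement from the pipeline.

```python
# Subset of Kubernetes quantity suffixes: binary (Ki, Mi, Gi, Ti) and decimal (k, M, G, T).
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '28Gi' or '64G' to bytes."""
    # Check two-character suffixes before one-character ones.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(quantity)  # a bare number is already in bytes


def exceeds_limit(peak_usage: str, limit: str) -> bool:
    """True when the container would cross its memory limit and be OOM killed."""
    return parse_quantity(peak_usage) > parse_quantity(limit)


# Hypothetical peak of 30Gi against the old and new limits from this post.
killed_with_old_limit = exceeds_limit("30Gi", "28Gi")
killed_with_new_limit = exceeds_limit("30Gi", "50Gi")
```

The point is that the limit, not the node size, decides when the pod dies, so the limit is the first number to compare against observed usage.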
Torch distributed doesn't initialize itself
With preprocessing finally working, training failed immediately with KeyError: 'LOCAL_WORLD_SIZE'. I added the environment variables. Then it failed with Default process group has not been initialized.
The NVIDIA training container expects torch.distributed to be set up. I was overriding the container's entrypoint with python main.py and bypassing all the initialization logic. The actual entrypoint uses torchrun:
ENTRYPOINT ["bash", "-lc", "torchrun --standalone --nnodes=1 --nproc-per-node=1 main.py --config config.json"]
Once I switched to using torchrun instead of raw python, everything just worked. The distributed initialization happens automatically, even for single-GPU training.
The lesson here is don't fight the container's design. If someone built it to use torchrun, use torchrun. Setting environment variables manually to work around it is fragile and will break in ways you don't expect.
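To make the difference concrete, here is a hedged sketch of the two things I ended up caring about: which environment variables torchrun provides, and what the launch command looks like as an argument list. The helper names are hypothetical and the variable list is a subset of what torchrun sets; the flags mirror the entrypoint shown above.

```python
import os

# Subset of the variables torchrun exports to each worker process.
TORCHRUN_ENV_VARS = (
    "RANK", "LOCAL_RANK", "WORLD_SIZE", "LOCAL_WORLD_SIZE",
    "MASTER_ADDR", "MASTER_PORT",
)


def missing_torchrun_env(environ=None):
    """Return the torchrun-provided variables absent from the environment."""
    env = os.environ if environ is None else environ
    return [name for name in TORCHRUN_ENV_VARS if name not in env]


def torchrun_command(script, script_args=(), nproc_per_node=1):
    """Build a single-node torchrun invocation as an argument list."""
    return [
        "torchrun", "--standalone", "--nnodes=1",
        f"--nproc-per-node={nproc_per_node}",
        script, *script_args,
    ]


# Launching with raw python leaves all of these unset.
missing = missing_torchrun_env({})
command = torchrun_command("main.py", ["--config", "config.json"])
```

Every name in that missing list is one more variable to set and keep in sync by hand, which is exactly the fragile workaround to avoid.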
Don't rewrite working code
Early on, I decided to "simplify" the preprocessing script. The original was 900 lines with all sorts of feature engineering that seemed excessive. I rewrote it in 200 lines.
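It didn't work. cuDF's str.replace defaults to regex mode, so data[COL_AMOUNT].str.replace("$", "") silently did nothing: as a regex, "$" is the end-of-string anchor, not a literal dollar sign, so the dollar signs stayed in the data. Pandas handles this gracefully. cuDF does not.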
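The regex behavior is easy to reproduce without a GPU. This sketch uses Python's standard re module as a stand-in for regex-mode replacement; it is not cuDF code, and the amount value is made up for illustration.

```python
import re

amount = "$134.09"  # illustrative value, not taken from the dataset

# Regex mode: "$" only matches the empty position at the end of the string,
# so the substitution leaves the text unchanged.
as_regex = re.sub("$", "", amount)

# Literal mode: escape the pattern, or use plain str.replace, to strip the sign.
as_literal = re.sub(re.escape("$"), "", amount)
plain = amount.replace("$", "")
```

In cuDF the equivalent fix is presumably to pass regex=False to str.replace or to escape the pattern; check the signature for the cuDF version you run.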
After fixing that, I found three more subtle bugs. Then two more. Eventually I threw away my rewrite and copied the original src/preprocess_TabFormer_lp.py with minimal modifications.
Working code is working code. Don't touch it unless you have to.
What the final pipeline looks like
┌──────────┐   ┌──────────────┐   ┌────────┐   ┌────────────────┐
│ Download │ → │ cuDF Preproc │ → │ Config │ → │ GNN+XGB Train  │
│ from S3  │   │ (RAPIDS/GPU) │   │ Writer │   │ (NVIDIA/GPU)   │
└──────────┘   └──────────────┘   └────────┘   └────────────────┘
     │                          │                      │
     ▼                          ▼                      ▼
┌───────────────────────────────────────────────────────────────┐
│                      Shared PVC (100Gi)                       │
└───────────────────────────────────────────────────────────────┘
                                │
                        ┌───────────────┐
                        │ Upload to S3  │
                        │ Model Registry│
                        └───────────────┘
The preprocessing step runs on a g6e.2xlarge with a 50Gi memory limit. Training uses torchrun for proper distributed initialization. Everything passes data through a shared PVC because S3 round-trips for multi-gigabyte intermediate files are painfully slow.
Total runtime is about 25 minutes. Total cost per run is roughly $0.78. Not bad for training a fraud detection model on 24 million transactions.
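The cost figure is simple arithmetic: 25 minutes at the $1.86/hr rate quoted earlier. A quick sketch, assuming the whole run is billed at that single instance rate and ignoring storage, data transfer, and any other nodes (run_cost is a hypothetical helper):

```python
from decimal import Decimal, ROUND_HALF_UP


def run_cost(hourly_rate: Decimal, minutes: int) -> Decimal:
    """Cost of a run billed at one hourly rate, rounded to whole cents."""
    cost = hourly_rate * Decimal(minutes) / Decimal(60)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# 1.86 * 25 / 60 = 0.775, which rounds to 0.78.
cost = run_cost(Decimal("1.86"), 25)
```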
What I learned
Match your CUDA versions. This is non-negotiable. Mismatches don't throw helpful errors, they segfault.
GPU memory and system memory are different problems. cuDF operations use VRAM, but data loading uses system RAM. You need enough of both, and Kubernetes memory limits apply to system RAM regardless of how much the node actually has.
Start with oversized instances and optimize later. I wasted days on OOM debugging that could have been avoided by starting with a g6e.4xlarge and scaling down once things worked.
Copy working code. Don't rewrite it. Especially when the "simplification" involves GPU libraries with subtly different behavior from their CPU counterparts.
Karpenter's WhenEmpty consolidation policy is your friend. GPU nodes are expensive. Having them scale to zero when there's no work saves a lot of money.
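For the Karpenter point, the relevant part of a NodePool is its disruption block. The sketch below writes it as a Python dict purely for illustration; it is not the manifest from the repository. The apiVersion, the pool name, and the five-minute consolidateAfter value are assumptions, and the node template (instance types, taints, node class) is left out.

```python
# Hypothetical NodePool fragment. Only the disruption settings are shown.
gpu_nodepool = {
    "apiVersion": "karpenter.sh/v1",  # assumption: depends on the installed Karpenter version
    "kind": "NodePool",
    "metadata": {"name": "gpu-training"},  # hypothetical name
    "spec": {
        "disruption": {
            # Remove a node only once no workload pods are left on it.
            "consolidationPolicy": "WhenEmpty",
            # Hypothetical grace period before an empty node is reclaimed.
            "consolidateAfter": "5m",
        },
    },
}

policy = gpu_nodepool["spec"]["disruption"]["consolidationPolicy"]
```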
Running the pipeline
The full implementation is available in the repository. The interactive notebook at notebooks/kubeflow-fraud-detection.ipynb has everything inline. Run it from a Kubeflow Notebook Server in the team-1 namespace and it will submit the pipeline, monitor progress, and test the Triton endpoint when training completes.
For the compiled YAML version:
cd workflows && uv run python -m workflows.cudf_e2e_pipeline
kubectl create -f fraud_detection_cudf_pipeline.yaml -n team-1
The model lands in S3 at s3://ml-on-containers-{account}-model-registry/model-repository/ where Triton picks it up automatically.
If you're interested in the technical details of the implementation, check out PR #14 which has all the code changes, debugging notes, and the full conversation around solving these problems.
See you next time, reader. Shardul
Published: January 6, 2026