2025 · 01 · 22 · 12 min read · deep learning · medical imaging

What U-Net taught me about reading MRI slices.

Field notes from a TÜBİTAK 2209-A funded undergraduate project on brain tumor segmentation. Less about architectures, more about the unglamorous decisions — normalisation, patch sampling, honest evaluation — that decide whether your Dice score means anything at all.

I have spent the better part of a year training U-Nets to outline glioma on brain MRIs. The work is funded by a small TÜBİTAK 2209-A undergraduate grant — see the longer write-up on the research page — and the headline result, "Dice ~0.87 on whole-tumor", is the part that fits on a poster.

It is also the least interesting part of the project. What I actually want to write down, before I forget, is what training a segmentation network changed about how I look at a brain scan. The network is a forcing function. It will not let you be sloppy about the things you used to be sloppy about. This is a list of those things.

1. The dataset is not the dataset

The first lie of medical-imaging ML papers is that the dataset is a fixed object. You download BraTS-style multi-modal MRIs — T1, T1ce, T2, FLAIR — and a corresponding set of segmentation labels, and you treat the whole thing as a single thing called "the dataset."

Then you start looking at it. And it turns out:

Some volumes were skull-stripped with one tool, others with another. The leftover dura mater on a few patients confuses every model you train.
The intensity histograms of T1ce vary by an order of magnitude between scanners. Some patients were imaged on a 1.5T machine, some on a 3T.
The "ground truth" labels were drawn by different annotators with different thresholds for what counts as edema. You can sometimes guess the annotator from the label style.

None of these are bugs. They are the actual texture of medical data. But if you treat them as bugs and try to "clean" them away, you will train a model that works beautifully on your sanitised slice of the world and falls apart the moment a real hospital sends you a scan from a different scanner. The dataset is a population, not a thing.

2. Normalisation is half the model

The single biggest jump in my early Dice scores did not come from changing the architecture. It came from changing how I normalised intensities. I had been doing the textbook thing — divide by max, or min-max scale per volume — and it was leaving the network to learn intensity invariance from data, which is a slow and expensive thing to ask it to learn.

Switching to per-volume z-score normalisation on non-zero voxels (i.e. ignore the air around the brain, then standardise) added something like five Dice points overnight. That is more than I ever got from a bigger encoder, a fancier loss, or any of the architectural ablations I ran for two months after.
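
As a rough sketch of that normalisation step in NumPy (not the project's actual code): the function name `zscore_nonzero` is made up for illustration, and it assumes background voxels are exactly zero after skull-stripping, which may not hold for every pipeline.

```python
import numpy as np

def zscore_nonzero(volume, eps=1e-8):
    """Standardise one MRI volume using statistics from non-zero voxels only."""
    volume = np.asarray(volume, dtype=np.float32)
    brain = volume != 0  # assumption: background is exactly 0 after skull-stripping
    out = np.zeros_like(volume)
    if not brain.any():
        return out  # empty volume: nothing to normalise
    # Compute statistics in float64 to avoid precision loss on large volumes.
    mean = volume[brain].mean(dtype=np.float64)
    std = volume[brain].std(dtype=np.float64)
    out[brain] = (volume[brain] - mean) / (std + eps)
    return out

# Toy volume: a "brain" cube of arbitrary scanner intensities surrounded by air.
rng = np.random.default_rng(0)
vol = np.zeros((8, 8, 8), dtype=np.float32)
vol[2:6, 2:6, 2:6] = rng.uniform(100, 2000, size=(4, 4, 4))
norm = zscore_nonzero(vol)
```

One design choice to be aware of: the background stays at 0, which after standardisation is also the mean brain intensity. Some pipelines push the background to a distinct value instead.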

Lesson: in medical imaging, "preprocessing" is not a chore you do before the real work. It is the work. The model is the cheap part.

3. The class imbalance is worse than it looks

If you measure "tumor vs. background" you get a class ratio that is bad but tractable — maybe 1% positive voxels on average. If you measure "enhancing tumor vs. background", it is closer to 0.1%. A naïve cross-entropy loss will learn, very quickly, that the safe bet is to predict zero everywhere, and your loss will be wonderful, and your model will be useless.

Three things, layered, fixed this for me:

Patch-based sampling. Instead of feeding the whole volume, sample 3D patches that are biased toward containing tumor voxels. ~50/50 tumor-vs-background patches at training time gets the network to actually look at the rare class.
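Soft Dice loss combined with categorical cross-entropy. Dice is naturally robust to imbalance because it is computed only from the overlap and the sizes of the predicted and true positive regions, so the vast, correctly predicted background never enters the score.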
Focal Tversky for the small enhancing class. An ablation, but a useful one — it gave the smallest sub-region a fighting chance without destabilising the rest.
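
The patch sampling in the first item could look roughly like the following. This is a minimal sketch, not the project's sampler: `sample_patch`, the channel-first `(C, D, H, W)` layout, the 32-voxel patch size and the "label > 0 means tumor" convention are all assumptions made for illustration.

```python
import numpy as np

def sample_patch(image, label, patch_size=(32, 32, 32), p_tumor=0.5, rng=None):
    """Crop one 3D patch; with probability p_tumor, centre it on a tumor voxel.

    image: array of shape (C, D, H, W); label: array of shape (D, H, W).
    """
    rng = np.random.default_rng() if rng is None else rng
    spatial = np.array(label.shape)
    size = np.array(patch_size)
    if np.any(size > spatial):
        raise ValueError("patch_size must fit inside the volume")

    tumor_idx = np.argwhere(label > 0)  # assumption: any non-zero label is tumor
    if len(tumor_idx) > 0 and rng.random() < p_tumor:
        centre = tumor_idx[rng.integers(len(tumor_idx))]
    else:
        centre = rng.integers(0, spatial)  # uniform random voxel

    # Clamp the corner so the patch stays inside the volume.
    start = np.clip(centre - size // 2, 0, spatial - size)
    z, y, x = start
    dz, dy, dx = size
    img_patch = image[..., z:z + dz, y:y + dy, x:x + dx]
    lab_patch = label[z:z + dz, y:y + dy, x:x + dx]
    return img_patch, lab_patch

# Toy example: 4 modalities, one small "tumor" near a corner of the volume.
rng = np.random.default_rng(0)
image = rng.normal(size=(4, 48, 48, 48)).astype(np.float32)
label = np.zeros((48, 48, 48), dtype=np.uint8)
label[40:44, 5:9, 20:24] = 1
img_p, lab_p = sample_patch(image, label, p_tumor=1.0, rng=rng)
```

Note that the uniform branch can still land on tumor by chance, so the realised tumor fraction sits slightly above `p_tumor`. If strictly tumor-free background patches matter, that branch would need a rejection step.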

None of this is a paper-worthy contribution. All of it is the difference between a model that segments and a model that pretends to.
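
To make the two losses above concrete, here is a NumPy sketch for a single binary class. It is meant to show the formulas rather than to be dropped into a training loop: a real pipeline would write these with the framework's differentiable ops. The `alpha`, `beta` and `gamma` defaults are illustrative assumptions, not the values used in this project, and exponent conventions for focal Tversky differ between implementations.

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-6):
    """1 - soft Dice between predicted probabilities and a binary mask."""
    probs = np.asarray(probs, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    intersection = (probs * target).sum()
    dice = (2.0 * intersection + eps) / (probs.sum() + target.sum() + eps)
    return 1.0 - dice

def focal_tversky_loss(probs, target, alpha=0.7, beta=0.3, gamma=4 / 3, eps=1e-6):
    """(1 - Tversky index) ** (1 / gamma); alpha weights FN, beta weights FP."""
    probs = np.asarray(probs, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    tp = (probs * target).sum()
    fn = ((1.0 - probs) * target).sum()
    fp = (probs * (1.0 - target)).sum()
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1.0 - tversky) ** (1.0 / gamma)

# A "lazy" model on a 0.1%-positive class: near-zero probability everywhere.
n = 100_000
target = np.zeros(n)
target[:100] = 1.0
lazy = np.full(n, 0.001)

# Binary cross-entropy barely notices; the overlap-based losses do.
bce = -np.mean(target * np.log(lazy) + (1.0 - target) * np.log(1.0 - lazy))
dice_l = soft_dice_loss(lazy, target)
ftl = focal_tversky_loss(lazy, target)
```

With `alpha = beta = 0.5` and `gamma = 1` the Tversky form reduces to soft Dice, which is a handy sanity check when wiring either loss into a real training script.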

4. The validation split lies more than your loss

The single most embarrassing bug I shipped, briefly, in version one of this pipeline: I split the data into train and val by slice, not by patient. Each MRI volume has ~150 slices. If you randomly shuffle slices and then split 80/20, then for every patient in val, ~80% of that patient's slices, including the near-identical neighbours of each val slice, are in train.

The model was not segmenting. It was memorising patients. My val Dice was nearly a full point higher than the test Dice, and I spent two weeks chasing imaginary improvements before I noticed.

Rule I now write on every notebook: the unit of the split is the patient, not the slice, not the patch, and not the augmented sample. Anything else is a leak waiting to happen.
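
A patient-level split is only a few lines. The sketch below assumes every slice (or patch, or augmented sample) carries a patient ID; `split_by_patient` and the toy IDs are hypothetical names for illustration, not the project's code.

```python
import numpy as np

def split_by_patient(patient_ids, val_fraction=0.2, seed=0):
    """Return boolean (train_mask, val_mask) over samples, split by patient."""
    patient_ids = np.asarray(patient_ids)
    unique = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique)
    n_val = max(1, int(round(val_fraction * len(unique))))
    val_patients = shuffled[:n_val]
    # Every sample follows its patient: no patient straddles the two sets.
    val_mask = np.isin(patient_ids, val_patients)
    return ~val_mask, val_mask

# Toy example: 10 patients with 150 slices each.
ids = np.repeat(np.arange(10), 150)
train_mask, val_mask = split_by_patient(ids, val_fraction=0.2, seed=0)

# The check worth keeping in the pipeline permanently.
assert set(ids[train_mask]).isdisjoint(set(ids[val_mask]))
```

scikit-learn's `GroupShuffleSplit` covers the same ground if it is already a dependency. Either way, the disjointness assertion is cheap enough to run on every training start.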

5. Dice is a mean. Stop reporting only the mean.

Reporting "we achieved a Dice of 0.87 on whole-tumor" is, technically, fine. It is also the kind of sentence that has caused a lot of medical-imaging work to over-promise and under-deliver. Because that 0.87 is a mean across patients, and the underlying distribution is almost always heavy-tailed: most patients are at 0.92, and a long tail of difficult cases is at 0.4-0.6.

Those tail cases are the ones a clinician would notice. A model that performs at 0.92 on easy gliomas and 0.45 on the unusual ones is, in practice, a model that fails exactly where you would most want it to help.

I have switched to always reporting Dice, IoU and HD95 as median and IQR across patients, plus an explicit count of how many patients fall below some threshold (say, Dice < 0.7). It is less impressive on a slide. It is much harder to be wrong about.
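
A sketch of that reporting for Dice alone (IoU and HD95 would follow the same pattern). The helper names are made up, the convention that two empty masks score 1.0 is an assumption, and the per-patient scores are invented numbers chosen to mimic a heavy tail, not results from this project.

```python
import numpy as np

def dice_score(pred, truth):
    """Hard Dice between two binary masks for one patient."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * np.logical_and(pred, truth).sum() / denom

def summarise(scores, threshold=0.7):
    """Median, IQR and failure count across per-patient scores."""
    scores = np.asarray(scores, dtype=np.float64)
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    return {
        "n": int(scores.size),
        "mean": float(scores.mean()),
        "median": float(median),
        "q1": float(q1),
        "q3": float(q3),
        "iqr": float(q3 - q1),
        "n_below": int((scores < threshold).sum()),
    }

# Illustrative (made-up) per-patient Dice values with two hard cases.
per_patient = [0.92, 0.93, 0.91, 0.94, 0.92, 0.90, 0.93, 0.45, 0.55, 0.92]
report = summarise(per_patient, threshold=0.7)
```

On numbers shaped like these, the mean is dragged well below the median by two patients, and `n_below` names exactly how many cases a clinician would object to.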

6. The model's mistakes are diagnostic of its training data

This is the part that genuinely changed how I look at scans. After enough training runs, you start to notice patterns in where your network is wrong.

If it confuses ventricles for necrotic core, your FLAIR normalisation is probably off.
If it under-segments small enhancing tumor, you almost certainly didn't sample enough of those patches at training time.
If it produces ragged, speckled predictions near the cortex, you are probably training on too few slices and overfitting to mid-brain anatomy.

Once you see this, you cannot un-see it. The error map becomes a kind of mirror of your own data choices. The network is not failing; it is reporting, very precisely, what you taught it to think a tumor looks like.

7. A baseline is more useful than a contribution

One of the under-stated goals of this project — and probably the one I am proudest of — is not the model itself. It is the goal of leaving behind a reproducible U-Net baseline that another undergraduate at my university can fork in a week and run end-to-end on a single GPU.

Most academic ML code I have inherited as a student is, charitably, hostile. A model checkpoint sits in a forgotten Google Drive folder. Preprocessing is half in a Jupyter notebook, half in someone's bash history. The "results" require running three scripts in the right order with the right CUDA version, and the right CUDA version is not in the README.

I want the artifact I leave behind to be the opposite of that. One repo. One training script. One config file. One synthetic-data smoke test that runs in 90 seconds on CPU. A README that assumes nothing and apologises for nothing.

A research project ends when somebody else can reproduce it. Until then, it is a personal achievement, not a public one.
— something my advisor said, which I keep stealing

What I'd do differently next

For the next iteration — and there will be one — the changes I want to try are not architectural. They are all upstream:

Better skull-strip QA before training; potentially a learned skull-strip step.
Histogram matching across scanners, not just per-volume z-scoring.
Per-patient temperature scaling on the output, so uncertainty estimates are actually calibrated.
An honest "out-of-distribution" test set from a different scanner / institution, used only at the very end.
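
For the histogram-matching item, one generic way to do it is quantile mapping against a reference volume. This is only a sketch of the idea, not something this project has implemented or validated: `match_histogram`, the choice of 101 quantiles, and the "non-zero voxels are brain" convention are all assumptions.

```python
import numpy as np

def match_histogram(source, reference, n_quantiles=101):
    """Map source intensities onto the reference's distribution (non-zero voxels)."""
    source = np.asarray(source, dtype=np.float32)
    reference = np.asarray(reference, dtype=np.float32)
    src_mask = source != 0  # assumption: background is exactly 0
    ref_mask = reference != 0
    out = np.zeros_like(source)
    if not src_mask.any() or not ref_mask.any():
        return out
    q = np.linspace(0, 100, n_quantiles)
    src_q = np.percentile(source[src_mask], q)
    ref_q = np.percentile(reference[ref_mask], q)
    # Piecewise-linear, monotone mapping from source quantiles to reference quantiles.
    out[src_mask] = np.interp(source[src_mask], src_q, ref_q)
    return out

# Toy example: same anatomy "shape", intensities on very different scales.
rng = np.random.default_rng(0)
src = np.zeros((16, 16, 16), dtype=np.float32)
ref = np.zeros((16, 16, 16), dtype=np.float32)
src[4:12, 4:12, 4:12] = rng.gamma(2.0, 300.0, size=(8, 8, 8))
ref[4:12, 4:12, 4:12] = rng.gamma(2.0, 30.0, size=(8, 8, 8))
matched = match_histogram(src, ref)
```

A caveat worth stating: matching a tumor-bearing volume to a reference can distort exactly the intensities that mark the tumor, so any such step would need checking against the segmentation labels before being trusted.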

None of these will make for an exciting paper title. All of them, I think, would move the work closer to something a clinician could actually look at without rolling their eyes.

If you are working on something similar — or if any of this contradicts your experience — I'd love to hear about it. Brain tumor segmentation has many crowded leaderboards and very few honest field notes, and I would rather this post be the start of a conversation than the end of one.

u-net · medical imaging · deep learning · tensorflow · brats · segmentation
