6 things I'd do differently in my PhD

Why I’m writing this

If you’re starting a PhD in ML, or three years into one and feeling stuck, this might be useful. The question that keeps coming up from friends still in the program is the same: what did five and a half years actually teach you? My handful of answers have started repeating themselves, so I figured I’d write them down properly.

None of this is about the technical content of the research. That part is in the papers. This is the rest: six things I’d do differently if I were starting over in 2021. A few of them I learned the slow way. A couple I’m still learning. None of them are unique to ML, but the saturated-field framing is.

How do you figure out what to work on?

My qualifying exam was on object detection and tracking. I had assumed that was where I’d stay. What actually happened is that the first year of the PhD was mostly me saying yes to whatever my labmates were doing. I helped a senior student run experiments. I joined a team for a remote sensing competition we didn’t expect to win, and we didn’t. But to compete we had to take the Vision Transformer apart in detail, and that detail turned into the seed of the first paper I published the next year.

By the time I knew I liked masked image modeling I’d been working on it for months without calling it that. Looking back, the worrying I did the first six weeks about choosing the right area was wasted. The area chose itself once I was reading the right things.

What I’d do differently in year one: stop optimizing for the topic. Pick three labmates whose work seems interesting and ask if you can help. Six months in, the topic you end up with will be the one you actually want.

What I got wrong about applied research

I came in with a vague sense that foundational research was where the real ideas lived, that applied work was just turning the crank on existing methods, and that I’d learn the most by doing things myself. Only the last bit was partly right.

Foundational papers in ML almost always have five to ten authors and enough compute to run hundreds of experiments. Trying to do one with a single real collaborator and a modest cluster is not a realistic plan. Applied work, on the other hand, is harder than it looks. The methods that look fine on standard benchmarks tend to break on real datasets, and the research that actually matters is in understanding why. The medical imaging paper I’m proudest of came from exactly that gap, between what the benchmark numbers said and what was happening on the data we cared about.

If I were doing this over, I wouldn’t measure a research direction by how foundational it sounds. I’d measure it by whether the gap between the published numbers and the data I actually have is wide enough to publish something honest in.

How do you read a saturated field?

The instinct I had to unlearn early on was the engineer’s instinct: see the problem, think hard, build the solution. That works for most engineering. In ML research, in a saturated field, it doesn’t. Almost every niche problem has a paper attached to it. When you don’t have an exact match, there’s usually one for an adjacent space. You want a vision model for dental scans and you find one for CT instead. The job stops being about figuring it out alone. It starts being about reading enough to know what’s already been figured out, and learning to tell good research from bad.

For me that meant building a map. When I started working on masked image modeling I went through every paper that cited MAE and BEiT. First pass was abstracts and results tables only. I kept a running list with paper names, datasets, experiments, and reported numbers. The common datasets surfaced quickly: ImageNet for linear probe and finetune, ADE20K for semantic segmentation, COCO for object detection. For the strongest results I sketched the architectures on a whiteboard and took photos so I could put them side by side later. The differences started to cluster: RGB-based, codebook-based, EMA self-distillation, teacher-based like CLIP. I read the ablations to see what part of each model was actually doing the work. By the end I had a real mental model of what had been tried, what worked, and where there might be room to add something.
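As a rough illustration, the running list described above could live in a few lines of Python instead of a spreadsheet. This is only a sketch: the `Result` record, the helper names, the paper names, and the scores are all placeholders invented for the example, and it assumes higher scores are better on every benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One reported number from one paper (hypothetical record layout)."""
    paper: str      # paper name
    approach: str   # e.g. "RGB-based", "codebook-based"
    dataset: str    # e.g. "ImageNet", "ADE20K", "COCO"
    task: str       # e.g. "finetune", "semantic segmentation"
    score: float    # reported number; assumed higher is better


def group_by_approach(results):
    """Map each approach to the sorted list of papers that use it."""
    groups = defaultdict(set)
    for r in results:
        groups[r.approach].add(r.paper)
    return {approach: sorted(papers) for approach, papers in groups.items()}


def best_per_benchmark(results):
    """Map each (dataset, task) pair to the entry with the highest score."""
    best = {}
    for r in results:
        key = (r.dataset, r.task)
        if key not in best or r.score > best[key].score:
            best[key] = r
    return best


# Placeholder entries: not real papers, not real numbers.
results = [
    Result("paper-A", "RGB-based", "ImageNet", "finetune", 1.0),
    Result("paper-B", "codebook-based", "ImageNet", "finetune", 2.0),
    Result("paper-B", "codebook-based", "ADE20K", "semantic segmentation", 1.5),
    Result("paper-C", "RGB-based", "ADE20K", "semantic segmentation", 0.5),
]

for (dataset, task), r in best_per_benchmark(results).items():
    print(f"{dataset} / {task}: {r.paper} ({r.approach}) = {r.score}")
```

Grouping by approach is what makes the clusters visible. The per-benchmark view shows which entries deserve the whiteboard treatment.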

The protocol is useful but it’s not the point. The point is the skill underneath: learning to read papers based on results rather than author names or citation counts. Once you have that, you’ve earned the right to theorize about improvements. Before that, you’re guessing.

The thing I’d do differently from day one: build the map before having any ideas. List the papers, group them by approach, photograph the architectures, keep the numbers. No theorizing until the map exists.

How to escape incremental improvements

Once the map existed I hit the next wall. Every idea I had was just a combination of two existing papers in my subfield. That’s incrementalism, and in a saturated field it’s what every other PhD student is also trying at the same time. I’d gone deep in my own subfield. The next move had to be sideways: spending real time on NLP papers, on object detection, on diffusion. Different problem shapes, different design constraints, different architectural moves.

After a few months the ideas that started showing up had a different flavor. They started looking like: this trick from a completely different field would fix the component I knew was simplistic. The mechanism is straightforward in hindsight. You start from a baseline, you understand which components are weak, and then you look for any field that’s solved a structurally similar problem. The intuition for improvement comes from the gap between what is clearly not working in your component, and what you’ve seen work somewhere else.

Something I’d change from the start: spend a third of my reading on adjacent fields. Not for hobby reasons. For the architectural moves you’d never invent from inside your own.

What I do when I’m stuck

When you hit a roadblock the temptation is to assume it’s a dead end and walk away. Most of the time it isn’t. Most of the time the way out is to talk through it. I made a habit, especially when I was stuck, of building slides for whatever I had tried and walking my labmates and advisor through the bad results in detail. Sometimes someone in the room handed me a new direction. Sometimes their idea was wrong and I spent a minute explaining why, and the explanation itself was the thing that unstuck me. The parts I was most tempted to skip over were always the parts where I’d made a bad call and was defensive about it. Those were the parts I most needed to show. Bad calls happen, especially in the first year. Hiding them is what keeps you stuck.

The harder version of the same skill is knowing when to back off from a bad idea after you’ve already spent a long time on it. I did that with a paper currently in review. For most of the first year of that project I was trying to combine three losses on top of an existing masked-image backbone: distillation, reconstruction, and global alignment. I tried attention-based masking. I ablated every component. I tuned the loss weights by hand. I used GradNorm to auto-tune them. The numbers stayed flat for months. I had a Weights & Biases tab open in my browser for most of that year, refreshing it every few hours hoping the next run would move the line.

What I had been ignoring was the training data itself. The three losses were competing, not just globally but at the patch level. Different regions of the image needed different losses at different points in training. Once I saw that, I stopped trying to combine the losses better and changed the question. What if the model decided which patches got which loss? I built a mixture-of-experts router that did per-patch loss weighting, with each expert specializing into what one of the losses was actually good for. That ended up being the contribution.
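To make the difference between global and per-patch weighting concrete, here is a minimal sketch of the arithmetic only. It is not the router from the paper. It assumes each patch already has a value for each of the three losses (the per-patch "global_alignment" entry is a simplification for illustration) and that some router emits one logit per loss per patch. The function names and input layout are hypothetical.

```python
import math

LOSS_NAMES = ("distillation", "reconstruction", "global_alignment")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routed_loss(router_logits, patch_losses):
    """Average over patches of the router-weighted sum of per-patch losses.

    router_logits[p][k]: router score for loss k on patch p (assumed input).
    patch_losses[p][k]:  value of loss k on patch p (assumed input).
    """
    total = 0.0
    for logits, losses in zip(router_logits, patch_losses):
        weights = softmax(logits)  # one weight per loss, summing to 1
        total += sum(w * l for w, l in zip(weights, losses))
    return total / len(patch_losses)


# Two patches with placeholder loss values, ordered as in LOSS_NAMES.
patch_losses = [[3.0, 6.0, 9.0], [9.0, 6.0, 3.0]]

# Equal logits everywhere reduce to one global, evenly weighted sum.
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# Peaked logits let each patch lean on a different loss.
per_patch = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]

print(routed_loss(uniform, patch_losses))
print(routed_loss(per_patch, patch_losses))
```

One caveat with this naive form: if the router is trained only to minimize the weighted sum, it can drift toward whichever loss is currently smallest on each patch. A real design needs something to prevent that, and the sketch does not attempt it.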

What I’d do differently: when the metrics stop moving for a week, close the W&B tab. Sample fifty images by hand, color-coded by whatever you’re optimizing. The pivot you’d never come up with from theory comes from staring at the actual instances for an hour.

Why does open code matter so much?

In my second year I picked a SOTA paper to build on. It came from a respected lab that usually released code with their papers. They hadn’t released this one. The unofficial implementation I found didn’t work, so I started rebuilding from the previous paper’s released code, line by line. After many failed attempts where I kept thinking I was close, it took about a year to get the finetuning numbers to match what they had reported.

Was the year worth it? Not really. I eventually published a paper on top of it, but I could have picked a different baseline, something I could verify even if it wasn’t the latest SOTA, and saved most of that time. The downstream semantic segmentation numbers never quite matched theirs. My own semantic segmentation numbers were also weaker than I’d hoped, on a similar architecture, and I couldn’t say anything to reviewers about it because I couldn’t prove anything. I know that codebase inside out. I have my suspicions. The lesson stuck: a paper without open code is a paper you can’t argue with. Treat the reported numbers as an upper bound, and weight a baseline by whether you can actually verify it.

After that year my own work shifted. I started treating GitHub like the long version of my CV. I kept repos private during review, then made them open the day each paper landed. I uploaded weights to HuggingFace. I built small demos when I could. I kept the code reproducible enough that I’d be willing to walk a reviewer through any line of it. The selfish reasons are real: clean code makes your follow-up papers easier and your own re-runs less painful six months later. The recruiting angle is real too. But the underlying reason is the bigger one. After a year of wishing somebody else’s code was open, the least I could do was return the favor.

The single thing I’d change about this whole stretch: before picking any SOTA baseline, reproduce their published numbers first. Get that done in week one, not month six. If the code isn’t public, or the numbers don’t reproduce on the kind of cluster you actually have, pick a different baseline and don’t look back. The year I spent learning this lesson the hard way is a year I don’t want anyone else to spend.
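The week-one check can be as small as a script that compares the numbers you measured against the numbers in the paper. The sketch below is one hypothetical way to write it: the function name, the metric names, the tolerance, and every number are placeholders, not values from any real paper.

```python
def check_reproduction(reported, measured, tolerance):
    """Compare measured metrics against reported ones.

    Returns (ok, gaps). gaps[name] is measured minus reported, or None
    when the metric was never measured. ok is True only if every reported
    metric was measured and lands within `tolerance` of the paper's value.
    Overshooting by more than the tolerance also counts as a mismatch,
    since it suggests the setup differs from the paper's.
    """
    gaps = {}
    for name, target in reported.items():
        value = measured.get(name)
        gaps[name] = None if value is None else value - target
    ok = all(g is not None and abs(g) <= tolerance for g in gaps.values())
    return ok, gaps


# Placeholder numbers for illustration only.
reported = {"imagenet_finetune": 80.0, "ade20k_miou": 48.0}
measured = {"imagenet_finetune": 79.8, "ade20k_miou": 45.0}

ok, gaps = check_reproduction(reported, measured, tolerance=0.5)
print("keep this baseline" if ok else "pick a different baseline", gaps)
```

Choosing the tolerance before running anything keeps the decision from sliding once a year of sunk cost is on the table.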

Where I am now

I defended in April. I graduated in May. This past week I joined an Amazon Robotics team in Berlin, working on robotics with LLMs, and I’m pretty excited about it. Robotics + LLMs is far enough from where I spent the PhD that I’ll be running this whole loop again from the start: building a new map of a field I haven’t read, getting things wrong in front of new colleagues, asking them to walk me through what they’ve tried. Most of what I learned in the PhD I learned exactly that way, by getting things wrong in public and then walking my labmates through what I’d done. Writing this is doing the same thing out loud.

If any of this resonates, or rubs you the wrong way, leave a comment below. I’m especially curious what other people’s version of this list looks like, including engineers who never went through grad school. The lessons might be very different. They might also be the same.
