Zelin Wan

I work on LLM post-training and evaluation. The thread running through my work is building both the training pipeline and the benchmark that judges it. I run supervised fine-tuning (SFT), DPO, reinforcement learning with verifiable rewards (RLVR, using GRPO), and reinforcement fine-tuning (RFT) on open-weight models, and I design the agent evaluation environments that measure whether the training actually moved capability.

Model training. SFT, DPO, RLVR with GRPO, and RFT in a multi-phase pipeline on large-scale GPU infrastructure.
Model evaluation. Agent and tool-use benchmarks built as measurement instruments, with deterministic validators, anti-contamination controls, and capability-vs-refusal metrics.
Reinforcement learning. From RLVR for LLMs to the classical deep RL (PPO, A3C, DQN) used throughout my PhD research.
AI security. A CS PhD on game-theoretic and deep RL methods for cyber-defense, which I now apply to RL for LLMs.

On the evaluation side I treat benchmarks as instruments. I build agent and tool-use environments with deterministic validators, anti-contamination controls, bootstrap-CI scoring, and multi-trial averaging, and I rebuild task suites when frontier models saturate them so the benchmark keeps a meaningful discriminative spread. I also work on separating real capability from safety-driven refusal, so a model is not scored as weaker than it is just because it abstains.
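A minimal sketch of how multi-trial averaging and bootstrap-CI scoring can fit together. It is illustrative only: the function names, the task names, the pass/fail data, and the choice to resample tasks rather than individual trials with a percentile bootstrap are all assumptions, not a description of any particular benchmark.

```python
import random
from statistics import mean


def bootstrap_ci(task_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean per-task score.

    Assumption: tasks are the resampling unit, so the interval reflects
    uncertainty from the choice of tasks, not from trial-level noise.
    """
    rng = random.Random(seed)  # fixed seed keeps the score reproducible
    n = len(task_scores)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(task_scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(task_scores), lower, upper


def score_benchmark(trials_by_task, **kwargs):
    """Average repeated trials per task, then bootstrap over tasks."""
    # Multi-trial averaging: each task contributes one pass rate in [0, 1].
    per_task = [mean(trials) for trials in trials_by_task.values()]
    return bootstrap_ci(per_task, **kwargs)


# Hypothetical results: 1 = validator passed, 0 = validator failed.
results = {
    "task_a": [1, 1, 0],
    "task_b": [0, 0, 0],
    "task_c": [1, 1, 1],
    "task_d": [1, 0, 1],
}
point, low, high = score_benchmark(results)
print(f"score={point:.3f}  95% CI=[{low:.3f}, {high:.3f}]")
```

With only a handful of tasks the interval is wide, which is the point of reporting it: two models whose intervals overlap heavily are not meaningfully separated by the suite.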

I am currently a Senior AI Software Engineer at Postman in San Francisco. Before that, I built production ML systems (VQA, RAG, large-image computer vision) at Bobyard. My CS PhD at Virginia Tech was on game-theoretic and deep RL methods for cyber-defense, and I have 15 papers (9 first-author) on reinforcement learning, uncertainty quantification, and cybersecurity. If you are working on post-training, evaluation, RL environments, or the security side of frontier models, I would be glad to talk.

Selected tools and methods

SFT, DPO, RLVR, GRPO, RFT, reward design, agent benchmarks and evaluation environments, anti-contamination, bootstrap-CI scoring, multi-trial averaging, capability-vs-abstention metrics, PyTorch, Python, distributed GPU/CPU pipelines, deep RL (PPO, A3C, DQN), computer vision (ViT, SAM, YOLO), LoRA.