Without a top-conference paper, without even an arXiv preprint, a blog post became a fast pass into OpenAI. Keller Jordan joined OpenAI on the strength of his Muon optimizer blog alone, and Muon may even be used to train the next-generation model GPT-5.
What conditions are needed to successfully apply to OpenAI?
A traditional academic background? Top conference papers? Studying under AI masters like Hinton and LeCun? Or being a technical celebrity on social media?
Or perhaps you only need to write a blog post.
Keller Jordan is a machine learning researcher who, at the end of 2024, designed an optimizer for neural networks' hidden layers called Muon and publicly documented his research progress.
Soon, community members began running parallel experiments and reporting results, and things got increasingly interesting: OpenAI and xAI both noticed him, and in the end he chose OpenAI!
Muon's second author, Yuchen Jin, put it bluntly: publishing a paper ≠ influence. Muon may already have been used in GPT-5's training.
Stop Blindly Chasing Top Conferences
Keller Jordan's story is somewhat like the sensation caused by DeepSeek's open-source releases. Their influence is hardly comparable, but the underlying logic points to the same thing:
In the rapidly iterating AI world, the traditional paper model looks outdated. Open, community-built, fast-feedback research may be the only way for humans to keep up with AI's speed of evolution.
Shital Shah, a research engineer at Microsoft Research, was thrilled to learn of Keller Jordan's experience; he has long believed research should be done "this way".
Even in "open" research laboratories, you'll see too many researchers being possessive about "early ideas".
Research sharing usually only occurs among close friends, and for a long time, people have been too obsessed with this...
Any idea requires months to be published through a paper.
And when it's finally published, it's often buried among numerous other papers.
If someone does notice it, improving it requires going through another equally long and difficult cycle.
Keller took a different approach.
He published his preliminary ideas as an open GitHub repository rather than as a paper.
People could immediately try these ideas and improve on them.
Anyone can verify everything at any time. Since everything is open, there is no room for cheating or exaggerated claims.
This truly deserves to be called "distributed real-time AI research"! Within days, Keller and others had improved the idea; people who saw its potential joined in and helped parallelize the work.
In traditional AI research, this feedback cycle would have taken over 6 months, not just 6 days.
On papers versus "speed-running the technology", Keller Jordan's view is unchanged from six months ago. Today he reposted his own tweet from February: even though Muon became popular and got him into OpenAI, he will not write a paper for it.
His point is clear: rather than produce a paper likely to be "buried" on arXiv, he would rather keep honestly working on his optimizer.
He even made a point of sharing his view of current AI optimizer papers: "They are all fake, all fluff."
Impact > Reputation
So what is Keller Jordan's background, that a single blog post got him recruited by OpenAI?
He obtained a double bachelor's degree in mathematics and computer science from UC San Diego in 2020, and also studied at UC Santa Cruz and UC Berkeley.
After graduation, he worked as a machine learning engineer at Hive, a company focused on generative AI. Subsequently, he served as a visiting researcher at the Vienna Complex Systems Center.
In December 2024, Keller officially joined OpenAI.
Among all his GitHub projects, the most influential is Modded-NanoGPT, with over 2.4k stars.
Using 8 H100 GPUs and only 0.73B training tokens, Keller and collaborators reproduced GPT-2-level training in just 3 minutes.
He has a personal blog that hasn't been updated since joining OpenAI, with the last article being about the Muon optimizer.
What exactly does this Muon article discuss?
An Optimizer Breaking Training Speed Records
In the deep learning field, optimizers are core tools driving model training efficiency and performance.
Then, in December 2024, the Muon optimizer appeared, setting new training-speed world records on NanoGPT and CIFAR-10 with its outstanding performance.
Muon is an optimizer designed for the 2D weight matrices in a neural network's hidden layers.
Its core idea: take the update matrix produced by SGD-momentum and orthogonalize it with a Newton-Schulz iteration, so that the applied update is close to a semi-orthogonal matrix. This improves training efficiency.
The iteration is simple and efficient, runs stably at bf16 precision, and keeps computational overhead low.
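For intuition, here is a minimal PyTorch sketch of that idea. The Newton-Schulz iteration and its quintic coefficients follow the pseudocode in Jordan's blog post (linked in the references); `muon_step` is a simplified illustration without Nesterov momentum or the blog's learning-rate scaling, and its hyperparameters are placeholders.

```python
import torch

def zeropower_via_newtonschulz5(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration: pushes G toward the nearest
    # semi-orthogonal matrix (roughly U @ V^T from G's SVD) while
    # running stably in bfloat16.
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    X = X / (X.norm() + eps)   # Frobenius norm bounds the top singular value by 1
    transpose = G.size(0) > G.size(1)
    if transpose:              # work in the wide orientation so A below is the small Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X      # applies p(x) = a*x + b*x^3 + c*x^5 to the singular values
    if transpose:
        X = X.T
    return X.to(G.dtype)

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    # SGD-momentum, except the accumulated update matrix is
    # orthogonalized before being applied to the weights.
    momentum_buf.mul_(beta).add_(grad)
    update = zeropower_via_newtonschulz5(momentum_buf)
    param.data.add_(update, alpha=-lr)
```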
Compared to the AdamW optimizer, Muon performs impressively across multiple tasks.
On CIFAR-10, it cut the training time needed to reach 94% accuracy from 3.3 to 2.6 A100-seconds, an improvement of about 21%.
In NanoGPT training on the FineWeb dataset, Muon reached a validation loss of 3.28 with a 1.35x speedup.
Moreover, Muon maintains training speed advantages on 774M and 1.5B parameter models.
Training a 1.5B-parameter Transformer to GPT-2 XL level takes Muon 10 hours on an 8xH100 node, versus 13.3 hours for AdamW, an efficiency gain of about 25%.
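In practice, the blog applies Muon only to the hidden layers' 2D weight matrices and pairs it with AdamW for the remaining parameters. Below is a hypothetical end-to-end sketch reusing `muon_step` from above; the toy model, data, loss, and learning rates are all stand-ins for illustration.

```python
import torch
import torch.nn as nn

# Toy stand-in model; real setups also exclude embedding and output-head
# matrices from the Muon group, not just non-2D tensors.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

hidden = [p for p in model.parameters() if p.ndim == 2]  # Muon handles these
others = [p for p in model.parameters() if p.ndim != 2]  # AdamW handles biases etc.
adamw = torch.optim.AdamW(others, lr=3e-4)               # illustrative lr
bufs = [torch.zeros_like(p) for p in hidden]             # momentum buffers

for step in range(100):
    x = torch.randn(16, 32)            # stand-in batch
    loss = model(x).square().mean()    # stand-in loss
    loss.backward()
    for p, buf in zip(hidden, bufs):
        muon_step(p, p.grad, buf)      # orthogonalized momentum update
    adamw.step()
    model.zero_grad(set_to_none=True)
```

The split reflects the design choice described above: orthogonalization only makes sense for genuinely matrix-shaped parameters.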
So, how significant is Muon's influence in the AI circle?
A Microsoft team used the Muon optimizer in a January paper.
Machine learning practitioners have analyzed it in depth, and more and more research is adopting Muon.
Muon's Potential
Artificial intelligence is developing at a rapid pace, and model training remains its core process. Optimizers play a crucial role, adjusting model parameters so that performance on the data improves.
For the past few years, AdamW has been the mainstay for training large language models.
It has enabled models like GPT, LLaMA, and Qwen to learn both steadily and quickly.
However, as parameter counts have grown from hundreds of millions to hundreds of billions, and training runs have stretched from days to weeks or even months, AdamW's limitations have begun to show: its efficiency is being challenged at ultra-large scale.
Further improving AI capabilities requires larger models and more training resources.
But computing resources are costly, and prolonged training time can slow down research and application progress.
Therefore, developing more efficient optimizers is not just a technical pursuit, but an urgent economic and practical need.
Then Muon "quietly appeared". Although it has not yet become the industry's focus, its unique design and excellent performance suggest that it could be a major foundational innovation in AI model training.
And this significant innovation did not come from a famous paper or renowned team, but was merely a "practice" by Keller Jordan.
The "Chaotic Status" of the AI Researcher Job Market
Many AI research PhDs seem to have fallen into the misconception that publishing papers at top conferences is the ultimate goal.
There was a time when publishing a paper was equivalent to generating impact!
ResNet, Seq2Seq, Adam, Attention, Transformers, MoE all emerged in the form of papers.
The real mistake is failing to realize that this situation no longer applies.
Publishing articles ≠ Influence.
Muon was just a blog post. It helped Keller enter OpenAI, and he may now be using it to train GPT-5.
Keller is not an isolated case!
Even without a doctoral degree, one can join OpenAI. Yesterday, James Campbell announced he was giving up his PhD to bring memory and personality to ChatGPT and AGI.
The traditional peer review cycle simply cannot keep up with modern artificial intelligence research and development.
Of course, peer review may still be necessary for AI research.
But open source is like the new peer review: real-world adoption and reproducibility matter more.
Unfortunately, academia's incentives are somewhat misaligned. Scholars need "evidence" to advance their careers (promotions, funding, peer recognition),
and the most valued form of evidence is a paper at a top conference.
It is hard to say conclusively whether top AI companies' hiring has shifted from looking purely at papers to weighing papers, engineering, and community contributions together.
But as OpenAI officially states, they "do not rely solely on academic qualifications, but focus more on actual potential and skills".
Regardless of the path, the key is to produce solid results (whether papers, code, or projects) and generate substantial impact.
References:
https://kellerjordan.github.io/posts/muon/
https://www.51cto.com/aigc/4707.html
https://x.com/Yuchenj_UW/status/1934291648542126580
https://x.com/kellerjordan0/status/1890178773586489716
https://shital.com/blog/tweets/thread/202410131001-adamw-who-new-optimizer/
This article is from the WeChat public account "New Intelligence" (editors: Ding Hui, Peach), republished by 36Kr with authorization.