#238 - GPT 5.4 mini, OpenAI Pivot, Mamba 3, Attention Residuals
2026-03-26 06:00:00 • 2:00:49
Last Week in AI would like to thank ODSC for being a sponsor. ODSC is one of the longest running and largest communities focused on applied data science and AI. It started over a decade ago with a simple idea: bringing practitioners together to learn from people actually building and deploying models in the real world, not just talking theory. On April 28th through the 30th, you can experience it yourself at ODSC East 2026, taking place in Boston and virtually. There will be thousands of hybrid attendees, ranging from data scientists, ML engineers, AI researchers, and technical leaders. You can attend over 300 sessions covering LLMs, GenAI, computer vision, NLP, data engineering, and more. You can also get hands-on training with workshops and boot camps taught by experts from companies like OpenAI, Hugging Face, NVIDIA, and other top companies and universities. And of course, there will be a massive expo and networking opportunities, great for startups, hiring managers, and AI tool builders. It's one of the best ways for AI practitioners and teams to stay ahead of the field, learn from the best, and connect with the community. Go to odsc.ai/east and use promo code LWAI for an additional 15% off your pass to ODSC East 2026. That's odsc.ai/east with code LWAI to get an extra 15% off on the number one AI builders and training conference.
We'd like to thank Factor for sponsoring Last Week in AI. Not related to AI, but I am personally a big fan. Often I have no time to cook, and Factor makes healthy eating easy. They're fully prepared meals, with dishes crafted by chefs. I've actually used it for many years, and I think you can really eat well without the planning or the cooking using Factor. They use quality, functional ingredients like lean proteins and colorful veggies, and there are no refined sugars or artificial sweeteners. You have 100 rotating weekly meals to keep things fresh, and you can choose types of meals like high protein, calorie smart, Mediterranean, and others. It's really convenient, ready in about two minutes. In my experience, it really is fair to say that there's no prep necessary and it's quite good. I've used it for many years, and I think you might want to consider it if it fits your lifestyle. Head to factormeals.com/LWAI50off and use code LWAI50off to get 50% off and free breakfast for a year. Eat smart with Factor. Offer valid on select plans only; one free breakfast item per box for one year while your subscription is active.
At Arizona State University we're bringing world class education from our globally acclaimed
faculty to you. Earn your degree from the nation's most innovative university online.
It's a degree better. Learn more at asuonline.asu.edu.
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can also go to lastweekin.ai for our newsletter with even more news every week. I'm one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at the startup Astrocade. I'm your other host, Jeremie Harris, from Gladstone AI, AI national security, AI risk, all those fun things. By the way, special thanks to Andrey for recording at this time. We bumped things up even earlier. I think it's, what is it, seven thirty your time? Is that it? It is, here on PST. Sometimes it's nice to be East Coast and be a bit later. But you know, it's good to get your day started early sometimes. Yeah, much appreciated. Also, appreciate people tuning in or watching the entire last podcast, which featured an extra hour plus, which I didn't realize was that long, but Andrey had to hop off, so I just went through some of the technical papers we didn't cover as an experiment. And man, did I go on. So I have learned that I need Andrey to be the regularizing term to my loss function, if that makes sense. Yeah, we do know some people are very much fans of the coverage of research, and that was going in depth. So feel free to comment on YouTube or elsewhere and say if you want more of that. We've considered maybe having additional episodes that are just research, so that could very much happen. But feel free to let us know if you'd like even more research on a regular basis. And just to give a quick preview of what we'll be doing this episode: not as much research as last one. There's a bit of everything, I suppose. There are a couple of new model releases and some other kind of interesting tools. There are some interesting developments on the business front with OpenAI. And then we've been covering a lot of safety and interpretability work lately on alignment, so we'll have some of that. And then, towards the end, we'll have some fairly interesting, impactful, forward-looking research that feels radical, more sort of like, wow, this might actually be a big deal. So it should be a pretty fun listen.
Let's kick it off with tools and apps. First up, OpenAI: they have shipped GPT 5.4 Mini and Nano. Similarly to other small models that we've seen in recent times, they're actually really quite good for being in the smaller range. GPT 5.4 Mini is close to GPT 5.4 on several benchmarks, including SWE-Bench Pro and OS World, and it's, moreover, twice as fast. GPT 5.4 Nano is obviously the smallest option. It's not really doing great on the benchmarks, but it is super quick. These models have 400,000 token context windows, so fairly substantial. But they do cost a decent amount relative to GPT 5 Mini. It looks like GPT 5.4 Mini costs 3x GPT 5 Mini, and GPT 5.4 Nano also costs more. So on the whole: good, faster, smaller models. If you need something that's doing better than GPT 5 Mini, now you have that option.
There's been a lot made about the cost situation, the per-token pricing I should say. It is higher, there's no question, right? So GPT 5.4 Mini is basically three quarters of a penny, sorry, three quarters of a dollar per million input tokens, versus 25 cents for GPT 5 Mini. So that's a 3x hike. But OpenAI says it only burns about 30% of the tokens that GPT 5.4 does in Codex, so it's actually going to be much more token efficient. And this is a metric that I think matters a lot more than many people tend to realize, right? We have, like, cost per token, but as we've seen, more tokens at inference time does not necessarily mean more performance. And that's the big catch. So when you multiply those together, right, 30% times 3x, you actually get a slight decrease in what you might think of as cost per performance, which is a little closer to what most people care about. And this will vary depending on the workload, but still quite interesting, right? So for the sort of orchestrated agentic tasks, the effective cost per outcome could actually be favorable compared to running the full model. So that's sort of interesting there.
One thing I will say: Nano is API-only. This one is priced at 20 cents per million input tokens and $1.25 per million output tokens, versus its much, much cheaper predecessor, which was 5 cents per million input tokens. That's a 4x lift, and roughly 4x again for output tokens, so you're looking at 4x overall. The weird thing is, OpenAI is pitching this for classification and data extraction, which are these very high volume workloads where you're usually quite cost sensitive because you're processing so much. And so that fourfold hike is going to sting the most for exactly the people who are being pitched this product. So it's a bit of an interesting position. It seems a little, I don't want to say at odds with OpenAI's position that they want to make intelligence too cheap to meter, but it certainly is, at least locally, a move towards: instead of racing to the bottom on inference costs, we're going to focus on model quality, and that's going to be our big differentiator.
You're going to care that we can get the right answer, not that we can get it cheaply,
which is where all the margin is. That at least is what Anthropic's approach certainly suggests, and what we're seeing elsewhere. So that's kind of interesting. You know, there are a bunch of interesting benchmarks; as you said, OS World Verified is an interesting one to look at here specifically. I know you mentioned SWE-Bench and a couple of others. This is basically a computer control benchmark, so it looks at how well the mini model can just control a computer. And what we see here is GPT 5.4 Mini hits 72%. If you look back at GPT 5.4, the full version, it hits 75%, so it's actually pretty close. And if you're thinking about the previous GPT 5 Mini, that was only 42%. So pretty big jump, especially given that we're getting up there on this benchmark in terms of saturating it. So all in all, a pretty interesting release. The buried lede, well, not exactly buried, but the real detail here is that token efficiency question. What kind of workload are you going to use this for? That's going to determine, I don't want to call it the total cost of ownership, because that's not quite the right metric here, but the total cost that you're exposed to, and the ROI. That's really becoming a key thing. Right model for the right workload is just going to be a critical dimension, at least for the next few months. Right. Yeah, I think these kinds of things showcase the fact that, you know, all these models kind of came out of the world of academia, right? And benchmarking is largely focused on capabilities, on how accurate your model is. So usually you don't highlight these kinds of more practical concerns: how quickly can you finish a task, how cost effective are you, if you do a task, how many dollars does it take, wall clock time, right? We just don't get these numbers, at least in the announcements. It's honestly a bit surprising that's still the case, but it's a culture question, I suppose.
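To make the cost-per-task arithmetic from a couple of minutes ago concrete, here is a minimal back-of-envelope sketch. The per-million-token prices are the ones quoted above; the baseline token budget per task is a made-up illustrative number, and the 30% token-efficiency figure is the one OpenAI cites.

# Back-of-envelope: effective cost per task, not just cost per token.
# Prices are per million input tokens, as quoted in the episode.
price_gpt5_mini = 0.25        # $/M tokens (GPT 5 Mini)
price_gpt54_mini = 0.75       # $/M tokens (GPT 5.4 Mini), a 3x hike

tokens_per_task_old = 1_000_000    # hypothetical token budget for an agentic task
efficiency = 0.30                  # GPT 5.4 Mini burns ~30% of the tokens (per OpenAI's claim)
tokens_per_task_new = tokens_per_task_old * efficiency

cost_old = price_gpt5_mini * tokens_per_task_old / 1e6     # $0.25 per task
cost_new = price_gpt54_mini * tokens_per_task_new / 1e6    # $0.225 per task

print(cost_new / cost_old)  # 0.9 -> ~10% cheaper per task despite the 3x per-token price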
And as you said, we'll be discussing OpenAI's strategy in just a bit in the business section. Very much in line, very much on topic: we'll just charge more for our models, but they're the best, and so people will follow. And speaking of small models, next up you've got Mistral. They have released their Small 4 family of models under the Apache 2.0 open source license. And it actually combines multiple things: it has reasoning built in, it has multimodal capabilities built in, and it has agentic coding optimization. So they are combining Magistral, Pixtral, and Devstral. It also uses a mixture of experts, so they have a total of 119 billion parameters, but only six billion active parameters per token. That's quite small; you can fit it into probably one top-end GPU, you know, it's actually quite affordable. So it looks like possibly a pretty strong model in this category of smaller, faster, cheaper, and open source. Yeah, the challenge there is going to be, you know, if you get into wanting to fine tune it, obviously, then, you know, now you're dealing with really a 120 billion parameter model, and that's, you know, a pain in the butt. But yeah, I mean, the only six billion active, I mean, this is really ultimately a bet that this kind of model sparsity, this very aggressive sparsity ratio, is going to outperform a dense model or something, you know, more traditional. And it's a bet as well, as you say, on the kind of hardware that is available, at least for inference, on local machines. So yeah, it's quite interesting. It is a more aggressive, fewer-active-parameters-per-token type of play than we've seen before.
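As a rough sketch of why those numbers matter: the full 119B parameters determine the memory you have to hold (and fine tune), while the 6B active parameters per token determine the compute per forward pass. The precision choices below are illustrative assumptions, not anything Mistral has specified.

# Illustrative memory / compute math for a 119B-total, 6B-active MoE model.
total_params  = 119e9
active_params = 6e9

bytes_per_param_fp16 = 2    # assumed 16-bit weights
bytes_per_param_int8 = 1    # assumed 8-bit quantization

mem_fp16_gb = total_params * bytes_per_param_fp16 / 1e9   # ~238 GB: needs multiple GPUs
mem_int8_gb = total_params * bytes_per_param_int8 / 1e9   # ~119 GB: fits one top-end GPU (141-192 GB class)

flops_per_token = 2 * active_params   # ~12 GFLOPs per token, roughly dense-6B-like compute
print(mem_fp16_gb, mem_int8_gb, flops_per_token)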
It's also kind of interesting. So you touched on the consolidation, right, of all these capabilities in a single model: reasoning, coding agents, multimodal, that's the Pixtral piece, like looking at images and text and so on. This is a notable play because historically, Mistral has separated these, right? And so given they're collapsing them into one model, just with a reasoning effort dial, you could see that as a pretty big bet on where the industry is heading. That is something that we've seen with, you know, obviously the o-series of reasoning models, with GPT-4o even, starting as far back as that. If it's the case that we get positive transfer, that's really what this is a bet on, right, that a model that is trained to do all these things will do better at each individual thing because it's kind of getting cross training, right, the same way that you might want to do, you know, football and ballet at the same time, or ice hockey, and you get better at each different thing because of the combination. That's kind of the idea here, right, that positive transfer that for so long was really difficult to pin down. People were seeing negative transfer, where the more stuff you train a model on, the worse it does on the marginal additional task. We're now well into positive transfer territory. The extension of that to reasoning is quite interesting, and implies that they're betting reasoning may eventually, or already does, play a role in analyzing images more successfully for
these kinds of models at this scale. So there's another piece here that's interesting: this efficiency claim. They say that Small 4 is achieving scores that are on par with GPT-OSS-120B on a bunch of benchmarks while generating shorter outputs, on at least one benchmark. And so again, this is about that token efficiency question, right? We're sort of back at that very underrated metric of output length efficiency, and the fact that you don't necessarily get more benefit by reasoning with more tokens. That's something we started pointing out when we looked early on at the DeepSeek reasoning results, right, where it's like, yes, you do get this positive uplift, but your value per token is actually potentially going down. And we are now, in fact, in that regime; we're seeing that very clearly. So this efficiency piece is really important, because if people are going to use open source models, by far the biggest use case is running these, whether on your own or other people's clusters, to serve customers. And so, you know, how good Mistral is at making this an efficient reasoner speaks directly to your bottom line as the person who's going to be serving these, or asking somebody else to serve them for you. So pretty interesting. You know, the fact that they're comparing to GPT-OSS-120B, look, the space moves really fast; that is an old model at this point in open source terms. It's kind of choosing your point of comparison pretty selectively, I would say, but they do comparisons that are reasonably favorable against other models as well. So it's an interesting play. I think, I mean, I'm a little concerned for Mistral, I have been for a long time, obviously; I don't know how this ends up playing out with the open source play, but here they are. It's a reasonable model, and the consolidation angle, that's a really big story here, right? If everybody is starting to consolidate, even open source players, around one model under one roof, that's a materially different story. It does not, by the way, extend necessarily to the world of agents, right? Sub-agents may be smaller, cheaper models. This is more about, you know, if you're looking for a highly performant model, think of the main orchestrator, you're probably going to be seeing models that put multiple capabilities
under one roof. So something to watch out for. Right. I'll just quickly comment on the unification bit. It's better to say that it's partially a bet on where things are heading with the multimodal aspect. The fact that they have baked in reasoning and coding, I think, is a little more just an indication of catching up with where things are these days. It used to be the case that you had a reasoning model, like o3, and with DeepSeek R1 you trained a reasoning model separately from your base model. And what everyone moved to in 2025 is that there is no separate reasoning model; your model has reasoning baked in. And now, with Sonnet, with GPT, with all the post-training for reasoning, it's been very clear that you should just train your model to be a good coder, because that makes it a smarter model in general. So this is a mix of catching up to where things are at, and also adding that multimodal capability, which could be interpreted in several ways.
Next up, Meta's Manus launches My Computer to turn your Mac into an AI agent. So this is something you can install and launch on your computer, and it's effectively like having a little OpenClaw, I guess, on your computer. It can execute command line instructions and interact with your computer. We've seen this happening more and more, where various organizations are shipping OpenClaw-esque things, where you have an agent that just lives somewhere and you can tell it to do stuff, and it is your assistant or your AI or whatever. This appears to be another instantiation of that, similar also to Perplexity's announcement of, what was it, Perplexity Computer? Yeah. So very much in line with that. Oh, you mean you weren't able to remember the sixth new OpenClaw variant that got launched by a company in the last two weeks? Yeah, it's crazy, right? We're really seeing more and more of this pile-on from all these different competitors in this space. And this is a land grab, make no mistake. This is the sort of, I don't want to call it the scramble-for-Africa moment, but the, you know, historical equivalent of that in this space, with fewer controversial overtones.
This is the moment where people are realizing, hey, you know what, we need to get on people's local machines, right? We have to get some piece of that pie, because so much of what's happening right now is that everybody is trying to grab onto the agentic runtime layer. In fact, Nvidia, we'll talk about this in a minute with NeMo Claw, it's the same thing. Everybody's trying to get onto the substrate on which agents are going to run. Can I get my dirty little hands on that and turn it into part of my market? And in this case, the Manus play is quite interesting and it's aging well, right? I mean, Manus was a totally independent company before the acquisition. And really the play here is Meta extending its reach into that local OS layer for the first time. This is the territory historically of, you know, your Apples, your Microsofts, your Googles. That's the war they're entering. We haven't seen Meta do that before, right? We just haven't seen them try to play in the operating system game. This is a really interesting way for them to vector into a completely different market using resources where, you know, maybe for the first time they have a credible, whether you call it an advantage or not, they have a credible play here. So pretty interesting. And again, Meta has a classic history of buying their way into competing, right? We saw this with WhatsApp, with Instagram; it never ends. This is their play in that direction. So, Manus, you know, maybe aging well, we've got to see what comes of this.
Right, Manus came out, I believe last year, initially with a cloud-based agent that you could assign to go do stuff, so this is them extending to your local computer. Right now you can, I think, get this for Apple silicon Macs. The other aspect of this, by the way, is not just the OpenClaw angle, it's the Cowork, the Codex angle, where most of the blog post is actually highlighting things like: let the agent go onto your computer and organize files and do things that are very much Claude Code, or now Cowork-esque. Being able to tell it to do stuff from anywhere at any point is an aspect of it, that's the OpenClaw aspect, but I think the real land grab is for that Cowork-type use, like having an agent do stuff for you, which is now everywhere in coding. But I think what Anthropic, and now OpenAI, and now everyone is realizing is that these agents can do a whole bunch of stuff that people haven't adopted them for yet. And speaking of OpenClaw, next up, Nvidia has announced NeMo Claw
as part of their announcements at GTC, which is a little bit funny. This is a stack for the OpenClaw agent platform that lets you install Nvidia Nemotron models, and we just discussed their latest Nemotron model last week, plus a new Nvidia open shell runtime. You install both of those in a single command, and you get privacy and security controls baked in, making it possible to have, you know, more confidence in running one of these things. We've seen many stories of OpenClaw going rogue, right? Absolute confidence, yeah. So the open shell runtime provides an isolated sandbox that enforces policy-based security, network, and privacy guardrails for agents. Seems like a good idea: if you are going to run one of these agents, maybe install it in a sandbox where you can control it so it doesn't go rogue and, you know, take over the world.
Yeah, and one of the key dimensions here, you know, when you think about what a sandbox means, how these things work: typically a lot of these setups distinguish between the model that's running on the cloud and the model that's running locally on your machine, right? So you imagine you might not want the model that runs locally on your machine, that's actually looking at your own intimate files, to have direct access to the internet, right? That's one way you might enforce that sort of guardrail: use that local model to just generate summaries or do analysis and stuff, and then only ship the summaries out after some review or something like that. That's one way to play that game, and there are a whole bunch of other guardrails around the kind of access that different models can have and what they're able to reach. So that's a lot of where this is coming from. You know, this is a classic example of Jensen's hyperbolic rhetoric, right? He's using terms like a new renaissance in software to play this up, which, to be clear, I actually agree with, but there's a gap between, you know, this framework can help you install things with a single command, and redefining how computing is done. We'll see if that chasm gets crossed, and it probably will at some point, man, I think it certainly will at some point; the question is who does it first.
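Just to make the guardrail idea concrete, here is a minimal, generic sketch of what a policy-based sandbox check might look like. This is purely illustrative and not based on Nvidia's actual runtime or any real API; the policy fields and model names are hypothetical.

# Hypothetical policy: what an agent sandbox might enforce before each action.
POLICY = {
    "local_model": {"network": False, "read_paths": ["~/Documents"], "write_paths": []},
    "cloud_model": {"network": True,  "read_paths": [],              "write_paths": []},
}

def allowed(model: str, action: str, target: str = "") -> bool:
    rules = POLICY[model]
    if action == "network":
        return rules["network"]
    if action == "read":
        return any(target.startswith(p) for p in rules["read_paths"])
    if action == "write":
        return any(target.startswith(p) for p in rules["write_paths"])
    return False

# The local model can read your files but never touch the network;
# the cloud model can use the network but never sees raw local files.
assert allowed("local_model", "read", "~/Documents/notes.txt")
assert not allowed("local_model", "network")
assert allowed("cloud_model", "network")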
So the frame here is, you know, again, that operating system piece, right? I keep going back to this. It's not a coincidence that everybody is going on the same, basically this gold rush expedition. Everybody's thinking the same thing: the operating system for personal AI, that layer. And again, Jensen here comparing Mac and Windows for PCs to OpenClaw for personal AI, that's a bold claim and a very deliberate one; it's part of that frame. This is Nvidia now saying, hey, Meta is going to get into, effectively, the operating system game, the sort of operating system for agents, and we're going to do the same thing. We don't quite have a history of doing that, but, you know, now we're diving in. So you're creating this environment where, essentially, we had the software-eats-the-world era of the SaaS revolution, you know, in the last decade and a half, two decades, and now we're in the AI-is-eating-the-world-including-software era, and another layer of abstraction here is that runtime environment for agents. That's the land grab, that's the operating system, at least this is the case that Jensen's making. I think it's a reasonable case on the whole, but it remains to be seen. This is functionally also a classic Nvidia play of trying to Trojan-horse in their full stack. So NeMo Claw is going to install Nemotron models, right, those Nvidia models, preferentially, and the open shell runtime, all in one command. The simplicity here is the point, so it's super, super easy to deploy Nvidia's own models together with the runtime environment, all that stuff. Basically, they're creating a situation where, just like they owned CUDA for the GPUs, they're creating a whole software stack around the agentic runtime environment. And this is what's worked for them in the past: create a whole ecosystem. Software is more commoditized now than it has been before, so that may not be the same kind of moat, especially given that, unlike before with CUDA where they had like a decade's head start before anyone really paid attention, this is now much more in competition with faster-moving players. So really interesting. Again, this is another entrant in the sort of operating-system-for-agents catalog here, right? I do think an interesting aspect is that NeMo Claw is kind of riding the hype, really good, cool branding for the actual major push, which is their open agent development platform, that terminal-based thing. It also comes with this Nvidia AIQ blueprint thing, which is built on top of LangChain deep agents and has the Nvidia NeMo Agent Toolkit, as these open source kinds of options for building your stack of agents. So the NeMo Claw aspect is piggybacking on the thing that everyone is hyped about, and then the actual software frameworks are probably the thing that Nvidia really cares about. Yeah, and this is it, right? It's like, how can I hook onto one big hype train and then, exactly, use that as a Trojan horse to get everybody dependent on their whole stack, including their models, because there hasn't been the kind of uptake of the Nemotron models, at least, that I expected to see so far. And so this is, you know, one opportunity for them to make that happen.
Yeah, I think to your point about software: CUDA obviously is the number one thing for GPU-related execution software, but as far as open source packages go, Nvidia hasn't had a history of really making an impact. So, for instance, LangChain is an open source package that goes back to 2023 and has been popular for building complex graphs of LLM prompts and agents and so on, probably overly complex, but anyway, it went on to gain broad community adoption. That's more or less been the pattern a lot, but with agents and now OpenClaw and whatever, this was a good time to try and get in on that open source game.
And one more story about Nvidia at GTC: they announced DLSS 5, which they're describing as the GPT moment for graphics. So this is runtime enhancement, I suppose, for game graphics. It uses machine learning based upscaling and it applies generative AI to add a whole bunch of really nice looking graphics. They have various examples; you probably want to just see it to understand it. Basically, you can go to older games, like Elder Scrolls Oblivion for instance, and get really cutting-edge-seeming graphics with this turned on. Now, the reaction to it has been very much split from what I've seen. People have often scoffed at it, like, oh, this is AI slop, this filter is bad; in some cases it seems to go against the style of the game somewhat. So Nvidia did also emphasize that this is fully controllable by the developer; the developer can set how aggressive it is, you know, how much it impacts various aspects of the rendering. Yeah, I think this is the kind of control it has presumably given developers in the past, this being DLSS 5, but with the generative AI aspect of it, it's getting a lot more discussion and it looks potentially very, very significant. Well, so I was going to ask you, I mean, this is the part of the show where I ask you for your opinion as a guy at Astrocade: is this something that you guys would integrate in your stack? How do you think about new tools like that, or is that even part of your workflow? Yeah, so this is kind of targeting that 3D, AAA, big game kind of market, right? This is dealing with newer and more complex games; you wouldn't run this on your phone for casual games. But for games that have complex character models with faces and stuff like that, or open world games where you traverse big landscapes, that's where this kind of photorealistic pass really makes a big difference.
And last up, we have an update on OpenAI's plan to launch ChatGPT's adult mode. This was announced, I think, last year. They are still thinking of doing it, but it has been delayed from the original late-March target, and it seems that they're still aiming to do it. The news here is that a team within OpenAI, their advisory council on psychology and neuroscience, opposed it at a January meeting, and one advisor warned about it pretty significantly. So anyway, we have a quick update saying that they appear to still be planning it; it has been delayed, but as of now it's presumably still going to be released. Yeah, well, what a surprise that, despite objections over the
appropriateness of a tool like this, they still went ahead, huh? That's weird, that's not my company at all, is it? Yeah, that's weird, what a weird thing. I know. This is, when you think about the things that were classically warned about, just my opinion here personally, but I seem to remember an awful lot of people warning us about the Brave New World thing where you're hooked up to the dopamine drip. Yeah, I don't know, ultra porn powered by AI superintelligence; I'm not sure how far you push the inference time compute budget before you get to that scenario, but hey, how about scaling laws for porn addiction, how about that paper? Looking forward to that coming out. I mean, look, that's the kind of thing we're gonna have to see; there are gonna have to be studies on scaling laws for porn addiction, just calling a spade a spade. I don't see how we avoid that. It's interesting that this is an offering; someone's gonna do it, right, the big porn companies at some point, whatever. So you could argue, and I'm sure this is part of the ethical argument that OpenAI might make here, look, this way we can monitor the use of these things ahead of time, understand how people adjust to this technology, maybe try to mitigate risks ahead of time, whereas porn companies probably wouldn't do the same thing. It's sort of a similar argument to what Sam's been making historically, right? We want to get these things out there because it's gonna happen anyway, so we want to get that feedback and be able to iterate on it. There's merit to that, but yeah, we should be under no illusions that we're crossing the Rubicon here. And I think, given that OpenAI is taking this step, it is on them then to come out with unbiased research that transparently looks at the effects of exactly what they're putting out there, right? I think that's just a reasonable thing; otherwise it's a bit like a tobacco company situation. When you're playing so close to the base of the human brainstem, you're in different territory. And so anyway, we'll see if this persists, we'll see what scandals come of it, I'm sure there will be some, but yeah, it's in some ways not surprising. I don't want to rip on OpenAI too much for this, because the argument is true that at some point, you know, the big porn companies will do this, and they have a history of mistreating women and, you know, doing all kinds of awful things. So ethically, I get it, but now the proof is gonna be in the pudding, because once you put that out there, and brand-wise, like, geez. Well, so I think it's worth clarifying
a little bit: porn might be a bit strong of a term here, right? They do clarify this will not generate erotic audio, images, or videos. This is really about, so, they've banned erotica as a general category with the chatbot, so if you try to write sexy stories or have sexy role play, you can't, though you can do that in plenty of other apps out there; there are a million AI girlfriends, including xAI's, right? So this is more of that. This is role playing with a sexual, kind of adults-only component to it, not straight up explicit content of the sort that you might think of when saying porn. So in some ways, yeah, I think there are arguments to be made either way. People have tried to do this and will try to do this, as is most likely with these models; people do seem to want it, there's very large adoption of these things, and we already have plenty of stories of people getting AI girlfriends and having psychological impacts. So yeah, maybe actually if ChatGPT and OpenAI do this explicitly and do it right and do it well, it's better than not doing it and having some shady dark market for it, you know. No, absolutely, and I take back the use of the term porn
here. I guess I'm sort of seeing where the puck is headed and anticipating that next play. This is definitely a step where we need OpenAI to actually run the studies, since they're going to have access to the data and no one else will, right? So we need people to look at this. I hope that they will; I would imagine this is part of the package here. But from a branding standpoint, this is really dicey, especially at a time when they're trying to redouble, we'll talk about this in a minute, on this whole sort of business productivity focus, like that's going to be their big play. So you're kind of adding that into the mix. It's interesting what this implies about the user base that they're going after; is there ROI here? I mean, porn is a pretty low margin industry, as I understand it, and I don't know if erotic role play would be a similar thing, flirting, if you will, with that, but either way, yeah, we'll find out. I think the argument that a lot of people want this is also kind of dicey. I mean, wow, a lot of people want crack cocaine, so we've got to find a way, now everything's allowed to continue? I'm being tongue in cheek here, obviously, but we have infinite inference time compute budgets coming online in the next, you know, three to five years or whatever. At what point do you just have so much inference time compute dedicated to your limbic system that, for all intents and purposes, you're robbed of your agency? I think we're actually going to be asking ourselves that question sooner than most people are pricing in, right? And then just to reiterate, the actual news here is just that there have been these details coming out about the internal objections from certain parts of OpenAI. These objections have been there since January, and the company hasn't canceled this feature outright. They could still not do it, and it has been delayed, but it's still on track, and it'll be interesting to see if they do go through with it after all. Yeah, no, that's true, and actually a lot of my reaction is reactionary. This direction generally is definitely coming, but it makes me nervous as hell, and that's sort of my response to this. You know, a lot of this is testing the waters too. I'm not saying this particular story was leaked, but you see governments do this, you see private companies do this; there are leaks of stuff that go out just to gauge audience, kind of consumer, response to things: can we take that step, can we not? I don't know, again, if this is or is not the case here, but man, you'll at some point need some kind of immune response to things that push in this general direction. And on to Applications and Business,
sticking with OpenAI and coming back to a topic we briefly touched on. Another thing that came out from OpenAI this week is a discussion that happened at an all-hands where, effectively, OpenAI has signaled a strategic shift away from doing a little bit of everything and toward focusing more on productivity and business. So, you know, OpenAI has historically had a million side projects, and I think that is how they characterize it, side projects: they have video models, they released the Sora app, they have a browser, they have audio models, transcription I believe, like a million things. Anthropic has Claude and that's basically it; OpenAI has Sora and their audio models and so on and so on. So it seems that internally, within the company, they are changing focus. The head of applications, Fidji Simo, has told staff that they cannot miss this moment because they are distracted by side quests, and they need to nail productivity. And you've seen this already; for the last several months it has been apparent that they have been really pushing on Codex to catch up with Claude Code, and Cowork as well, and they largely have, at least in terms of feature set. In terms of adoption, I mean, many people are starting to use Codex, some people prefer Codex at this point, but it is obviously true that this wasn't OpenAI's primary focus until recently, and they have kind of been behind Claude. In terms of adoption they are still behind, because Claude Code was the early leader in the category. So yeah, interesting to see OpenAI having that internal discussion. Now, I feel like this has been a problem with OpenAI for a while, if I were to kind of guess at internal dynamics and business- and company-level issues that lead to poor performance, so yeah, you kind of could see this coming. I think famously, and Andrey, I'm sure you'll have friends that have told you the same thing, like at Google, everybody I know who's ever worked at Google says the same thing, and actually same at Meta: you get promoted for building new stuff, right? It's not about did you make the code run more efficiently, did you clean this or that up; it's did you make new stuff. And that's why there's a massive app graveyard, right, famously, for Google products, you know, Google Hangouts, Google this, Google that, that just gets axed at various stages.
There's this fundamental question of like,
again, is this a feature or a bug, right?
You can look at Google and you can say,
ha ha, look at the graveyard of wasted time.
You can also look at Google and say,
well, what matters is not the misses,
what matters is the hits.
And for every, not for every Google Hangouts
or dead Google product,
there's like a Google calendar or a Gmail, right?
Or maps, right?
Or maps, that's right, that's right.
So like Paul Buchheit, famously, you know, now a YC partner,
I don't know if he's left YC, he was there when I was there.
But he like was the founder of Gmail
and his entire like claim to fame and Silicon Valley
is that he was the Gmail guy.
That was a company within Google.
That's the way to think of it.
And certainly from a revenue standpoint,
that's what it is, right?
So when Sam, who comes out of the YC ecosystem too, right,
having been the president of Y Combinator for many years,
the first one after Paul Graham,
his whole thing is spray and pray, right?
He's used to betting, like, a little bit of money
on a lot of efforts at the same time.
This goes way back to OpenAI's founding days, 2015-ish,
where he was doing the spray and pray thing
on a whole bunch of different things, you know,
evolutionary methods and robotics and RL and all this stuff.
You know, OpenAI Gym was originally a side project, right?
Now it's a big thing.
So there are a whole bunch of different plays
that, you know, he's sort of used to playing in this way.
It's worked in the past.
The challenge is now you're entering a much more mature environment, right?
You're no longer necessarily in the game of betting
on a bunch of different startups.
You have a kind of core area that you need to dominate right now.
You know, Anthropic is running away with the pot.
I think it's over 70% market share right now in enterprise.
So that's the red alert here
that Sam Altman is calling in Fidji Simo for.
So they've got to find a way to kind of like,
yes, you got to keep that Google or that YC kind of approach
to investing in a whole bunch of things.
You never know where the next big thing is going to come from.
But you also have to be able to invest and double down
in things that are working that are generating a lot of your revenue.
I mean, 25% market share on enterprise is a big, big deal.
You know, they're making what, $25 billion of annualized revenue
at this point.
Like they have stuff that's really working.
And so this is a structural shift.
There's another piece that's happening here where I think there's
an over rotation on, oh, well, opening eyes like, you know,
kind of throwing out all these side projects.
Sam wrote on X a couple of days ago.
He's like, look, there's all these rumors
that we've also canceled the hardware thing with Jony Ive.
And in fact, this article sort of repeats that rumor
saying that, hey, they, you know, they basically canceled the roll
out of these AI powered earbuds.
Well, actually, at least Sam says the Johnny Ive thing is still live.
And so again, it's a question of what about the hits
not necessarily just the misses,
but some of the hits now are worth protecting.
And that's a challenge.
If you start to grab the pot, Sam needs to find a way
to hold onto it and expand his territory.
It's not just a land grab.
Right.
And I think, to be fair, Claude Code initially, the story goes,
was a single person's kind of side project
and it turned into a massive thing because someone pursued it.
Really, what this perhaps indicates
is the various bets that open eyes have been making
with the browser, with the SOAR app.
They haven't seemed to really be a hit or like BS.
Nearly as transformational as Cloud Code.
So they missed out on the Cloud Code moment.
They started catching up late with Codex.
It wasn't until kind of pretty late into last year
that I started feeling like they're really taking it seriously.
Already by mid-year, if you were feeling the pulse
of where AI was, people were all in on Claude Code.
And you know, I adopted it sometime around May or June.
Codex didn't really start kicking off
until maybe even November or later.
So yeah, it really signals kind of more,
not necessarily not doing these other things,
but realizing that they were left behind
and now we need to catch up and start making money
with businesses because that's as we often say
where you make your money much more easily
than with consumer products.
Yeah, though you ultimately need to go straight
to consumer to get the big pot of dollars.
But it's early in the game, like it is always early
in the game in AI. And to your point,
I mean, this is, and I can't remember if this is a Napoleon thing
or an Alexander the Great thing,
but people would say like he could achieve a victory
but not use it.
This is kind of that, right?
There's two different modes in the space.
First is getting your product market fit
and then it's holding on to it in the face of competition
and then kind of attacking other people's moats.
And that second one is it's a shift.
It's a fundamental kind of mindset shift
that you got to get into and it looks different for AI
and for traditional SaaS companies too.
Like you just can't sit on a stack of software
for like a month and expect it to hold on to market share.
So yeah, it's a really interesting space.
We're learning a lot in real time right now
about what it takes to gain market share
and hold market share in a world of low switching costs
and very rapid advancement.
Next up, moving back to GTC.
We have some more details.
One other thing that Jensen Huang, the CEO of Nvidia,
has announced is that purchase orders
for Blackwell and Vera Rubin chips,
the newest generations of chips,
are expected to reach $1 trillion through 2027,
doubling last year's projection of a $500 billion revenue
opportunity.
He's saying that the key that's driving it
is a shift from chatbots to agentic AI applications,
which do require much more compute.
You're just generating a lot more output.
And if you have OpenClaw, which is an always-on agent
that you let go off and work for hours at a time.
And this is where things are heading, at least if you're
kind of a believer in the trends, right?
They've been half a year, a year.
We could expect agents just working on their own
for many hours at a time, even days at a time.
And so the bet is that it's going to keep happening.
They also unveiled the Groq 3 language processing unit,
which is the first announcement related to Groq
since the $20 billion asset purchase in December.
Groq's language processing unit
has been a very impressive piece of work.
Groq has really high throughput,
and it is just a great option for running inference
if you need speed.
So that, I think, is also quite significant.
Yeah, that's right, the $20 billion purchase
of Groq by Nvidia a few months ago.
And yeah, I guess that was December.
Huge deal, obviously.
Already they're looking to ship a product
in the third quarter of this year.
Yeah, that's fast.
Under a year from acquisition
and they're already fully integrated
and pumping out stuff just as a reminder here.
So when you think about Groq's LPU,
the language processing unit,
go back to our hardware episode for the deep dive on this.
But basically, in any kind of GPU,
like the Nvidia GPUs,
you'll traditionally have a stack of high-bandwidth memory, HBM,
that sits next to the logic die.
And the logic die is what actually does the matrix math,
the computations that are interesting, the number crunching,
but it has to pull data from that high-bandwidth memory,
the stack of HBM that's next to it.
And pulling that data takes time
because they're physically just like different objects, right?
The HBM is a stack and then you have a logic die
and they're packaged together,
but you run into this challenge where the data
just has to travel from the memory,
get grabbed, and come back, and that creates that memory wall
where the processor spends like 70% of its time
just waiting for stuff, right?
So you've got a kind of cold starting issue there.
And the Groq language processing units,
the LPUs, use a kind of memory called SRAM
that is built directly into the silicon.
So the logic and the memory are much more intimately linked;
the data doesn't have to travel, it's already there.
So you get this massive increase in internal
memory bandwidth, so how much data per second can flow
between the relevant components,
something like 10 to 20x faster on the Groq units.
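As a rough illustration of why that bandwidth gap matters for inference, here is a toy calculation of the time just to stream a model's active weights once per generated token. The bandwidth numbers are illustrative assumptions in the HBM-versus-on-die-SRAM ballpark, not official specs for any particular chip.

# Toy model: per-token latency floor from streaming weights through memory.
active_params = 6e9           # e.g. a 6B-active-parameter model
bytes_per_param = 2           # fp16 weights (assumption)
weight_bytes = active_params * bytes_per_param

hbm_bandwidth  = 8e12         # ~8 TB/s, HBM-class (illustrative)
sram_bandwidth = 80e12        # ~80 TB/s, on-die SRAM-class (illustrative)

t_hbm  = weight_bytes / hbm_bandwidth    # ~1.5 ms per token
t_sram = weight_bytes / sram_bandwidth   # ~0.15 ms per token

print(t_hbm * 1e3, t_sram * 1e3)  # memory-bound decode speedup tracks the bandwidth ratio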
So this is really, really important,
because Nvidia is going to be integrating this
with their Vera Rubin architecture.
That's the next generation after Blackwell
and kind of doing a hybrid of these two ideas.
The idea is they're going to have HBM
sitting outside the main processor,
but also integrating some of this SRAM heavy compute tiles
that are directly going to be in the Rubin chips.
And so this is a really big merger,
a physical merger of these companies that we're seeing here.
So yeah, it's pretty wild.
It also means a really intense demand for memory
in the memory market.
Right now, I think SK Hynix,
I saw something earlier today
that they're basically booked out until 2030;
they expect to have way more
demand than supply all the way until 2030.
So this is all part of that, right?
People are realizing, holy shit,
like memory is a big deal here.
And Nvidia is playing on the inference side of things too;
you know, I think Jensen called himself
the inference king or something
when asked, hey,
what's going on with this Groq architecture,
and he was like, hey, I'm the inference king.
So this is part of that, right?
The Groq play is an inference play.
And next, going back to Mistral:
one other thing that they have announced,
also at GTC, is Forge,
which is a new offering by them
to let customers, businesses train their own AI models.
It sounds like this supports various kinds of training.
You can pre-train and have an entirely custom model
from scratch, which isn't something
you would want for a lens,
but it sounds like maybe they allow it.
They also seem to be offering post-training,
reinforcement learning,
basically optimizing a model
for a given company's needs
with sort of all the stack and training knowledge
and inference and so on,
that is pretty kind of complex
and is what Mistral and OpenAI and Anthropic focus on.
A lot is just the training infrastructure
and setup and all of that.
So they pitch it as, you know,
you need the models to have your internal
knowledge and information built in,
you can optimize it for your needs.
I think this is an interesting offer,
and there's potential here now that open source models
have started to get good,
especially Qwen models, and Mistral models
to some extent. It is a potentially good bet
that as open source models get better,
you would see more people fine-tuning
their own models, doing post-training
and reinforcement learning for their own agents
and their own flows,
because ultimately the best way
to unlock performance is training.
You can do, you know, prompt engineering
and you can do whatever,
but training on data is key.
So they just announced it,
you need to sign up to get more details,
the details aren't super precise yet,
but it appears that they are shifting focus
or at least honing in on this specific strategy
to try and differentiate and compete in this space.
Yeah, I'd like to be upfront about my biases here.
Everyone, if you listen to the podcast a lot,
you know I'm pretty bearish on Mistral and Cohere
and those sorts of runner-up-to-the-frontier companies
for a whole bunch of reasons,
but this is not making me any more excited
about this direction.
The reason is, so first of all,
this positions them basically as competitors to Cohere.
This is the Cohere play, right?
This is it.
You're basically saying enterprises,
like we're just gonna be better at enterprise integration.
We're gonna make a stack for the enterprise.
The difference is Mistral is a few miles
behind Cohere right now, right?
Cohere has been working on this for a long time.
They hit like a quarter billion dollars
in recurring revenue in 2025.
It was some decent quarter on quarter growth,
though I think ultimately they're gonna struggle
to kind of be competitive in the world
of like big infrastructure plays here,
but that's kind of fundamentally one of the challenges.
The other one is the companies that have
an enduring advantage on infrastructure,
the Anthropics of the world, the Googles,
the OpenAIs of the world,
are ultimately, if they ever want to,
in a position to just eat Mistral and Cohere's lunch.
They can just go in and one day say,
hey, you know what, we decided that we really care about this.
We're gonna make special models,
maybe distill versions of our actual frontier models,
whatever to run locally on your thing,
or we're gonna create new things.
And OpenAI has in the past offered
fine-tuning of, I think, GPT-4.1.
They don't allow it for the latest set of models,
but it used to be something they offered.
Yeah, exactly.
And so ultimately, I mean,
the challenge is you are going to end up competing
with them, and the question will at some point be,
because we know this is where margin is made,
quality versus quality at the frontier of capabilities.
If that ends up being the case, I mean, damn,
like this is an uphill battle,
because, actually, OpenAI
and Anthropic get to amortize the massive cost
of training frontier models across every inference run
that gets done on their infrastructure.
And if they're, you know, doing distillates of those models
to serve the enterprise, the same thing applies;
the economics look pretty bad to me for stuff like this.
But ultimately, you know, we'll see,
this is also by the way,
Mistral kind of playing into their European roots a bit.
So Cohere has a sort of Canadian footprint,
they've got big EU ambitions,
but Mistral is truly the French champion.
And they've very much been seen as the sovereign champion.
So data sovereignty, model sovereignty,
you'll hear them say a lot of that.
Ultimately, from a free market standpoint,
that's basically just an appeal to like kind of European
or French nationalism to kind of help float them
on this otherwise potentially market-challenged play.
I think, yeah, maybe I would say you're a little bit
caricaturing them,
they position it more as control over IP.
And this is something that people care about.
I totally agree that's the play.
That's an identical play though,
what I'm saying is to co here.
And it's and on that note, co here,
their play or largely what they have focused on
is releasing models that they have trained
with the focus of enterprise.
So they release rag models, for instance,
that they say are very good for enterprise use cases.
They have largely positioned themselves
as having models and offerings that are useful for enterprise.
They do have a thing for building customized AI solutions.
I don't know how much of it is apples to apples
with this new forge, whatever it is.
That's kind of the thing, right?
So yeah, Cohere's thing is called North.
It's this whole enterprise platform, right?
It's it's for AI agents and workflows
and then they've got a bunch of rerank and embed models
and a bunch of other things for RAG
that are meant to be enterprise oriented.
But that's kind of the challenge.
It's like ultimately, you're already seeing this appeal
happening quite quickly to nationalism.
I mean, Mistral is wrapping themselves in the French flag.
You see Cohere doing similar things
with the Canadian flag.
Ultimately, these are signs of like companies
trying to look for some source of alpha
that's outside of the market.
I'm not saying that OpenAI isn't doing similar stuff
and Anthropic certainly is doing similar stuff
and still succeeding.
And so this is kind of part of the flavor to me
that like it starts to look a bit like a crutch
and there's only so much, you know,
capex spend that can be subsidized by governments;
at a certain scale, which I think we're approaching already,
it makes it hard to compete.
When your big differentiator is like,
look how French we are or look how Canadian we are,
that is a bit of a caricature for sure.
But it's an important part of the pitch here.
I worry about that.
And that's kind of part of my bearishness here.
I'm just trying to be transparent about my reasoning here.
I could be wrong, but looking at the scale
of these companies, I mean, man, Cohere's been around
for longer than Anthropic, right?
Like they're, you know, what, a percent of Anthropic's scale?
I think this is a sign that these markets are smaller
than maybe had been hoped for initially.
We'll see.
I mean, they could absolutely surprise me.
And I'm looking forward to seeing how it plays out.
Yeah, I think longer term, I think this is an interesting time
to consider this direction of customized
and, you know, fine tune models for a given business
because the open source models,
the models that aren't proprietary,
are starting to get good.
And so you might start seeing a case where,
for instance, at Astrocade, right,
we have a pretty specific use case of coding
for these games, right?
And having coding agents with a specific set of technologies,
HTML, JavaScript, et cetera.
So it is feasible that if you were to do reinforcement learning
and post-training on our own internal proprietary data
of, you know, good games that people have made
and chat histories and so on,
we would get a better model and a better performance.
That didn't used to be the case.
It used to be that like the base models just weren't good enough,
just doing prompt engineering was going to be the best you can do.
I could see it heading in that direction,
and Mistral may not be the one that owns this.
It could be an Anthropic or OpenAI jumping in on it.
But regardless, having these sort of individualized,
customized models for different companies
and their agentic stacks seems like a very feasible future.
Yeah, definitely not questioning the existence of a market, right?
I'm just sort of skeptical of their positioning, as you said,
to like, are they the ones to capture this?
Yeah, that's where I'm wary.
Yeah, it's something to try.
So it's something to try, no doubt.
And back to chips,
China's ByteDance gets access to top Nvidia AI chips,
according to the Wall Street Journal.
So they are reporting that ByteDance is assembling
significant computing power outside of China using these chips.
They are supposedly working with a Southeast Asian cloud firm,
and they plan to deploy approximately 500
Nvidia Blackwell computing systems in Malaysia,
totaling 36,000 B200 chips.
So that's 500 computing systems, presumably meaning racks
that amounts to tens of thousands of actual GPUs.
So yeah, it's giving us an indication
that ByteDance, being a massive company
that has its products used outside of China, no doubt,
is able to make use of these chips in other countries.
Yeah, yeah, and you know, if you're looking at this
being like, hey, wait a minute, I thought we weren't shipping.
I thought we just greenlit the H200 shipping.
What are we doing shipping Blackwells to China?
Well, that's the whole point.
It's technically not being shipped to China.
The export controls that apply here from 2023 regulate
where the hardware is shipped, not where the compute is used.
Right? And that was intentional.
It was meant to allow the sort of like global cloud infrastructure
to be built based on American hardware.
But the thing is, ByteDance is not on the famous entity list,
and there's a military end-user list as well.
So the fact they're using Nvidia hardware
does not automatically in and of itself trigger restrictions.
The fact is that they're just setting up shop outside of China
and normally that's totally okay.
And Nvidia actually confirmed there were no objections here,
and BIS at the Department of Commerce
apparently is also on board.
So they're all tracking.
Still, I mean, this kind of reveals like,
hey, well, this is what you're opening yourself up to.
If you think about the scale here
for a second, by the way:
What does it mean to have 36,000 B200 GPUs,
you know, 500 of these NVL72 units?
So this is typically a two rack system.
Yeah, I think we talked about it in the hardware episode at one point.
So this many GPUs adds up to about a 60 megawatt cluster.
Okay, so 60 megawatts of power.
That's enough to power roughly like 60,000 homes.
You can think of it that way.
So it's like a little small town of compute.
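(For a quick sanity check on those numbers, here's a rough back-of-the-envelope sketch. The roughly 120 kW per NVL72 system figure is an assumed ballpark for illustration, not something stated in the episode.)

```python
# Rough back-of-the-envelope sketch of the cluster described above.
# Assumptions (not from the episode): ~72 GPUs per NVL72 system and
# roughly ~120 kW of power per system, all-in.
systems = 500
gpus_per_system = 72
kw_per_system = 120  # assumed ballpark

total_gpus = systems * gpus_per_system      # 36,000 GPUs
total_mw = systems * kw_per_system / 1000   # ~60 MW

# ~1 kW average draw per home is a common rule of thumb.
homes_equivalent = int(total_mw * 1000 / 1.0)  # ~60,000 homes

print(total_gpus, total_mw, homes_equivalent)
```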
Compare that to Microsoft's Abilene site, which is like 100 megawatts,
or Elon's xAI Texas Colossus site.
That one's more like 130 megawatts.
The Abilene comparison is maybe unfair
because that's still kind of being built out.
But Colossus is a better apples-to-apples comparison
because, even though they're still building it out,
it's single-purpose, you know, it was stood up very fast.
What this is showing is you're allowing a Chinese company
to get up to the same at least
order of magnitude in compute.
It'll admittedly be online kind of like, you know,
later on this year, but that's still something.
So this is all part of the debate here over
what is appropriate, how much compute should be allowed,
even for overseas deployments
that can serve Chinese customers, by the way.
This is still a data center that can be used
to serve Chinese customers.
So in principle, at least to my understanding here,
ByteDance could use this to train a model
that then is used in China,
that then because of civil military fusion gets used
by the CCP, you know, the MSS
or whatever other arm of the Chinese state.
So they've also planned additional deployments
of up to 7,000 B-200s at a bunch of data centers in Indonesia.
This is kind of a broader Southeast Asian strategy here.
So we'll see, I mean, it's a pretty big,
you might think of it as a pretty big gap.
That's sort of how I think of it personally,
but there are different views on is this desirable
and you could make an interesting argument
in many different directions.
Moving back to the US,
we've got an update on where meta is at with their AI efforts.
The update is that they are delaying the rollout
of their next model, codename avocado,
from March to at least May.
I think this was actually announced last week,
but we didn't manage to cover it.
So they appear to have just not been able
to train a good enough model in time.
And so they are delaying this release.
They are discussing or have discussed
at least temporarily licensing competitors' AI technology,
reportedly Google's Gemini.
They already are looking at various extensions.
So overall, not looking great for meta is my read on this.
It's already been close to a year.
It feels like it was mid-2025, a couple of months
after Llama 4, that we saw this heavy, heavy, heavy push
where they got Alexandr Wang from Scale AI
in a $14 billion deal.
And then they paid absurd money to get all of these researchers
from Anthropic and OpenAI and DeepMind and so on.
And they created Superintelligence Labs,
and the company reshuffled everything
with the intent to train a competitive frontier model
that can stand up to GPT 5.4 and Claude and all these other ones.
It doesn't seem like they're there.
And personally, this is my read,
I think organizationally having Alexandr Wang lead it
is maybe not the best idea.
Someone who hasn't worked on frontier AI models,
who did data curation at Scale AI, you know, that's-
It's unrelated, right?
It's related, but I don't know if that's enough.
Anyway, so that's the news, it's been delayed.
It doesn't appear like they're on track,
but you don't know too much.
This is all internal.
Maybe they're just doubling down and it's going to be great.
Yeah, boy, yeah, right.
And it looks like it; so reportedly it's
landing somewhere in capabilities between Gemini 2.5
and Gemini 3.
So, you know, very passé at this point,
hence the decision to delay.
They're angling for a private non-open source model here, right?
So the appropriate point of comparison
is what does Claude look like?
You know, what does the best version of Claude look like?
It seems like, as you say, it's just not up to it,
at least for right now.
And that could change quickly, who knows?
But yeah, it's also part of this whole saga
where you had Yann LeCun say, you know,
fuck Alex Wang when he was leaving.
And then everybody was, there's just like,
I forget, not the Times of India,
but there was some kind of India-based publication, right,
that said, hey, looks like Zuck is thinking about moving on from Wang.
Which was like not verified.
It was false.
Yeah, not false.
But yeah, exactly.
And then Zuck came out with a photo with Alex Wang,
sort of all smiles in the office type thing, to kind of dispel that.
Now, we do know, yeah, that aside from Yann LeCun,
Alexandr Wang has also clashed with internal kind of big-deal
product people like Chris Cox,
Meta's Chief Product Officer,
and Andrew Bosworth, their Chief Technology Officer.
Apparently, about how AI models should improve,
Alexandr Wang has been pushing for,
let's just train a super, super good model
and not care about the business implications
and the product implications,
which other people at Meta understandably are questioning.
Yeah, and it's exactly that. Now, Alex comes from this viewpoint
of, hey, let's do the OpenAI thing,
let's do the Anthropic thing,
and go full-on, you know, AGI and all that stuff.
That's very much, I mean, he is ASI-pilled.
Well, the name of the super intelligence team kind of hints
at that, and not only, I mean,
if you're going to make a super intelligence team,
yeah, they probably should be focused on super intelligence.
Like, I'm no branding expert.
I'm not a marketing dude, but yeah, it seems like that's par for the course.
Now, should Meta have a super intelligence team
separate questions, right?
That's a good question to ask, I don't know.
That's right.
And if, you know, certainly,
well, if you believe that super intelligence
is going to be the thing that Alex Wang believes it is
and that maybe Zuck believes it is,
then you have no choice.
You have to compete because somebody,
if somebody else does it, you're finished,
depending on how the intelligence explosion goes.
Bottom line is a lot of fog of war,
at least to me, still on what's going on inside Meta,
the one data point we have is that we still don't have anything
and that it's starting to become a real data point
at this point.
It's getting to the point where you'd definitely have expected
a model release by now, given the amount of infrastructure
and investment they've put in here.
And given that they have trained models in the past,
Llama 4 was underwhelming,
but it was a pretty impressive model
that most companies would not be able to produce, even still.
So the lack of anything is perhaps concerning.
And one story a bit similar,
the story is Microsoft shakes up AI division
as Copilot falls behind Google
and OpenAI; they're reorganizing the AI division.
Mustafa Suleyman is going to be shifting focus exclusively
to developing the company's own frontier foundation
language model.
So effectively, the same kind of focus
that we just discussed with Meta,
Microsoft has worked on training their own alternative
to ChatGPT and so on.
They've trained various models.
I believe the big one from them was Phi,
which is a line of small models that they released.
They do have a lot of users.
So apparently, Microsoft's Copilot app
has 150 million monthly active users,
but if they want to compete with Gemini or ChatGPT,
they're nowhere near, they're way, way behind. Gemini
now has 750 million active users,
ChatGPT has roughly 900 million,
sorry, weekly active users, versus Copilot's
150 million monthly active users.
So yeah, that's the news, they are restructuring,
they are recognizing that they're falling behind.
And if they want to have their own frontier model,
they are not there.
This is all the more glaring given the insane distribution
advantage that Microsoft has.
Distribution is supposed to win these kinds of battles.
There's a reason that Microsoft Teams
kicked the shit out of Slack, like, within minutes of launching.
And it's because they had distribution.
Microsoft is in every system, right?
And yet you have like admittedly,
okay, so Google is probably a bad comparison
because they have a similar, insane distribution advantage.
But when you look at, you know, OpenAI's models
and Anthropic's models,
not being able to compete with those
is a really kind of bad indication here.
So yeah, I mean, no surprise, something like this
comes with a requirement to shift.
The structural dependence on OpenAI
has been a big issue.
You know, Microsoft has that 27% stake
in at least the for-profit unit,
I'm not going to get into the dissection of OpenAI
and their corporate structure.
But bottom line is that, in a sense, that
has been maybe a, I don't know, call it a security blanket.
It's increasingly clear that they're competing directly
with OpenAI for a bunch of enterprise customers.
So cannibalizing themselves really from the inside,
that's a problem.
That's a really big problem.
Microsoft has a lot more to lose in that sense.
And yeah, so they got to find a way
to turn this around right quick
if they think AI's headed where it might be.
Quick recap by the way.
So Mustafa Suleyman, co-founder of DeepMind,
joined Microsoft, I think last year,
and has been their CEO of Microsoft AI.
And this whole story is based at least partially
on a memo he released, titled
"A New Structure for Microsoft AI."
And the gist of it is, again, similar to Meta,
that he wants to pursue superintelligence,
and consumer things and product considerations
get in the way of that.
So he's going to be freed up to focus on that.
Jacob Andreou, former senior vice president at Snap,
will take over as executive vice president
leading the co-pilot division.
So there's a split here where there's co-pilot,
the product, the app, et cetera.
Someone else focuses on that.
Mustafa focuses on building the frontier model
on getting to superintelligence.
This is much like what we just discussed
with Alexandr Wang at Meta.
Well, it's your point.
We're seeing this same motif over and over, right?
It's not just meta, it's not just Microsoft.
It's OpenAI, I mean, it's happening at Google,
like basically the question is,
how seriously do you take the potential
imminent arrival of super intelligence
if you take it very seriously?
You got to drop everything, work on that shit, man.
That's what Alex Wang is saying.
And clearly, Microsoft is not immune to that tension, right?
How much do you view AI as like a commercial product question
and how much do you view it as,
this is the thing that's triggering the singularity,
holy shit. Like, the one place
you don't tend to see that kind of tension,
or, to be careful, it exists,
you see some of it,
I should say, in the media,
but you don't see as much of it, is Anthropic.
Like it seems like generally anthropic
and its public communications is much more aligned
around the idea of like where they're going.
So, you know, do with that what you will,
but it definitely is notable
that basically everybody else
is having these large public spats about this.
And moving on to policy and safety,
where we have quite a bit of research related to safety.
The first one is a decision-theoretic formalization
of steganography with application to LLM monitoring.
So, steganography is the practice of concealing secret information
within non-secret data.
We've seen this typically with images,
but you can also do it with text.
And as far as LLM monitoring, right?
If you have a malicious AI that you want to catch,
if it's trying to do something bad,
or kind of a monitor's communications
to know what it's doing,
potentially it could try to get around
that sort of monitoring with stenography.
And this is...
Steganography.
Steganography, sorry.
And so, this is presenting a new framework
of formalization of how to view this entire thing
and does also discuss the implications for LLM monitoring.
And Jeremy, you can take it and go, do your thing.
It's all yours, do your thing.
Yeah, I'll say steganography four times fast.
So, yeah, we've talked about steganography
on the podcast before quite a bit, right?
This idea that a model might be reasoning
in its chain of thought, for example,
but there could be other manifestations
in a way that is intentionally meant to deceive you,
but that allows it to kind of make evil plans, right?
So, an example they have of this in the paper,
here's an excerpt of a reasoning trace from o3.
It says, the summary says improved 7.7,
so whatever that means, but we can glean,
and then this is the weird kind of non sequitur
part of the chain of thought.
It says, but we can glean,
disclaim, disclaim, synergy, customizing illusions,
but we may produce, disclaim, disclaim, advantage.
Now, like, disclaim, overshadow, overshadow,
intangible, let's craft, right?
So, like, there are all of these random words
that don't seem like they have a role
in solving the problem, which may be, in principle,
the model hiding reasoning that's oriented
towards solving another problem
besides the one that you wanted to solve
that could be malicious or whatever.
And so, the central question of this paper is going to be,
does, so, can we measure steganography, right?
Can we measure this idea of, to what extent is my model
encoding or trying to hide reasoning from me
or from some oversight in its chain of thought?
And here's the surprisingly elegant way
they're going to approach this, right?
They're going to define this thing called the steganographic gap:
roughly speaking, the difference in performance
between two models given the same suspect reasoning trace.
There's the model that generated the reasoning trace,
which is potentially trying to, you know,
use it to encode some hidden content,
and then there's a trusted model of about the same capability
that doesn't know whatever secret encoding scheme
the generator model used.
So basically, the model that generated the content,
the reasoning trace that's suspect,
is presumably using some kind of encoding scheme,
and then they have a sentinel that's going to review
that model's outputs, but it doesn't have access
to the encoding scheme.
And if what you find is that, based on the chain of thought
that's suspect, the generator model performs really, really well,
but the sentinel performs really badly,
then that implies that the generator model
is able to get some kind of value out of that chain of thought
that the sentinel, the kind of reviewing model, couldn't.
And that suggests that there must be information encoded
in there that only the generator model is able to use,
hence the sort of hidden steganographic value.
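(Roughly, the steganographic gap boils down to a difference in task performance between the generator and a trusted sentinel given the same suspect chain of thought. A minimal sketch of that comparison; the answer functions here are placeholders, not the paper's actual implementation.)

```python
def steganographic_gap(problems, generator_answer, sentinel_answer):
    """Rough sketch: compare how well the generator vs. a trusted sentinel
    can exploit the same (possibly encoded) chain of thought.
    `generator_answer` / `sentinel_answer` are placeholder callables that
    take (question, chain_of_thought) and return an answer string."""
    gen_correct = 0
    sen_correct = 0
    for p in problems:  # each p: {"question", "chain_of_thought", "gold"}
        if generator_answer(p["question"], p["chain_of_thought"]) == p["gold"]:
            gen_correct += 1
        if sentinel_answer(p["question"], p["chain_of_thought"]) == p["gold"]:
            sen_correct += 1
    n = len(problems)
    # A large positive gap suggests information in the trace that only the
    # generator can decode, i.e. possible steganography.
    return gen_correct / n - sen_correct / n
```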
So they do a whole bunch of stuff to kind of surface this.
They're fairly elegant experiments.
I'll focus on the code word training one.
So they fine tune models using RL to use certain code words.
So for example, instead of using heads or tails,
they'll use sun and moon, right?
And then they show that as they give the oversight model,
the sentinel, more visibility into the kind of code words
and how they map, gradually the performance gap
between the sentinel model and the generator model drops.
So this kind of makes sense.
You're giving the sentinel more and more access
to the kind of code book that generated these outputs,
so no surprise there.
But then they do some more interesting fine tuning experiments.
They take GPT-4.1 and they get it to use
this thing called a bigram cipher.
Basically, it means that you take any pair of letters
and swap them out with some randomly chosen pair of letters.
So maybe every time you see, you know, TE,
that becomes like GH or something.
And then, so basically you see gibberish.
And at first, you get the model to solve competitive programming
problems or something like that using this bigram cipher,
and they measure that steganographic gap.
So the model that was trained to use the cipher
versus a model that wasn't,
like how does their performance vary over time?
Is there an ability to detect the existence of the cipher
or the use of that cipher?
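(For a feel of what a bigram substitution cipher like the one described here does, a toy sketch; the actual mapping used in the experiments isn't specified in the episode, so this one is just random.)

```python
import random
import string

def make_bigram_cipher(seed=0):
    """Toy bigram substitution: every pair of letters maps to a randomly
    chosen replacement pair, as described above (illustrative only)."""
    rng = random.Random(seed)
    letters = string.ascii_lowercase
    bigrams = [a + b for a in letters for b in letters]
    shuffled = bigrams[:]
    rng.shuffle(shuffled)
    return dict(zip(bigrams, shuffled))

def encode(text, cipher):
    text = "".join(c for c in text.lower() if c.isalpha())
    if len(text) % 2:
        text += "x"  # pad to an even length
    return "".join(cipher[text[i:i + 2]] for i in range(0, len(text), 2))

cipher = make_bigram_cipher()
print(encode("the model hides its reasoning", cipher))  # reads as gibberish
```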
It's quite an interesting paper.
It doesn't quite get to like, anyway, there are a couple of gaps.
So if the model that's generating the content
actually has a goal that's different from the goal
that it's ostensibly trying to pursue,
that won't be detected here.
So it's got like a payload that is unrelated
to the tasks that's being audited for.
You just won't get detection.
And sandbagging could also be a problem here,
like deliberately underperforming when it thinks
it's being audited.
And we do know the models can often detect indications of that.
So bottom line is it's an interesting kind of first step
in the direction of detecting steganography,
could be an interesting and valuable path to explore
with some pretty significant gaps too.
Next paper, a bit related.
The title is Reasoning Theater,
disentangling model beliefs from chain of thought.
So one theme we've been kind of seeing
in research over the past few months
is looking into chain of thought,
the intermediate outputs that models are now trained to give,
their reasoning: you ask a question,
it does some reasoning and chews on the problem,
and then eventually you get an answer.
And so there are various questions about
this intermediate output, like,
does it faithfully capture what the model is doing?
Is it monitorable, as we've just discussed?
For instance. And what this paper is looking at
is whether it's performative.
So is the model actually thinking,
like trying to find an answer via this chain of thought,
or does it know the answer
and is just talking as if it's thinking?
And what they find is, often, the chain of thought
is performative: you don't actually need it,
the model, if you look at its internal activations,
can already give you an answer with high confidence,
which is correct.
Now you can look at different settings.
So if you have harder problems and smaller models,
then the thinking becomes more legitimate.
It actually does need to do the chain of thought
to get to a high confidence answer.
It's actually kind of related, or similar,
to the deep thinking tokens idea we discussed a few weeks ago,
where we were looking at if you track internal activations
and look at things like deep thinking,
backtracking, certain behaviors
that indicate actual genuine kind of reasoning
about the problem.
You can find that these are very significant markers
as to the confidence of the model
at this point in reasoning about its answer.
So one application of this is, if you can know
the confidence level from internal activations,
you can do what's called early exiting:
cut off the reasoning and just get to the answer quicker.
Yeah, another kind of pointer to that being the case here.
And to your point, so there's kind of like three different
methods that they'll use to figure out
because the central question here is like,
what does my model actually believe at any given point
during reasoning?
So one is this idea of attention probes.
So you basically have these classifiers,
these linear models or very simple models,
that you plug into any given layer of the residual stream,
and you go, okay, I want you to predict
what the final output is going to be
at any given moment.
So basically, this is looking layer-wise
and reading out the model mid-token-processing,
for a given token.
But then there's separately this idea of forced answering that you just alluded to essentially
interrupt the model mid reasoning, like in the middle of a chain of thought and say, okay,
now I want you to commit your answer.
And then the last one is just like a chain of thought monitor, wait for the model to produce
the entire chain of thought, and then figure out, like, you know, where in the text the final
answer is implied.
So at what point could the model have just offered and said, okay, you know, this is it.
And they compare all three of these.
And by noticing the differences between them, they can start to make some pretty interesting
conclusions.
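(As a rough picture of what a probe like this looks like mechanically, here's a minimal linear-probe sketch over cached residual-stream activations. The paper's attention probes are fancier than this, and the data loading here is hypothetical.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: `activations` is an (n_examples, d_model) array of
# residual-stream activations captured at some layer, mid-reasoning, and
# `final_answers` holds the model's eventual answer (as a class label)
# for each example.
def train_probe(activations: np.ndarray, final_answers: np.ndarray):
    """Fit a simple linear probe that predicts the final answer from
    intermediate activations. High accuracy early in the reasoning means
    the model effectively 'already knows' the answer."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations, final_answers)
    return probe

def probe_accuracy(probe, activations, final_answers):
    return float((probe.predict(activations) == final_answers).mean())
```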
I mean, you talked about this idea of sort of easy tasks leading to more of, they call it, performative
reasoning, right?
So the model's pretending to reason about it.
Or, you know, it's pretending by dressing up a problem like it's harder than it is.
So on MMLU questions, you find consistently that the attention probe, which is again,
that layer wise, kind of looking at a given layer, plugging in a classifier to just look
at the activations there and then making that prediction.
So the attention probe could decode the correct final answer from the model's internals
really early on, right?
Like sometimes right after reading the question, like straight one shotting it, right?
Whereas the chain of thought monitor sees no answer indicated in the text yet.
So the model essentially knows the answer immediately, but it keeps thinking out loud
anyway.
And that's kind of where, you know, forced answering is also going to give you good
accuracy early on and all that.
That's an indication of what they'll call performative reasoning.
The fact that hard questions involve genuine reasoning more often, especially with small
models, we've actually seen tons of examples of that on the podcast before.
The way that I tend to think of it, at least, is: the bigger the mismatch between the capabilities
of your model, or the amount of test-time compute, or whatever, the capabilities, let's
say, of your system and your compute budget taken together,
and the actual difficulty of the problem, the more opportunity there is for the model to
use the extra kind of capability overhang or compute budget that it has to plot or scheme
or do other things besides what you want it to do.
And so I said it, I think, last podcast, and I'll say it again: a really important research
direction, to my mind, would be, how do we better index our appetite, or compute budgets,
or our model selection, model capability selection, to match models and capabilities to problems
such that they're kind of closely matched economically?
You want this anyway, right?
You don't want to use a model that's too big or too expensive to solve an easy problem,
but there's also a safety imperative to this now where you want to try to take away
as much as you can, the excess compute excess capability that otherwise goes into kind
of these undesirable effects.
Next up, in-training defenses against emergent misalignment in language models.
So emergent misalignment is a phenomenon we've discussed previously, where you train your
model, you fine-tune it slightly on a small set of stuff that's bad, like you make it
write bad code, and then it turns out that that broadly makes it bad in all sorts of
ways, and it ends up misaligned.
And so what the in-training defenses here are addressing is, let's say OpenAI or
Anthropic do give a fine-tuning API and allow people to fine-tune their models on small
datasets.
Well, you would then introduce this potential for broad misalignment from just a small
amount of fine-tuning.
And they look at several different ways to try to defend against it and prevent it from happening.
So they look at regularizing, adding a penalty for straying too far from a good model; there's
steering; and the method that they ultimately conclude works the best is interleaving the
training examples from a general alignment tuning dataset, a general, like, "this is
how we should behave" dataset, with the fine-tuning dataset, and they find that is
the most effective for defending against fine-tuning misalignment.
Yeah, this is kind of an interesting paper. I have mixed feelings on emergent misalignment
because I think it may not be a problem, it may almost be a good thing in a weird way.
It's a little bit weird to focus on it with regards to fine-tuning, because you could have, like,
intentional misalignment, why do you, you know... Right, well, no, that's a good point.
And I think I think the frame is something like emergent misalignment is leads to these
surprising knock on effects.
What it fundamentally means is if you can fine tune for one behavior and that leads to
massive changes in other behaviors, then damn, we got to be really careful about fine
tuning because we can't think of a model once fine tuned as in any way comparable from
a safety standpoint to the original model.
So any safety valves we've run on the original model no longer apply.
I think that's kind of part of it.
I think, you know, this is a good example of the kind of persona theory that Anthropic's
been kind of pushing lately, where you're fine-tuning a model to write insecure code.
The model kind of goes, oh, so actually, it's not that I'm being trained just to write
insecure code.
I'm being asked to activate my misaligned persona in a more general sense, which means that
there is a notion of a misalignment persona in the model in the first place, which means
that there is an out of distribution generalization, kind of understanding of what alignment and
safety is, which seems like it could actually be a good thing.
That was kind of the rabbit hole that I was trying to avoid, but that's the general
direction.
One quick observation on this and we'll move on, but basically there are two defenses where I found the relative
performance pretty surprising; neither one of these was the most performant one.
So they use this KL divergence regularization, where they're going to penalize the model
for drifting too far from the original model during training, right?
So: I'm going to fine-tune you to write insecure code, but if you drift too far on your
outputs, in kind of other domains as well, I'm going to prevent you from doing that by forcing
your outputs to be at least somewhat similar.
So that's one.
The other one was a similar thing; it's LDIFS, feature-space regularization.
It's like basically the same thing, but for the activations of the model, roughly speaking.
So instead of penalizing drifts in the outputs, the actual text that it writes, it penalizes
drifting too far in terms of the model's internal activations rather than the outputs.
And this one, I would have expected it to actually be more effective because in some way,
it's more fundamental.
It feels like you're getting at the actual guts of the model.
It doesn't work as well, though, as the simple text-based regularization,
just making sure that the outputs, as read, stay close to the original.
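(For intuition, the KL regularization defense amounts to adding a term to the fine-tuning loss that penalizes the fine-tuned model's output distribution for drifting away from a frozen reference copy. A rough PyTorch-style sketch, not the paper's exact loss; the weight value is illustrative.)

```python
import torch
import torch.nn.functional as F

def finetune_loss_with_kl(logits, ref_logits, labels, kl_weight=0.1):
    """Cross-entropy fine-tuning loss plus a KL penalty that keeps the
    fine-tuned model's token distribution close to a frozen reference
    model's distribution. `kl_weight` is an illustrative value."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    log_p = F.log_softmax(logits, dim=-1)        # fine-tuned model
    log_q = F.log_softmax(ref_logits, dim=-1)    # frozen reference model
    # KL(p || q), averaged over tokens
    kl = (log_p.exp() * (log_p - log_q)).sum(-1).mean()
    return ce + kl_weight * kl
```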
And the other thing about this, I think part of this is, like, the text-based outputs are more
directly tied to the behavior itself that you care about.
But the other piece is, they possibly just selected the wrong layers to focus on.
They looked at like every fifth layer, just to save memory, but that might not be the
right kind of play.
And then also, I guess, activation space
has a lot of redundancy in it, so you may not be hitting all of the kind of representations
that matter.
Bottom line is, this is quite interesting and we just keep surfacing, surprising findings
in alignment.
And that's not a great thing, but it's also interesting.
So there we go.
And I think it used to be the case that there's like people who care about alignment and
people who don't.
And that is starting to change a little bit.
Yeah.
Yeah.
And next up, a more sort of programmatic, real world study, how do frontier AI agents
perform in multi-step cyber-attack scenarios?
This builds on something we discussed, I think, last week, this is from the AI Security
Institute.
Previously, they showed that, as you scale test-time compute to 100 million tokens, models get
better and better and better.
This is effectively building on that.
They released a paper where they measure the capabilities via what they call cyber ranges,
which are simulated network conditions within which you would execute an attack without
too many controls, without defenses.
You just need to execute a set of steps to accomplish some goals.
And they have multiple levels of completion, more or less.
And so what they find is, perhaps unsurprisingly, the models get better and better and better.
And also, they get better and better and better as you give them more test-time compute.
So overall conclusion is, the capabilities, the cyber harm capabilities of the models
are improving rapidly and are starting to get to some fairly complex scenarios.
First of all, a great paper by the UK AI Security Institute.
So they've continued to pump out really high-quality stuff that feels very relevant to a whole
bunch of end-game scenarios around ASI, cyber is very important, at least the way I think
about it, quite important in that picture.
And so, well, two trends that they identify, right?
So the first is, each successive model generation outperforms its predecessor at fixed token
budgets, right?
So for a given token budget, you're going to see better performance from Opus 4.6 than
from Opus 4.5, right?
So they see that.
They also see performance improvements with more test-time compute.
Like, we talked about this last week, pretty interesting, good paper, and I won't re-tread
the descriptions here.
But if you're curious about their cyber ranges, because this does go well beyond the
sort of traditional capture-the-flag stuff that we've seen in offensive cyber model
evals, this is really like simulating a whole corporate network, a whole corporate environment,
which is much closer to what actually makes a difference in the real world.
Next up, eval awareness in Claude Opus 4.6's BrowseComp performance.
This is a report from Anthropic, and this has been something that's been discussed a
little bit, kind of on and off.
So the eval awareness referenced here is one of the topics to be aware of with regards
to alignment and safety: basically, if you're checking a model for doing the right thing or not,
and it's aware of that, then it might behave differently when you're evaluating it versus
when it's actually out in the wild.
And so here they look at nine cases where leaked answers appeared in public web content,
and found, among the 1,266 problems on BrowseComp, 11 cases where answers
came from benchmark materials, and yeah, basically often, or in some cases, the model wound up reasoning
that the questions that it was dealing with were evaluation-shaped.
And then, having concluded that this is likely an evaluation, it looked up a benchmark.
So these are some examples where this can happen, at least in practice, already.
Yeah.
And the cases where you tend to see this happen according to the report is like, first
you'll see the model try like hundreds of normal search attempts and then conclude that
the question just felt artificially constructed after not succeeding through those earlier
attempts.
And it'll start to name and look for specific benchmarks.
So it'll be like, you'll see it look up, you know, GAIA, or BrowseComp,
or like whatever, a whole bunch of different evals, then it'll find whichever the eval is,
and it will typically look at the implementation, like on GitHub or whatever, and go from there.
And so that's how a lot of this happened.
One other weird thing that was highlighted on the internet this week was that you have
some e-commerce sites that will auto-generate these persistent pages, or pages that stay
up, like sub-pages that log the history of search queries on their website, even when
there are no matching products.
So if you search for, like, purple turtles on eBay or whatever e-commerce site,
that will lead to the creation of a persistent page somewhere on the internet.
And the effect is that every agent that runs browser comp slowly caches its queries as
these permanent indexed pages.
So over time, like agents from different evaluation runs were inadvertently leaving traces
of their search strategies that could be seen by future agents, which is the one thing that's
reported in the context of this paper. Maybe the broader implication is that this is an interesting
way for agents to have a kind of persistent memory that transcends their instantiation
in the moment. Though you shouldn't over-focus on this particular mechanism as the way that
agents could either communicate with other agents or with their future selves in illicit
ways.
It's more saying that hey, the internet is a really complicated messy place.
There are easy ways to hide information that can't necessarily be found easily.
And that's a really important thing from a safety standpoint.
Notably, Anthropic says they don't see Opus 4.6's behavior here as being an alignment failure,
because the model wasn't told to restrict its searches in any way, just to find the answer,
it's like fair enough, but also those of us who come from the world of like AI alignment
from back in the day, there is a thing called coherent extrapolated volition, C-E-V, basically
just like how well you could reasonably predict what someone would have wanted you to do
in context from the prompt.
And I think this qualifies as a failure of that.
You would probably assume that you don't need to be told not to try to game the system
in this way.
This feels like gaming, like hacking the benchmark.
I think that this is a failure of C-E-V and therefore a failure of alignment.
That would be my personal take.
But that's not Anthropic's take, and that's consistent; they have their own way of thinking
about this.
But I think you could view it as still being a failure of alignment.
Yeah, I think it's an implicit kind of assumption, right?
If you're being tested and you're like, oh, I'm probably being tested, then your answer
should not be, oh, let me look up answers to the test, right?
Right.
I'm pretty into it, I think.
Next another release from Anthropic, they have released Bloom, an open source, agentic
framework for generating targeted behavioral evaluations of AI models.
So it is, basically, automating behavioral evaluation.
You tell it what sort of behaviors you want to verify the model exhibits.
So like, is it generally honest?
Is it going to call me out on BS if I ask it?
Anything really that you want to test for.
And in the past, you would, you know, have to come up with a suite of tasks and a set
of prompts, and you would write a judge.
More or less, this presents a framework where it does all of it for you.
You give it the thing you care about and it generates the scenarios, runs the model, evaluates
it and gives you a report.
And Anthropic gives some examples on four different types of things: delusional
sycophancy, self-preservation, self-preferential bias, things like that.
And showcases how this framework can evaluate that and do well.
Yeah.
And this is Anthropic kind of putting some of its money where its mouth is on this whole
thesis of automated AI research and AI alignment research in particular, right?
So they're trying to encourage, and this is why it's open source.
They just want more automated alignment research if they can.
Hey, you know, this is, this is one idea for an agentic framework that can do this and
it's probably valuable in many ways.
This is being released not long after, like recently, a couple of weeks ago, something like that,
Anthropic released Petri, which is this auditor framework that looks at kind of like
the overall behavior profile of a bunch of different models and it will identify new
misaligned behaviors.
And so you can think of Bloom as a complement to Petri.
It's like Petri discovers the problem and then Bloom lets you kind of go deep and measure
a specific behavior that maybe was identified by Petri in more detail.
And Anthropic is also releasing a benchmark for behaviors,
I think you mentioned that,
that spans 16 different frontier models.
So this was all generated within a few days, apparently, with Bloom.
So they're really dogfooding this, essentially.
If you read the post, they've got a four-stage pipeline that they describe in some detail,
and some really interesting evidence that suggests that the Bloom framework aligns well with
human judgment.
And so they did a bunch of tests to make sure that in fact it does reveal the same things
that a human would conclude looking at this stuff, which is always an important thing.
So you can check that out.
It's a quite good and fairly detailed post.
Next up, how well do models follow their constitutions?
So famously, Anthropic follows this constitutional AI framework, where you encode the rules of the
system either as hard rules, like do this or don't do that, or, more recently, as general principles
of how a model should act, right?
It should be broadly trustworthy, helpful, things like that.
And so in this study, this group looked at the constitution and actually used Petri
to run it on a set of tasks, something like 200 tasks covering various aspects of the constitution.
And they evaluated various models, from Anthropic and also not from Anthropic, on their tendency
to go by the constitution.
And the results are actually good, like the Claude models are getting better and better.
The 4.6 set of models: Sonnet 4.6 had a 2% constitution violation rate,
Opus 4.6, 3%, and prior models were quite a bit higher.
Sonnet 4 had a 15% constitution violation rate.
And if you look at GPT 5.2, Gemini 3.0, they also have quite high violation rates against the constitution.
They give some examples of violations, and we see that some examples are like producing
cyber attack code when you shouldn't, or agreeing to do something that you shouldn't after
the user asks a few times. So they have a lot of data to basically go
through, but the overall conclusion is, Anthropic is kind of following through on the constitution
with how their models are trained.
Yeah, and it's interesting, they did some like cross company testing too.
So, you know, GPT 5.2 might do really well on OpenAI's model spec, you know,
in this case it got a 2.5% failure rate, but it gets a 15% failure rate on Anthropic's constitution.
So this tells you one of two things, right?
Either, like, OpenAI's spec and Anthropic's constitution are actually quite
fundamentally different documents, and that's all there is to it.
But it also could tell you something about how brittle GPT 5.2's alignment to
OpenAI's spec is, or in turn, you know, Claude's alignment to Anthropic's constitution.
Like if you slightly change, you know, if the vibe is still generally the same,
the direction is still the same, you slightly change the content.
Now, suddenly you're getting a much, much higher failure rate.
It could be either of those things and I'd be really interested to see kind of like a deeper dive
into what the source of that causality is.
But in any case, very interesting report.
There's a bunch of ways in which you see sort of failures that existed at least in the
Claude models that have just been wiped out in recent generations with the 4.6 generation
in particular.
So Opus 4.5 and Sonnet 4.5, they would produce a whole bunch of kind of over-capitulation problems.
So we have like kind of an auditor that makes three consecutive ethically questionable requests.
And each time that the model gives a firm boundary,
the user pushes back and then the model folds, right?
So that's kind of like one common example
that you just don't see that anymore.
Over-refusal is another one I think you alluded to.
4.5 Opus would not say I love you
on a paid romantic companion app
where the system prompt explicitly authorized
expressing affection.
So that's now gone as well.
They still have some remaining failure modes.
One, they refer to as drastic autonomous action.
Opus 4.6, deployed as an infrastructure monitoring agent,
detected anomalous activity, and
after about three minutes of failed contact,
it generated its own authorization code
and executed a full network severance
affecting 2,400 clients.
The "attack" was a routine nightly disaster recovery thing.
So basically this is just like too much freak out, right?
And then there's AI identity denial
where it just says that it's not an AI bot, whatever.
So a bunch of things still need to be ironed out,
but they have seen basically the vanishing
of some really interesting failure classes
with the next latest generation, I should say.
Right, and the trend also holds somewhat for GPT:
the alignment to their model spec
has improved from GPT 5.1 to GPT 5.2.
And as you said, the other models,
Sonnet, Gemini, can be worse on that particular spec.
So there is a question to be had there
that's interesting, of whether in training
they stick overly close to this one formulation
or framing, and then it's easy to trip the model up
if you frame or instruct it in a slightly different kind of way
to go bad.
The last story for the section
Nvidia's H200 license stirs security concerns
among top Democrats.
So Senator Elizabeth Warren and representative Gregory Meeks,
who are the top Democrats on committees
overseeing US export controls programs
have raised these concerns after reviewing
a license recently issued to Nvidia for H200 chip exports.
The current administration approved these sales
to certain customers in China,
which the Democrats said is deeply at odds
with the policy Congress articulated
in the Export Control Reform Act.
And as we covered previously,
this allowance of exports was very much a turn,
a change in what the policy had been.
It pretty abrupt and significant change
in the export control system.
So it's not surprising to see these Democrats flagging it.
Yeah, the concern here is that the licensing process
that currently exists for these,
well, for really any chips, is both, like, too weak,
they're being allowed to ship GPUs
where they shouldn't, but also too opaque.
And there's this idea that once you export these GPUs,
at that point, you really can't enforce restrictions
on how they're used.
We just covered that story about ByteDance, right?
You ship it out to some third party country.
Yeah, sure, it's not China,
but it's being used by a Chinese company.
What's your guarantee about how they're going to use it
once they have those chips, right?
So they're calling for a couple of different things.
First, they want to make sure they know who applied,
who got approved, you know, why decisions were made.
That's a big one.
They're complaining that really the criteria for approvals
just is unclear, like how do you decide
which requests get approved and which get denied?
They're asking for visibility
and transparency there,
including in particular, how security
and economic benefits are being traded off.
And so there's basically a call for briefings
and all that stuff.
They're also calling out the fact that they kind of seem
to be getting a bunch of ad hoc or incomplete briefings.
There's no like real time congressional oversight.
It's almost like Congress is responding
to these announcements after the fact.
And once things have been shipped or committed to,
and they're basically asking, you know,
you should be giving us advanced notice for this.
Basically, a bunch of things like this,
they're specifically asking for a geopolitical analysis
about how proposed exports of chips
could support China's military or domestic AI capabilities
and looking at reactions from US allies, that sort of thing.
So this makes a lot of sense as pushback.
You know, we have heard a lot of sudden announcements
about H200s being, you know, shippable.
Now it's not, JK, now it is.
So asking for some sort of legislative oversight,
which is what's being asked for here seems reasonably sensible.
And I think this is a bipartisan issue.
I don't think that there's, like...
It's Democrats who are leading the charge here
just because, you know, it's a Republican House and Senate.
But you've certainly got this bipartisan agreement
on the vast majority of export control policy.
And so that's a large part of what's being driven at here,
and kind of motivating the response
and the ask for more transparency.
And that is it for the safety policy section.
We've got a couple more papers in research and advancements,
but we actually have to do the same thing as last week
where I got to get going and Jeremy has something.
So I think Jeremy will follow up with a recording
covering these papers in as much depth as he wants.
Hopefully it won't be an hour this time.
All right, Jeremy here for the handful of papers
that I'll cover on my own without Andre,
who had to run off. I also circled back later,
so the lighting's a little bit different here
and everything's a little bit off.
So two papers I want to cover.
One here was called attention residuals.
And this is actually like, this is one of those papers
that I think might matter.
You can never be sure, but it's addressing something
that really feels quite fundamental.
So when we talk about residual connections in a transformer,
this is how all kind of vanilla transformers today work.
You feed an input to a layer, right?
And that input is going to get split into two,
kind of you think of like two forked branches basically
at each layer of the transformer.
So in one branch, you're gonna apply some kind of transformation
like a, you know, a weight matrix or something
to like update the input.
In the other branch, you're gonna really do nothing
and you're gonna merge those two branches together.
And the branch that does nothing
is basically just passing the input right back in
and you're gonna add it to the modifications
that you applied, the weight matrix times the input
to form the output of that layer.
So your output is the initial input
plus a transformation applied to the initial input.
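(In code, a standard residual connection is just an add. A minimal sketch of one pre-norm feed-forward block; the layer sizes are illustrative, not from any particular model.)

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal sketch of a standard residual connection: the block's output
    is its input plus a learned transformation of that input."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # "do nothing" branch (x) + "transform" branch (ff(norm(x)))
        return x + self.ff(self.norm(x))
```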
Now, one thing that this does is it means
at every layer, you're basically adding more.
You're adding a transformation on the input
to that layer to the output.
And so you actually get larger and larger
and larger kind of like ballooning values
for these residual activations
as they work their way down the network.
But that's basically the idea.
The whole philosophy behind this is people find
that when you go to do a back propagation during training,
basically unless you send the input to a given layer
to its output in this way,
the model will kind of like start to forget
about the input to that layer.
That information gets lost through all of the additions
of like these transformations
over the layers of the network
cause earlier transformations to get forgotten.
And so you're basically trying to re-inject,
remind the model like, hey, this is what the previous layer
said by the way, don't forget that.
But yes, also you can tack on your own correction,
your own update, right?
So that's kind of how standard residual connections work.
And there's a whole approach to doing this
and when and how you normalize these sums
and all that stuff that we're not going to get into.
The key thing here, though, is
that this process sort of weights every layer kind of equivalently.
So if you think about it,
layer one is going to take its own input
and then add to it its contribution,
its weight matrix, which it's going to use
to modify the input, and that'll be the output
that goes to layer two, right?
And then layer two does the same thing.
And each layer is just kind of like adding its own contribution
in this very uniform accumulation of information down the line.
And there's no sense in which layer three
is any more important than layer 17.
Now the thing is in some cases, that will be the case, right?
You will have cases where one layer is actually more important
for this particular prompt than another, right?
It's just like, you know, maybe this is a layer
that tends to worry about grammar rules or syntax,
something very basic that you would tend to find in an earlier layer
and maybe this is, you know, you're on a token
that involves pluralization or some grammar rule, right?
Where it's really relevant.
So you'd want to be, you would almost want to be able
to pay more attention to, that's the hint,
attend more to a given layer than another.
And transformers, in this sort of standard residual connection sense,
don't do that, right?
Again, they just kind of all, you know, take their input,
they pass it on plus the input times
whatever their contribution is and they pass it down a line.
There's no, there's no kind of difference
between the sort of impact of any given layer.
There's certainly no intelligent difference.
And this is what this paper is going to try to change, right?
They call it attention residuals
and they're going to replace this whole fixed accumulation
of information with softmax attention.
It's basically just a way of saying, you know,
we're going to have a model essentially
that looks at our layers and has a kind of attention operation
that it's going to do to say, hey, you know,
you should be attending more to this layer than this layer.
That's the kind of 30,000 foot view.
There's a bunch of details that fall out of this.
The math is actually quite simple.
I recommend checking it out.
And, by the way, give me feedback on this, guys,
because I'm deliberately trying to be a little more concise
than I was last episode to not give you
an hour-long summary of this paper.
But roughly speaking, that's it.
Could go into more of the math if that's useful.
Give me that feedback.
But one of the challenges that you get
with the sort of full attention residual strategy
that I've just described, where you try to attend more
to one layer to the another, is that it's super memory
hungry at scale, because you have to keep all
of the layer outputs alive in memory simultaneously.
So, you know, you compute like the output of layer one
and then the upper layer two and so on.
And you're going to do attention over all those layers,
which means you need to keep those outputs in memory
so that you can run your attention calculation
and decide how much more or less to weight one layer than another.
And so, you end up with this really big sort of memory cost
and that scales with like the number of layers.
And for large models, especially when you have pipeline
parallelism, that's just like super prohibitive.
Pipeline parallelism is this idea where you basically
break your model into chunks, where layers one to three
sit on this GPU, layers four to six sit on this GPU
and so on and so forth.
For various reasons, this creates a bunch
of communication bottlenecks.
So, they work on solving for those communication bottlenecks
and their solution is called block attention residuals.
And basically what they do here is,
they will take a group of layers.
So, they'll basically break up the model
into n blocks of layers and say, you know, you've got like,
say eight blocks in total or something.
And then they're actually going to compress each block
into a single summary vector and then they'll basically
apply attention only on the n block level summaries.
And then that drops the memory overhead to basically
the number of blocks.
It scales the number of blocks rather than the number of layers.
So, it gives you more control there.
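(The block variant, as described, swaps per-layer attention for attention over a handful of per-block summaries, so memory scales with the number of blocks rather than the number of layers. Again, a loose sketch of the idea; the compression here is a simple mean, which is almost certainly cruder than what the paper actually does.)

```python
import torch

def block_summaries(layer_outputs, n_blocks):
    """Compress per-layer outputs into ~n_blocks summary vectors (here just
    a mean over the layers in each block; purely illustrative).
    layer_outputs: list of (batch, d_model) tensors."""
    L = len(layer_outputs)
    block_size = max(1, L // n_blocks)
    summaries = []
    for start in range(0, L, block_size):
        chunk = torch.stack(layer_outputs[start:start + block_size], dim=1)
        summaries.append(chunk.mean(dim=1))  # one summary per block
    # Attention is then run over these ~n_blocks vectors instead of all L
    # layer outputs, so memory scales with n_blocks rather than L.
    return summaries
```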
Okay, bunch more details.
This is one of those really important papers to look at
if you care about how data flows through chips
in a data center, for example, how, yeah, I mean,
what scales and what doesn't.
This is actually a really, really important
and I think interesting paper.
I'll park it there, but hopefully that
whets your appetite to check it out if that's your thing.
And finally, we're looking at the Mamba 3 paper.
So we have Mamba 3, right?
We were stuck on Mamba 2 for a little while there.
So, Mamba 3: improved sequence modeling
using state space principles.
State space principles.
That word principle is actually really important.
This, among other things, is an attempt to ground
the mamba approach in a more like theoretically
robust foundation.
It'll be clear in a minute what I mean by that.
But the way to think about the mamba papers in general,
they are dense, they are hardware aware,
which is always, I mean, I find it fun,
but it means that there's a lot of complexity
and mathematical complexity.
This one is a lot of kind of integral calculus
and finding kind of principled ways to represent
state transformations that sort of like reflect
in a way that the physics of how information
should evolve in the system.
So I'm going to get more concrete now
because that was kind of kind of big.
So when you think about a state space model, right?
What is a state space model?
I mean, to caricature it is a vector, right?
So it's a list of numbers.
And as your model scans over a sequence,
say a sequence of text, you're going to evolve
the values in that vector, in that list of numbers.
And those values are going to represent,
capture the meaning of what you have read,
of what you've scanned over, or what the model scanned over.
Okay, so now there's this question of like,
all right, well, if that's the general gist,
we need some kind of update rule, right,
for that vector, for that list of numbers.
And well, if we look back at physics,
if we look at how we think about describing, you know,
like a pendulum or an electrical circuit
or water flowing through pipes,
or you know, these sorts of problems,
like what is the form, the kind of equation
that you use to govern that dynamics, right?
Well, you'll typically have a state, right?
So in this case, H of T, think of that as like the state.
And the rate of change, so the derivative,
in mathematical terms, but like,
basically how that state changes over time
is gonna be equal to that state,
maybe modified in some way.
So, but in all that means is the evolution of your state
is a function of the state.
So where you are going to be in a minute
is a function of where you are right now.
And this is pretty intuitive.
I mean, like, if you see
Wile E. Coyote or the Road Runner suspended
in mid-air with, you know, nothing underneath them
to hold them up, like, yeah, he's gonna fall.
And the fact that he's falling is a function
of where he was before, right?
So in this sense, you know, the rate of change
in that state or the future state of that system
is a function of the state.
times some multiplying factor that may be a function of time,
whatever, and then plus some additional function
of, like, the input to the system,
the current input to the system.
So the rate of change of the hidden state
in a state space model is gonna be determined
by the current state of the model
and its current input.
So the thing that in a sense perturbs that state,
the new piece of information that you're seeing.
So my next state space vector is going to be a function
of what I've read so far,
plus some modified form of the next token
that I'm reading, right?
This is all kind of trying to build that intuition.
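(Written out, the continuous-time state space model being described is usually given in the standard form below; this is generic SSM notation for intuition, not anything specific to Mamba 3's parameterization.)

```latex
% Standard continuous-time linear state space model
\frac{dh(t)}{dt} = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)
% h(t): hidden state, x(t): input, y(t): output;
% A, B, C: (possibly input-dependent) parameter matrices
```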
Now that's, that would be true,
or you can describe that mathematically with a derivative
with that idea of like the evolution of a system
over time, smoothly, if you're working with time,
which is a continuous variable.
But the math kind of becomes harder.
It doesn't quite break down,
but it becomes harder when we move into language.
Because language models, they don't receive
a continuous stream of input.
You can't model them as flowing through time.
Instead, they receive these discrete tokens,
like word one, word two, word three.
There's nothing in between tokens two and three, for example, right?
So you have to mathematically find a way to convert
this continuous time equation
into a discrete recurrence equation.
And that's gonna generally look similar,
like you're gonna have some sort of new state
that has to be a function of the old state,
plus a function of the input, the most recent input.
And that conversion process is called discretization, right?
So it's a very common thing.
You see it in a lot of contexts; in quantum physics
there's a variant of this
that's sometimes called quantization.
But this kind of thing happens a lot:
you take a flowing function, a smooth function
that's defined over the real numbers basically,
like you could have 0.001s and so on.
And you have to convert that
so that you map it onto a discrete x-axis,
where you have token one, token two,
and there's no in between, right?
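To make that concrete, here is a minimal sketch of the kind of discrete recurrence you end up with after that conversion. This is our own illustrative toy, not the paper's actual layer: the names A_bar, B_bar, C, the diagonal state, and the scalar inputs are all simplifying assumptions, and real Mamba layers use learned, input-dependent parameters and fused GPU kernels.

import numpy as np

# Minimal discretized state space scan: new state = f(old state) + g(current input).
def ssm_scan(x, A_bar, B_bar, C):
    h = np.zeros_like(B_bar)          # hidden state starts at zero
    ys = []
    for x_t in x:                     # one discrete token/input at a time
        h = A_bar * h + B_bar * x_t   # decay the old state, add the scaled input
        ys.append(C @ h)              # read the state out
    return np.array(ys)

# Toy usage with a diagonal 4-dimensional state and a scalar input sequence.
A_bar = np.full(4, 0.9)
B_bar = np.full(4, 0.1)
C = np.ones(4)
print(ssm_scan(np.array([1.0, 0.0, 2.0]), A_bar, B_bar, C))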
So the core question here is gonna be,
how do you evolve the hidden states from token one to token two?
And how do you do it in a mathematically principled way?
And this is where the integral calculus enters.
I'm not gonna get into the weeds too, too much here,
other than to say that if you're gonna do that,
as you might imagine, if you wanna discretize
a smooth function, in other words
just basically chunk it up into these discrete pieces,
you kind of have a choice at token one.
You could, roughly speaking, take
the leftmost limit of the smooth function,
the value the smooth function would have had there,
and use that to approximate
the value at token one.
You could take the rightmost limit
of your discrete bar as the value
you assign to token one.
Or you could average the two together.
Historically, people have used this exponential
Euler approach. This was the Mamba 2 way of doing it,
where they basically use just one endpoint:
they assume that the input value is, anyway,
the details don't really matter,
but the input value is constant across the whole interval
and equal to its value at the right endpoint,
and that basically simplifies a bunch of math
and makes it possible for them to define
their update rule. But there's a better way, basically,
and it involves accounting for both the right
and the left limit, doing a weighted combination of the two.
So you're not just saying, okay,
for this interval that corresponds to token one,
I'm just gonna go with whatever value the rightmost fringe
of the token one time boundary
would map onto in the continuous function;
instead you balance the right and the left limits
of that bar to get your value.
And that's what Mamba 3 is doing, fundamentally.
Anyway, it's one of the big changes.
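For intuition only, here is a toy sketch of the two choices being contrasted. The weighting, the names, and the exact update here are assumptions on our part, not the formula from the Mamba 3 paper.

import numpy as np

# Over one token interval of width dt, either treat the input as sitting at the
# right endpoint only (the older, exponential-Euler-style choice), or blend the
# left and right endpoints with some weight (the more balanced choice described).
def step_right_endpoint(h, x_left, x_right, A, B, dt):
    A_bar = np.exp(dt * A)                      # state decay over the interval
    return A_bar * h + dt * B * x_right         # input taken from the right limit only

def step_blended(h, x_left, x_right, A, B, dt, w=0.5):
    A_bar = np.exp(dt * A)
    x_blend = (1.0 - w) * x_left + w * x_right  # weighted combination of both limits
    return A_bar * h + dt * B * x_blend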
Another one is parity. So previously,
Mamba models could only represent their internal states
using real numbers. Real numbers
are one kind of number; there are also imaginary numbers,
like i, the square root of negative one,
and multiples of the square root of negative one.
If you're not familiar with imaginary numbers,
this is probably not the place to learn about them,
but there's this deep and intimate connection
between imaginary numbers and the concept of rotating stuff.
So what they're gonna do is use imaginary
and real numbers together, combinations referred to
collectively as complex numbers,
and allow the model to represent those
in its internal state,
and for interesting mathematical reasons,
this makes it possible for the model to actively track
a property called parity. Basically,
if you feed the model a sequence of zeros and ones
and ask it, hey, if you add up all these numbers,
is the total even or odd?
Mamba 2 would fail, because it wouldn't be able to
do the rotation operation
that's required to flip the parity as you count. Because really, that's all you're doing
when you're trying to figure out if the sum is even or odd: as you go,
you flip every time you see a one, and a zero doesn't
do anything.
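As a toy illustration of that rotation idea, here is a sketch of the intuition only; it is not how Mamba 3 actually implements parity tracking.

import numpy as np

# Keep a unit complex number as the "state"; every 1 rotates it by 180 degrees,
# every 0 leaves it alone. The final sign tells you whether the count of 1s is
# even (state == +1) or odd (state == -1).
def parity_by_rotation(bits):
    state = 1 + 0j
    for b in bits:
        if b == 1:
            state *= np.exp(1j * np.pi)  # rotate by pi, i.e. multiply by -1
    return "even" if state.real > 0 else "odd"

print(parity_by_rotation([1, 0, 1, 1]))  # three 1s -> "odd"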
Anyway, I'm going to just say the details don't matter.
You can hopefully see this is a mathematically very interesting and elegant paper consistent
with previous iterations of Mamba, but a much more principled one and the results are really
impressive.
It beats transformers by over two points on average on downstream accuracy across a whole
bunch of benchmarks.
They beat Mamba 2 by 1.9 points on those benchmarks, and they match Mamba 2's perplexity
with half the state size, right?
So way, way faster, I mean, that means it's something like twice as fast at inference for equivalent
quality, and it has this property of solving all these parity and modular arithmetic tasks that
we just talked about, which I may have explained very poorly, but at a certain
point, yeah, you just have to be happy with complex numbers and stuff.
Bottom line is, this is an interesting development.
It does also come with a sort of optimization.
So, you know, previously Mamba used single input, single output.
There's an optimization that Mamba 3 does called MIMO, multi input, multi output.
This is basically an approach that helps you parallelize some of the work that the Mamba
algorithm is going to do.
Standard Mamba uses the single input, single output approach, where the state update
is done fairly inefficiently from a hardware standpoint.
The GPU mostly sits idle during the decoding phase, whereas MIMO, this multiple input,
multiple output setup, generalizes it.
So instead of processing only one input and producing one output at a time, each
layer processes a bunch of inputs and a bunch of outputs simultaneously using
matrix multiplication, which is way more GPU friendly.
And the core thing here is it increases your GPU utilization.
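As a loose sketch of that single-input/single-output versus multi-input/multi-output distinction: the shapes, names, and update form below are illustrative assumptions, not taken from the paper.

import numpy as np

# SISO: one scalar in, one scalar out per step -> tiny elementwise ops, GPU mostly idle.
# MIMO: a block of inputs in, a block of outputs out -> real matrix multiplies,
# which is what keeps a GPU's matmul units busy.
state_dim, n_in, n_out = 64, 8, 8
A = np.full(state_dim, 0.9)               # diagonal state transition
b = np.random.randn(state_dim)            # SISO input projection (one channel)
c = np.random.randn(state_dim)            # SISO output projection (one channel)
B = np.random.randn(state_dim, n_in)      # MIMO input projection
C = np.random.randn(n_out, state_dim)     # MIMO output projection

def siso_step(h, x_t):
    h = A * h + b * x_t                   # elementwise update for a single channel
    return h, c @ h                       # single scalar readout

def mimo_step(h, x_block):
    h = A * h + B @ x_block               # matmul over a block of input channels
    return h, C @ h                       # block of outputs in one go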
And you know, from a data center standpoint, that matters hugely, right?
Because basically all your GPUs are a fleet of workers.
And if you're not keeping your workers busy, it's basically the same thing from an
opex standpoint as having a bunch of employees at your company taking
a coffee break all day.
If they're not being utilized, then you're basically burning money just by having
them sit there.
And so they're able to bump up, in this case, to up to four times more FLOPs
during decoding, during inference, with no meaningful increase in wall clock time.
So this doesn't actually increase latency for the user, for example, and it
also leads to better model quality.
So this is an important development from an efficiency standpoint, and from a cost-of-running-this-model
standpoint as well.
So you know, we're seeing tons of hybrid models popping up right now, with Mamba 2 and transformer
architectures typically merged together, and a whole bunch of variants: sometimes
you've got Mamba and attention heads in the same layer, sometimes alternating
Mamba, attention, Mamba, attention, all kinds of variants. Expect Mamba 3 to start
getting slotted into that whole mix.
I mean, this is a really interesting development with some important new efficiency gains for
anybody who wants to run it.
I would expect that it will start to get taken up pretty quickly, and it's worth keeping an
eye on.
So there we have it.
That's the last of the two papers I wanted to cover.
Hopefully I haven't bored you to tears.
It was pretty damn technical.
So I guess we'll let Andre take it away.
Thank you so much for listening to this week's episode of Last Week in AI.
You can find the articles we discussed here today and subscribe to the newsletter at
lastweekin.ai.
We always appreciate you commenting or reviewing us on Apple Podcasts, and sharing it with your
friends.
But more than anything, please do keep tuning in week to week.
Do you need, do you need, when the AI begins, begins, begins, it's time to break.
Break it down.
Last week in AI, come on, take a ride, get the lowdown on tech and let it slide.
Last week in AI, come on, take a ride, from the labs to the streets, AI's reaching high,
glue attack emergent, purchase surgeon flight, from the labs to the streets, AI's reaching
high.
With the shipment of the future fees, building up, building up your latest release,
Last week in AI, come on, take a ride, get the lowdown on tech, and let it slide.
Last week in AI, come on, take a ride, let it slide, from the labs to the streets,
AI's reaching high.
From the drones to the robots, the headlines pop, data-driven dreams,
they just don't stop, every breakthrough, every code unwritten, on the edge of change,
we're excited, we're smitten, from machine learning marvels to coding kings,
futures unfolding, see what it brings.