r/singularity • u/Present-Boat-2053 • 1d ago
LLM News Llama 4 Maverick is lmarena maxed and in reality worse than models that are half a year old
72
u/ezjakes 1d ago
This is typically the case with small models: they do poorly under style control. Still, matching SOTA from some months ago with a 17B-active-parameter model is not awful.
47
u/meister2983 1d ago
It's a 400 billion parameter total model. That's not small.
40
u/Recoil42 1d ago
- It's a 17B-active MoE; it doesn't run like a 400B-parameter model.
- It's natively multimodal with a 1M context window.
- It beats 2.0 Flash Thinking.
Some of y'all are never happy.
1
u/BriefImplement9843 3h ago
It's not even CLOSE to Flash Thinking. 2.0 Flash also blows it out of the water. It's competing with Llama 3.1 and Mistral Small. This thing forgets the conversation after 10 prompts.
13
u/Proud_Fox_684 1d ago
True, but it is the smallest model on that list, and most of them are reasoning models.
That leaves only GPT-4o, GPT-4.5, DeepSeek-V3, and Grok 3 as the non-reasoning ones. GPT-4o, GPT-4.5, and Grok 3 are all significantly larger; only DeepSeek is roughly the same size (DeepSeek-V3 and R1 both use 37 billion active parameters out of 671 billion total). Llama 4 Maverick uses 17 billion active parameters out of 400 billion total.
5
u/meister2983 1d ago
How do you know 4o's parameter count?
12
u/Proud_Fox_684 1d ago
Well I don't, ofc. It's an assumption :P But I do know that GPT-4 was 1.8 trillion parameters total, and GPT-4o is basically a multimodal model released after that. Epoch then did an analysis based on inference cost and requests per minute, and they believe it's somewhere in the hundreds of billions of parameters. Smaller than GPT-4 according to them, but still large.
0
u/Present-Boat-2053 1d ago
I appreciate it being good but the normal score is just inflated
17
u/Proud_Fox_684 1d ago
Given that it's not a reasoning model and that it only has 17 billion parameters active per token (out of 400B total), its performance is really good. All the models that rank higher than Maverick on that list are larger, in some cases significantly so.
The only ones on that list that come close to Llama 4 Maverick in parameter size are DeepSeek-R1 and DeepSeek-V3, each with 37 billion parameters active out of 671B total.
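A quick back-of-the-envelope on those figures (using the parameter counts quoted in this thread, which for some models are estimates rather than official numbers):

```python
# Active vs. total parameters per token for the MoE models discussed above.
# Figures are the ones quoted in this thread, not verified disclosures.
models = {
    "Llama 4 Maverick": (17, 400),   # (active B, total B)
    "DeepSeek-V3 / R1": (37, 671),
}
for name, (active_b, total_b) in models.items():
    print(f"{name}: {active_b}B of {total_b}B parameters used per token "
          f"({active_b / total_b:.1%})")
```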
They also said that Llama 4 Behemoth hasn't finished training yet. It's the parent model these smaller versions are distilled from, so maybe they'll improve again once Behemoth finishes training and is distilled down.
14
u/Pleasant-PolarBear 1d ago
What exactly is style control?
3
u/chilly-parka26 Human-like digital agents 2026 1d ago
No markdown/emojis, just raw text.
27
u/RipleyVanDalen We must not allow AGI without UBI 1d ago
That's not quite what it means. It means they account for things like response length and number of markdown headers, and make sure those don't bias the rating.
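Roughly, you can think of it as a regression adjustment. Here's a toy sketch of the idea (my own illustration, not LMArena's actual code or data): fit a Bradley-Terry-style logistic model on pairwise votes with style-difference features as extra covariates, so the per-model coefficients estimate strength net of length/formatting effects.

```python
# Toy sketch of style control: logistic (Bradley-Terry-like) model on pairwise
# votes, with style-difference covariates so model strength is estimated net of
# style. All data here is synthetic; features and numbers are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_models, n_battles = 4, 5000

# Random pairings: model a vs model b, outcome = 1 if a's answer wins the vote.
a = rng.integers(0, n_models, n_battles)
b = rng.integers(0, n_models, n_battles)
a, b = a[a != b], b[a != b]

# Style differences between the two answers (e.g. length, number of headers).
len_diff = rng.normal(0, 1, a.size)
header_diff = rng.normal(0, 1, a.size)

# Synthetic ground truth: real strengths plus a bias toward longer answers.
strength = np.array([0.0, 0.3, 0.6, 1.0])
logit = strength[a] - strength[b] + 0.4 * len_diff + 0.2 * header_diff
y = (rng.random(a.size) < 1 / (1 + np.exp(-logit))).astype(int)

# Design matrix: +1/-1 indicators for the two models, plus the style columns.
X = np.zeros((a.size, n_models + 2))
X[np.arange(a.size), a] = 1.0
X[np.arange(a.size), b] = -1.0
X[:, -2] = len_diff
X[:, -1] = header_diff

fit = LogisticRegression(fit_intercept=False).fit(X, y)
print("style-controlled strengths:", fit.coef_[0][:n_models])
print("estimated style effects (length, headers):", fit.coef_[0][n_models:])
```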
11
u/imDaGoatnocap ▪️agi will run on my GPU server 1d ago edited 1d ago
In reality, nobody cares about LMArena
Math/coding benchmarks, context window, and token pricing are the only relevant metrics.
16
28
u/Pyros-SD-Models 1d ago
What do you mean? People who actually develop applications with LLMs absolutely care about LMArena. We have multiple apps with >100k daily users, and LMArena is the most accurate user-preference check out there; it also matches our internal A/B tests.
The general population doesn't care about math, coding, context window size, or any of that. Nobody cares if the airline support chatbot can do good math, but people definitely care if it can generate visually appealing markdown, is easy to understand, and makes the user feel good.
Sometimes this sub is hilarious: "I don't understand the use case of this particular benchmark, so the only explanation is that the benchmark is stupid; it can't be me."
1
u/RMCPhoto 4h ago
They shouldn't, unless we can be guaranteed that the "default" system prompt is used, or unless the system prompt is required to be shared.
-16
u/imDaGoatnocap ▪️agi will run on my GPU server 1d ago edited 1d ago
Here is what I mean, and I will be abundantly clear
This sub is called r/singularity; we are interested in developing superintelligence, not slop SaaS apps with an "AI ✨" button.
(no hate to u or your apps btw, make that money)
-9
u/0xFatWhiteMan 1d ago
All the models are extremely good now and LMArena seems outdated. I'm developing an app and I don't care about it.
Pricing, window size, API: that's what I care about.
1
u/haha0542 3h ago
That's simply not true. Window size without proper evaluation is meaningless. "Lost in the middle" is already a well-known issue for these LLMs: you can have an infinite window, but it turns out Llama 4 tends to forget easily, so what's the point of a 10M context?
I'm curious what kinds of apps or products can be, or have been, built on vanilla Llama models.
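For what it's worth, a "proper evaluation" here can be as simple as a needle-in-a-haystack probe. A minimal sketch below, assuming a hypothetical generate(prompt) wrapper around whichever model you want to test (stubbed out here, not a real API):

```python
# Needle-in-a-haystack probe: hide a fact at varying depths in ever-longer
# filler text and check whether the model can still retrieve it. `generate`
# is a stub; replace it with a call to whatever model/API you're testing.
def generate(prompt: str) -> str:
    return ""  # plug in your model client here

FILLER = "The sky was grey and nothing of note happened that day. "
NEEDLE = "The secret passphrase is {code}. "

def recalls_needle(n_filler: int, depth: float, code: str = "blue-kangaroo-42") -> bool:
    sentences = [FILLER] * n_filler
    sentences.insert(int(depth * n_filler), NEEDLE.format(code=code))
    prompt = "".join(sentences) + "\n\nWhat is the secret passphrase?"
    return code in generate(prompt)

# Sweep haystack size and needle depth to see where recall falls off.
for n in (1_000, 10_000, 100_000):   # roughly 12 tokens per filler sentence
    for depth in (0.1, 0.5, 0.9):
        print(f"{n} sentences, depth {depth}: {recalls_needle(n, depth)}")
```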
1
u/MINIMAN10001 17h ago
I mean, even then, math/coding is only one possible use of an LLM. There are numerous focuses an LLM can have, and tracking down what is considered SOTA for a specific use case is important.
Context window and token price are simply two other metrics that at the very least have wide applicability across many use cases.
1
u/imDaGoatnocap ▪️agi will run on my GPU server 17h ago
You need math and code to automate ML research, which is key for acceleration.
1
0
u/cashmate 23h ago
There are too many people shitting on LMArena because their favorite autistic math/coding models aren't scoring as high there as they expect.
1
u/RMCPhoto 4h ago
LMarena is "fun" and you get an idea for the big differences between some classes of models. Once it gets into the sub 100 point difference it's entertaining to see how all of the different companies try to edge ahead, but it's not necessarily the top model that takes first place.
It's like F1, there's the machine and the driver and the fastest car doesn't always win. Similarly, lmarena can be gamed pretty heavily via the system prompt to provide the style of answer that people prefer more often.
Answers with a lot of formatting that are longer are generally voted for more often even if they aren't technically as good.
2
u/Professional_Job_307 AGI 2026 1d ago
The score on GPQA Diamond says otherwise. But I'm open to seeing more benchmarks; all we have now are benchmarks from Meta themselves.
2
u/Dear-One-6884 ▪️ Narrow ASI 2026|AGI in the coming weeks 1d ago
Maverick's competition seems to be Gemini 2.0 Flash, which is already pretty GOATed. If it can beat it and run locally, then it's a game changer.
3
2
1
1
u/meister2983 1d ago edited 1d ago
It's doing pretty well on hard prompts with style control, tied with 3.7 Sonnet Thinking, but the confidence interval is huge. Coding is strong as well.
But again, who knows. LMArena's usefulness has been breaking down over the last few months. I want to see LiveBench.
I played around with it on meta.ai. Not really impressed. Sonnet 3.6 level maybe?
1
0
u/Healthy-Nebula-3603 1d ago
Not surprised... Looking at benchmarks, Scout 109B is worse than Llama 3.3 70B...
3
u/Aggressive-Physics17 1d ago
Saying that 4 Scout is worse on benchmarks than 3.3 70B isn't accurate, because:

MMMU / MMMU Pro / MathVista / ChartQA / DocVQA:
- 69.4%, 52.2%, 70.7%, 88.8%, 94.4% (Llama 4 Scout)
- Not multimodal (Llama 3.3 70B & Llama 3.1 405B)

LiveCodeBench (pass@1):
- 33.3% (Llama 3.3 70B) - +1.5% over 4 Scout
- 32.8% (Llama 4 Scout)

MMLU-Pro:
- 74.3% (Llama 4 Scout) - +1.4% over 3.1 405B
- 73.3% (Llama 3.1 405B) - +6.4% over 3.3 70B
- 68.9% (Llama 3.3 70B)

GPQA Diamond:
- 57.2% (Llama 4 Scout) - +12.8% over 3.1 405B
- 50.7% (Llama 3.1 405B) - +0.4% over 3.3 70B
- 50.5% (Llama 3.3 70B)
-1
u/Healthy-Nebula-3603 1d ago
But you're not considering that Scout is 50% bigger and much newer, so it should be more advanced.
0
u/RMCPhoto 4h ago
Scout is a MoE model with lower computational costs due to its 17B active parameters, so it's easier for the big companies to host. The 10-million-token context is also interesting and could unlock some very unique use cases if it's functional and has an efficient caching method (think the entry point to a RAG system, with a large reasoning model as the output step).
Llama 4 is an improvement over Llama 3.
Llama 3.3 also came out in December 2024; Llama 3 came out in April 2024.
Llama 3.3 was a significant improvement, and I suspect there will be some big improvements to Llama 4 soon enough.
It's possible to run Llama 4 on a CPU and RAM for less money and lower electricity costs than it takes to run Llama 3.3 on 2x 3090s or more.
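Rough numbers behind that last point (my own back-of-the-envelope, assuming ~4-bit quantized weights and ignoring KV cache and runtime overhead):

```python
# Approximate memory just for the quantized weights, at ~4 bits per parameter.
# Parameter counts are the ones quoted in this thread, not official specs.
BYTES_PER_PARAM_Q4 = 0.5

def weight_gb(params_billion: float) -> float:
    return params_billion * BYTES_PER_PARAM_Q4

print(f"Llama 4 Scout (109B total, 17B active): ~{weight_gb(109):.0f} GB of cheap system RAM, "
      f"with only ~{weight_gb(17):.0f} GB of weights read per token")
print(f"Llama 3.3 70B (dense): ~{weight_gb(70):.0f} GB, i.e. roughly two 24 GB 3090s")
```

Cheap DDR RAM plus the low active-parameter count is why the MoE can be the cheaper box to run, even though its total weight footprint is bigger.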
-4
u/Present-Boat-2053 1d ago
Look at this 100-year-old GPT-4o 11-20 being better.
6
u/iperson4213 1d ago
4o 2024-11-20, as the name suggests, was released in November 2024, under 5 months ago.
2
34
u/why06 ▪️ still waiting for the "one more thing." 1d ago
Eh, I still want to see a Llama 4, especially if they implement half the stuff I see in all their papers.