r/singularity 1d ago

LLM News: Llama 4 Maverick is lmarena-maxed and in reality worse than models that are half a year old

229 Upvotes

44 comments

34

u/why06 ▪️ still waiting for the "one more thing." 1d ago

Eh, I still want to see a Llama 4, especially if they implement half the stuff I see in all their papers.

29

u/Present-Boat-2053 1d ago

10M context window is crazy

8

u/why06 ▪️ still waiting for the "one more thing." 1d ago

Yeah, I'm just now reading the details. 128 experts. That's wild.
I was out when I wrote that. I didn't know it was fully released. 2T on the Behemoth. Wild.
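Rough back-of-the-envelope for how ~17B active out of ~400B total falls out of a 128-expert MoE: only the router-selected experts run per token, so active params are a small slice of the total. Every hyperparameter below is a guess picked to land near the reported sizes, not Meta's published config:

```python
# Illustrative MoE parameter bookkeeping. All numbers are assumptions
# chosen to land near Maverick's reported 17B-active/400B-total;
# this is NOT Meta's published Llama 4 config.
N_EXPERTS  = 128     # experts per MoE layer (reported)
TOP_K      = 2       # experts routed per token (assumption)
MOE_LAYERS = 24      # layers that are MoE rather than dense (assumption)
EXPERT_P   = 126e6   # params per expert FFN (assumption)
DENSE_P    = 12e9    # attention, embeddings, dense layers (assumption)

# Total counts every expert; active counts only the routed ones.
total  = DENSE_P + MOE_LAYERS * N_EXPERTS * EXPERT_P
active = DENSE_P + MOE_LAYERS * TOP_K * EXPERT_P

print(f"total  ~ {total / 1e9:.0f}B params")   # ~399B
print(f"active ~ {active / 1e9:.0f}B params")  # ~18B per token
```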

3

u/Proof_Cartoonist5276 7h ago

Apparently it’s ass after like 16k tokens

2

u/Luuigi 17h ago

That's the most disappointing part - when will they actually use LCM, BLT, or latent reasoning? I want to see these things scaled, not one more MoE that doesn't really outperform other open-source models, especially with Qwen 3 and R2 around the corner.

1

u/sdmat NI skeptic 12h ago

Research-to-published-model has a long lead time - they can't just throw every idea into the next model. Each needs to be validated at intermediate scales, tested for compatibility with other architectural changes, etc.

72

u/ezjakes 1d ago

This is just typically the case with small models: they do poorly with style control. However, SOTA from some months ago in a 17B-active-parameter model is not awful.

47

u/meister2983 1d ago

It's a 400 billion parameter total model. That's not small. 

40

u/Recoil42 1d ago
  • It's a 17B MoE, it doesn't run like a 400B parameter model.
  • It is native multimodal with a 1M context window.
  • It beats 2.0 Flash Thinking.

Some of y'all are never happy.

1

u/BriefImplement9843 3h ago

it's not even CLOSE to flash thinking. 2.0 flash also blows it out of the water. it's competing with llama 3.1 and mistral small. this thing forgets the conversation after 10 prompts.

13

u/Proud_Fox_684 1d ago

True, but it is the smallest model on that list, and most of them are reasoning models.
That leaves only GPT-4o, GPT-4.5, DeepSeek-V3, and Grok 3 as the non-reasoning ones. GPT-4o, GPT-4.5, and Grok 3 are all significantly larger. Only DeepSeek is roughly the same size (DeepSeek-V3 and R1 both use 37 billion active parameters out of 671 billion total). Llama 4 Maverick uses 17 billion active parameters out of 400 billion total.

5

u/meister2983 1d ago

How do you know 4o's parameter count? 

12

u/Proud_Fox_684 1d ago

Well, I don't, ofc. It's an assumption :P But I do know that GPT-4 was 1.8 trillion parameters total. GPT-4o is basically a multimodal model released after that. Epoch then did an analysis based on cost and requests per minute, and they believe it was somewhere in the hundreds of billions of parameters. Smaller than GPT-4 according to them, but still large.

0

u/Present-Boat-2053 1d ago

I appreciate it being good, but the normal (non-style-controlled) score is just inflated.

17

u/Proud_Fox_684 1d ago

Given that it's not a reasoning model and that it only has 17 billion parameters active per token (out of 400B total), its performance is really good. All the models that rank higher than Maverick on that list are larger than Llama 4 Maverick, in some cases significantly larger.

The only ones on that list that are close to Llama 4 Maverick in parameter size are DeepSeek-R1 and DeepSeek-V3, each with 37 billion parameters active out of 671B total.

They also said that Llama 4 Behemoth hasn't finished training yet. It's the parent model of these smaller distilled versions, so maybe they will improve too after Behemoth finishes training and is distilled down to smaller versions again.

14

u/Pleasant-PolarBear 1d ago

What exactly is style control?

3

u/chilly-parka26 Human-like digital agents 2026 1d ago

No markdown/emojis, just raw text.

27

u/RipleyVanDalen We must not allow AGI without UBI 1d ago

That's not quite what it means. It means they account for things like response length and number of markdown headers, and make sure those don't have a biasing influence on the rating.
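Concretely (my paraphrase of how lmarena describes it, not their actual code): refit the pairwise ratings with style features as extra covariates, so formatting bias gets absorbed by the style coefficients and the model coefficients become the style-controlled ranking. A toy sketch:

```python
# Toy sketch of style control: a Bradley-Terry-style logistic regression
# where each battle gets model indicators PLUS style covariates (here,
# length and markdown-header differences, A minus B). The style terms
# soak up formatting bias; the model terms are the controlled strengths.
# The battle data below is made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([
    # model_A  model_B  len_diff  header_diff
    [   +1,     -1,      0.8,      1.0],   # A answered longer and fancier
    [   +1,     -1,      0.6,      0.5],
    [   -1,     +1,     -0.7,     -1.0],
    [   +1,     -1,      0.9,      1.0],
])
y = np.array([1, 1, 0, 1])  # 1 = the "+1" model won the battle

fit = LogisticRegression().fit(X, y)
print("style-controlled strengths:", fit.coef_[0][:2])
print("style effects (length, headers):", fit.coef_[0][2:])
```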

11

u/imDaGoatnocap ▪️agi will run on my GPU server 1d ago edited 1d ago

In reality, nobody cares about LMArena

Math/coding benchmarks, context window, and token pricing are the only relevant metrics

16

u/ezjakes 1d ago

I think both have their place, but if you are interested in using them for tasks requiring high intelligence or large amounts of data, then sure, benchmarks matter much more.

28

u/Pyros-SD-Models 1d ago

What do you mean? People who actually develop applications with LLMs absolutely care about LMArena. We have multiple apps with >100k daily users, and LMArena is the most accurate user-preference check out there; it also matches our internal A/B tests.

The general population doesn't care about math, coding, context window size, or any of that. Nobody cares if the airline support chatbot can do good math, but people definitely care if it can generate visually appealing markdown, is easy to understand, and makes the user feel good.

Sometimes this sub is hilarious: "I don't understand the use case of this particular benchmark, so the only explanation is that the benchmark is stupid; can't be me."

1

u/RMCPhoto 4h ago

They shouldn't, unless we can be guaranteed that the "default" system prompt is used, or the system prompt is required to be shared.

-16

u/imDaGoatnocap ▪️agi will run on my GPU server 1d ago edited 1d ago

Here is what I mean, and I will be abundantly clear

This sub is called r/singularity. We are interested in developing superintelligence, not slop SaaS apps with an "AI ✨" button

(no hate to u or your apps btw, make that money)

-9

u/0xFatWhiteMan 1d ago

All the models are extremely good now; lmarena seems outdated. I am developing an app and I don't care about it.

Pricing, window size, API - that's what I care about

1

u/haha0542 3h ago

That's simply not true. Window size without proper evaluation is meaningless. "Lost in the middle" is already a well-known issue for these LLMs: you can have an infinite window, but it turns out Llama 4 tends to easily forget, so what's the point of having a 10M context?
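This is cheap to check yourself, by the way. A minimal "lost in the middle" probe - the endpoint, model name, and passphrase below are placeholders, not a real deployment:

```python
# Minimal needle-in-a-haystack probe for "lost in the middle".
# Assumes an OpenAI-compatible endpoint serving the model locally;
# base_url and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

FILLER = "The sky was grey and nothing of note happened. " * 2000  # ~24k tokens
NEEDLE = "The secret passphrase is BLUE-PELICAN-42."

for position in ("start", "middle", "end"):
    if position == "start":
        doc = NEEDLE + " " + FILLER
    elif position == "middle":
        half = len(FILLER) // 2
        doc = FILLER[:half] + NEEDLE + " " + FILLER[half:]
    else:
        doc = FILLER + NEEDLE

    reply = client.chat.completions.create(
        model="llama-4-maverick",  # placeholder model name
        messages=[{"role": "user",
                   "content": doc + "\n\nWhat is the secret passphrase?"}],
    )
    text = reply.choices[0].message.content
    print(position, "->", "FOUND" if "BLUE-PELICAN-42" in text else "missed")
```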

I'm curious what kind of app or products can be or have been built on vanilla Llama models.

1

u/MINIMAN10001 17h ago

I mean, even then, math/coding is only one possible use of an LLM. There are numerous focuses an LLM can have, and tracking down what is considered SOTA for a specific use case is important.

Context window and token price are simply two other metrics that at the very least have a wide scope across many use cases.

1

u/imDaGoatnocap ▪️agi will run on my GPU server 17h ago

You need math and code to automate ML research which is key for acceleration.

1

u/jjonj 1d ago

Pokémon might be the benchmark I care the most about now. Raw intelligence alone won't take us to AGI

0

u/cashmate 23h ago

There are too many people shitting on LMarena because their favorite autistic math/coding models aren't scoring as high on there as they expect.

1

u/RMCPhoto 4h ago

LMArena is "fun" and you get an idea of the big differences between some classes of models. Once it gets into sub-100-point differences, it's entertaining to see how all the different companies try to edge ahead, but it's not necessarily the top model that takes first place.

It's like F1, there's the machine and the driver and the fastest car doesn't always win. Similarly, lmarena can be gamed pretty heavily via the system prompt to provide the style of answer that people prefer more often.

Longer answers with a lot of formatting generally get voted for more often, even if they aren't technically as good.

2

u/Professional_Job_307 AGI 2026 1d ago

The score on GPQA Diamond says otherwise. But I'm open to seeing more benchmarks; all we have now are benchmarks from Meta themselves.

2

u/Dear-One-6884 ▪️ Narrow ASI 2026|AGI in the coming weeks 1d ago

Maverick's competition seems to be Gemini 2.0 Flash, which is already pretty GOATed. If it can beat it + is local then it's a game changer.

3

u/Present-Boat-2053 1d ago

Let's normalize using style control

2

u/Present-Boat-2053 1d ago

Gemini exp 1206 is half a year old😢

1

u/Notallowedhe 1d ago

Use LiveBench, not lmarena

1

u/meister2983 1d ago edited 1d ago

It's doing pretty well on hard prompts style-controlled, tied with 3.7 Sonnet thinking, but the confidence interval is huge. Coding is strong as well.

But again, who knows. LMArena's usefulness has been breaking down over the last few months. I want to see LiveBench.

I played around with it on meta.ai. Not really impressed. Sonnet 3.6 level, maybe?

1

u/Ok-Weakness-4753 1d ago

IT'S ABOUT EFFICIENCY. 17B, MAN! 17B! Compare it with QwQ 32B

0

u/Healthy-Nebula-3603 1d ago

Not surprised... Looking at benchmarks, Scout 109B is worse than Llama 3.3 70B...

3

u/Aggressive-Physics17 1d ago

Saying that 4 Scout is worse on benchmarks than 3.3 70B isn't accurate, because:

MMMU / MMMU-Pro / MathVista / ChartQA / DocVQA:
69.4% / 52.2% / 70.7% / 88.8% / 94.4% (Llama 4 Scout)
Not multimodal (Llama 3.3 70B & Llama 3.1 405B)

LiveCodeBench (pass@1):
33.3% (Llama 3.3 70B) - +1.5% relative over 4 Scout
32.8% (Llama 4 Scout)

MMLU-Pro:
74.3% (Llama 4 Scout) - +1.4% relative over 3.1 405B
73.3% (Llama 3.1 405B) - +6.4% relative over 3.3 70B
68.9% (Llama 3.3 70B)

GPQA Diamond:
57.2% (Llama 4 Scout) - +12.8% relative over 3.1 405B
50.7% (Llama 3.1 405B) - +0.4% relative over 3.3 70B
50.5% (Llama 3.3 70B)
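(The "+x% over" deltas are relative ratios, not absolute percentage points - quick check:)

```python
# Verifying the deltas above are relative ratios, not point differences.
print(f"{(33.3 / 32.8 - 1) * 100:.1f}%")  # 1.5%  (LiveCodeBench)
print(f"{(74.3 / 73.3 - 1) * 100:.1f}%")  # 1.4%  (MMLU-Pro)
print(f"{(57.2 / 50.7 - 1) * 100:.1f}%")  # 12.8% (GPQA Diamond)
```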

-1

u/Healthy-Nebula-3603 1d ago

But you're not considering that Scout is ~50% bigger and much newer, so it should be more advanced

0

u/RMCPhoto 4h ago

Scout is a MoE model with lower computational cost thanks to 17B active parameters, so it's easier for the big companies to host. The 10M context is also interesting and could unlock some very unique use cases if it's functional and has an efficient caching method (think: the entry point to a RAG system, with a large reasoning model as the output step).
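For scale, here's rough KV-cache arithmetic at 10M tokens; the layer and head counts are assumptions for illustration, not Llama 4's published architecture:

```python
# Rough KV-cache memory at a 10M-token context. The GQA config below
# is an assumption for illustration, not Llama 4's actual architecture.
layers, kv_heads, head_dim = 48, 8, 128   # assumed
seq_len, bytes_per = 10_000_000, 2        # bf16 keys/values

# 2x for keys and values, per layer, per KV head, per head dim.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per
print(f"KV cache ~ {kv_bytes / 1e12:.1f} TB")  # ~2.0 TB at these numbers
```

So naive full attention at that length needs terabytes of cache per sequence; presumably something like chunked attention or KV compression is doing the heavy lifting.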

Llama 4 is an improvement over Llama 3.

Llama 3.3 also came out in December 2024. Llama 3 came out in April 2024.

Llama 3.3 is a significant improvement, and I suspect there will be some big improvements to Llama 4 soon enough.

It's possible to run Llama 4 on a CPU and RAM for less money and lower electricity costs than it would take to run Llama 3.3 on 2x 3090s or more.

-4

u/Present-Boat-2053 1d ago

Look at this 100-year-old GPT-4o 11-20 being better

6

u/iperson4213 1d ago

4o 2024-11-20, as the name suggests, was released in November 2024, under 5 months ago

2

u/Defiant-Mood6717 1d ago

5 months is 100 years in AI time