jychang 2 days ago

Coolest part of Qwen3-Next, in my opinion (after the linear attention parts), is that they do MTP without adding another un-embedding matrix.

Deepseek R1 also has an MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...

But Deepseek R1 adds embed_tokens and shared_head.head tensors, which are [129280, 7168] or about 2GB in size at FP8.

Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob...

So it saves a few GB in active parameters for MTP, which is a Big Deal. This is one of the changes that significantly speeds up inference.

  • puilp0502 2 days ago

    What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

    • jychang 2 days ago

      Speculative decoding! It makes inference a LOT faster.

      Instead of generating tokens one at a time, you generate the second one as well, and then use speculative decoding on that second token (instead of having it be produced by a draft model like Qwen 0.6b). If the token is checked and is correct, then the 2nd token gets generated MUCH faster.

      If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually, it's correct, so inference is a lot faster.

      • stingraycharles a day ago

        Because then the second token only needs to be checked, not generated, as it’s already generated? And it’s much faster to generate multiple tokens at the same time than one at a time? Is that the idea?

        I’m not an expert on LLMs, just a user.

        • tomp a day ago

          No, the parent is wrong.

          Checking a token is the same as generating it.

          The benefit however is in the next (third) token. After generating tokens 1 and 2 (in one turn), you start generating token 3 (and 4). You also get the “real” prediction for token 2. If the “real” prediction matches the MTP (Multi-Token Prediction) from the previous turn, you have just generated 3 correct tokens (and another speculative one). If not, you’ve now corrected token 2, but token 3 is wrong (it follows the wrong token 2) so you need to generate it again.
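
          Roughly, as a toy Python sketch of that loop (greedy decoding, one MTP guess per step; forward() and forward_with_draft() are made-up stand-ins for the real model calls, not any actual API):

              # forward(ctx)               -> (next_token, mtp_guess_for_the_one_after)
              # forward_with_draft(ctx, g) -> (real_token_for_g_slot, token_after_g, new_mtp_guess)
              def generate(prompt_tokens, forward, forward_with_draft, max_len=256):
                  ctx = list(prompt_tokens)
                  spec = None                        # current speculative (MTP) token, if any
                  while len(ctx) < max_len:
                      if spec is None:
                          nxt, spec = forward(ctx)   # plain step: 1 real token + 1 guess
                          ctx.append(nxt)
                      else:
                          real2, real3, new_spec = forward_with_draft(ctx, spec)
                          if real2 == spec:          # guess verified:
                              ctx += [spec, real3]   # 2 tokens accepted from 1 pass
                              spec = new_spec
                          else:                      # guess rejected: keep the corrected
                              ctx.append(real2)      # token, discard everything after it
                              spec = None
                  return ctx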

          • bigwheels a day ago

            Thanks for the clarification. Your comment made me connect the similarity (in spirit) of Speculative Decoding to Speculative Execution [1] in CPUs. Very cool and clever optimization strategy for LLMs, IMHO.

            [1] https://en.wikipedia.org/wiki/Speculative_execution

            Does it work to predict tokens 3 and 4 (or 5, 6) in the same way? I wonder how extreme the hit rate drop-off is.

        • bdcs a day ago

          It relies on an “unintuitive observation”[0] that you can run batches basically for free (up to a limit). So if you only run one inference, you batch it plus a lot of guesses and, if you guess right, can speed up the inference by the number of guesses. If you guess wrong, you're back to regular speed (and still fully correct).

          [0] https://x.com/karpathy/status/1697318534555336961

        • namibj a day ago

          Basically you can generate the next two tokens at once in the same matmul, and roll back to one-at-a-time when the check says you guessed wrong (as that means the second token of the pair was generated from revoked context).

        • Zacharias030 16 hours ago

          yes, if you know the sequence of tokens ahead of time you can verify them about as quickly as you can generate one more token because of the parallelism benefits.

          If you don’t know the future tokens though, then you can’t, and blind guessing of tokens is infeasible because the vocabulary contains circa 100k possible different tokens.

      • moffkalast 2 days ago

        Hmm but isn't the checking only required because the draft model is not the same model and can only speculate what the main one is thinking, hence the name? If the main model generates two tokens itself, then how can it be wrong about its own predictions?

        • jychang a day ago

          Because if you generate token n+1 with all 48 layers of Qwen3-Next and 80 billion params, and also generate token n+2 with the 1 MTP layer at 2bil params... that n+2 token can be much lower quality than the n+1 token but mostly correct.

          Let's say you have a model that generates the string "The 44th president of the United States is ___ ___". Your model will generate "Barack" as the n+1 token, and the MTP layer probably does a good enough job to generate "Obama" as the n+2 token (even though that MTP layer is a mere <2bil parameters in size). Then you just check if "Obama" is correct via the same speculative decoding process, which is a lot faster than if you had to start over from layer 1-48 and generate "Obama" the regular way.

          • littlestymaar a day ago

            > Then you just check if "Obama" is correct via the same speculative decoding process, which is a lot faster than if you had to start over from layer 1-48 and generate "Obama" the regular way.

            That doesn't match my understanding of what speculative decoding does: AFAIK with regular speculative decoding you ask a smaller llm to infer the next few tokens (let's say 5 tokens) and then you can have the big model infer tokens 1, 2, 3, 4, 5 and 6 in parallel (each time starting from the sentence partially completed by the smaller model). Because llms are bandwidth bound, doing the same work six times in parallel isn't slower than doing it only once (what's costly is moving the massive model weights between VRAM and the GPU cores).

            If tokens 1, 2 and 3 match what the small model inferred, then you keep them. As soon as you have a mismatched token (say token 4), it means that you have to discard the next inferred tokens (here tokens 5 and 6) because they were calculated under a wrong assumption for token 4.

            So if the MTP layer merely replaces the smaller llm in the previous scheme, with everything else working the same way, you wouldn't save anything when inferring “Obama” (you'd still need to “generate it the regular way”, as there isn't really another way), but you could start working on the word immediately after “Obama” by assuming “Obama” was already chosen. And if the model actually outputted “Hussein” instead of “Obama”, then the token calculated to come after “Obama” would have to be discarded.

            Or maybe my understanding of speculative decoding is completely off…

            • vman512 a day ago

              Sounds right. The policy for rejection can depend on what you want - you might accept the top K highest probability tokens or top P probability mass. Or you can do something like importance sampling and probabilistically reject based on the ratio of likelihoods.
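
              For reference, the accept/reject rule from the speculative sampling papers looks roughly like this for a single draft token (a sketch, not any particular library's API; p_target and q_draft are the two models' probability distributions over the vocabulary at that position):

                  import numpy as np

                  def accept_or_resample(draft_token, p_target, q_draft, rng=np.random):
                      # accept the draft with probability min(1, p/q)
                      if rng.random() < min(1.0, p_target[draft_token] / q_draft[draft_token]):
                          return draft_token, True
                      # on rejection, resample from the leftover mass max(0, p - q),
                      # renormalized; this keeps the output distributed exactly like
                      # the target model's own samples
                      residual = np.maximum(p_target - q_draft, 0.0)
                      residual /= residual.sum()
                      return rng.choice(len(residual), p=residual), False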

        • SonOfLilit 2 days ago

          If you ask me to guess an answer, I'll _usually_ produce the same answer as if I had time to think about it deeply, but not always...

        • EMM_386 a day ago

          I believe it's something along these lines. The MTP head runs simultaneously and generates a probability list based on what it thinks the results will be, learned during training.

              If n+1 = "Barack"    then n+2 = "Obama"  (confidence: 0.90)
              If n+1 = "The"       then n+2 = "quick"  (confidence: 0.45)
              If n+1 = "President" then n+2 = "Biden"  (confidence: 0.75)

          A threshold is set (say, 90%) so that if the n+2 prediction is above it (as in the first example), it is used without having to determine it with the main model. It's confident "enough".

          • namibj a day ago

            Well yeah; also inference benefits massively from batching, so you use the guesses to prefill the context needed to infer the next speculated tokens, and if the guesses were wrong, you just have to re-compute the speculated ones that depended on the guessed context.

            You compute the next token and guess the one after; then you take the guess as real and, in the same pass, run inference for the guessed token while speculating on the one after it, which depends on the guess being correct.

        • eldenring a day ago

          the 2nd token is generated without knowing what token was chosen for the 1st token

    • rfoo 2 days ago

      It could be a better draft model than separately trained EAGLE etc for speculative decoding.

    • cubefox a day ago

      > What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

      It is only useful for inference and doesn't help with pretraining. Which actually points to speculative decoding not being sufficiently general, as the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...

      • Zacharias030 15 hours ago

        There is no reason that it couldn’t be beneficial for training though.

        • cubefox 7 hours ago

          Except that speculative decoding is de facto only an inference time optimization. But the H-Net architecture from the previous reference, which doesn't require tokens or speculative decoding, does something similar both for inference and training.

  • humblyCrazy a day ago

    How is MTP different from Medusa heads? Also, does this mean this model comes "natively" with speculative decoding - meaning if I use this model in vllm, its throughput should be higher because it is already doing MTP, so it should be able to take advantage of speculative decoding?

  • Razengan a day ago

    Could someone kindly point to a convenient all-on-one ELI5 of all these words? :')

    • vessenes a day ago

      Unfortunately, no. The industry is moving super quickly, and spinning up new ideas on the backs of old ones at a fast rate. If you want to understand what's going on, I think the best thing to do is some intro courses, train and design some smaller models directly, get a list of core papers and concepts from Claude/Chat/Gemini, and then as you read something like this, if you don't know the acronym (In this case: MTP = Multi Token Prediction), search it up, and see if you have the basis for understanding what it's about. If not read up on the precursors.

      Unlike many disciplines, AI is an arena that doesn't have a lot of intuitive simplified models that are accurate -- most of the simplified models available do not accurately describe what's going on enough to reason about and understand them. So, you just have to start reading!

      • littlestymaar a day ago

        > Unfortunately, no. The industry is moving super quickly, and spinning up new ideas on the backs of old ones at a fast rate.

        I don't think it moves this fast.

        I mean there are very few fundamental differences between GPT-2 and gpt-oss-120b; it's just incremental improvements that don't change much of the full picture (a variation on the attention architecture and masking, a different activation function, different positional encoding, and swapping the MLP layers for a sparse “mixture of experts”). At the end of the day, from Mistral to Deepseek through Llama and Qwen3, it's always the same stack of transformer layers with slight variations between two architectures.

        This Qwen3-Next is special though, as it's the first time a major player is releasing something that different (lesser players have made hybrid architecture LLMs for the past two years, but when it comes to language models, IBM really isn't comparable to Alibaba). This is what I expected Llama4 to be.

    • porridgeraisin a day ago

      Background:

      LLMs take your input, upscale it into a very high dimensional space, and then downscale it back to 1D at the end. This 1D list is interpreted as a list of probabilities -- one for each word in your vocabulary. i.e f(x) = downscale(upscale(x)). Each of downscale() and upscale() are parameterized (billions of params). I see you have a gamedev background, so as an example: bezier curves are parameterized functions where bezier handles are the parameters. During training, these parameters are continuously adjusted so that the output of the overall function gets closer to the expected result. Neural networks are just really flexible functions for which you can choose parameters to get any expected result, provided you have enough of them (similar to bezier curves in this regard).
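
      In toy numpy terms (sizes made up), that final downscale-to-probabilities step looks something like this:

          import numpy as np

          hidden, vocab = 8, 10                      # toy sizes
          h = np.random.randn(hidden)                # output of upscale(...) for the last position
          W_down = np.random.randn(vocab, hidden)    # the "downscale" (un-embedding) matrix

          logits = W_down @ h                        # one score per vocabulary word
          probs = np.exp(logits - logits.max())
          probs /= probs.sum()                       # softmax: scores -> probabilities
          next_word_id = probs.argmax()              # greedy pick of the next word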

      ---

      When training, you make an LLM learn that

      I use arch = downscale(upscale(I use))

      If you want to predict the next word after that, you do next in sequence the following:

      I use arch btw = downscale(upscale(I use arch))

      Now, multi-token prediction is having two downscale functions, one for each of the next two words, and learning it that way, basically, you have a second downscale2() that learns how to predict the next-to-next word.

      i.e in parallel:

      I use arch = downscale1(upscale(I use))

      I use ____ btw = downscale2(upscale(I use))

      However, this way you'll need twice the number of parameters downscale needs. And if you want to predict more tokens ahead you'll need even more parameters.

      What Qwen has done, is instead of downscale1 and downscale2 being completely separately parameterized functions, they set downscale1(.) = lightweight1(downscale_common(.)) and downscale2(.) = lightweight2(downscale_common(.)). This is essentially betting that a lot of the logic is common and the difference between predicting the next and next-to-next token can be captured in one lightweight function each. Lightweight here, means less parameters. The bet paid off.

      So overall, you save params.

      Concretely,

      Before: downscale1.params + downscale2.params

      After: downscale_common.params + lightweight1.params + lightweight2.params

      Edit: it's actually downscale_common(lightweight()) and not the other way around as I have written above. Doesn't change the crux of the answer, but just including this for clarity.
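
      A back-of-the-envelope version of the saving, with made-up round numbers (not the real vocab/hidden sizes):

          vocab, hidden = 150_000, 2_048

          # naive: a full vocab-sized downscale matrix per predicted position
          naive = 2 * (vocab * hidden)

          # shared: one common vocab projection plus a small adapter per position,
          # i.e. logits_k = downscale_common(lightweight_k(h))
          shared = vocab * hidden + 2 * (hidden * hidden)

          print(f"naive:  {naive / 1e6:.0f}M params")
          print(f"shared: {shared / 1e6:.0f}M params")  # roughly half here, and the
                                                        # gap grows with more positions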

      • pmarreck a day ago

        so after your edit it would be (just to clarify):

            I use ____ ___ = downscale_common(lightweight1(.)) + downscale_common(lightweight2(.)) ?
        
        And does it generate 2 at a time and keep going that way, or is there some overlap?

        • porridgeraisin a day ago

          You generate blocks of 2 at a time yes. In general, k. As you can imagine, larger k performs worse. LLM(I like cats) is very likely to continue with "because they", but beyond that, there's too many possibilities. LLM(I like cats because they are) = small and cute and they meow, while LLM(I like cats because they eat) = all the rats in my garden.

          If you try to predict the whole thing at once you might end up with

          I like cats because they are all the rats and they garden

          > Overlap

          Check out an inference method called self-speculative decoding which solves(somewhat) the above problem of k-token prediction, which does overlap the same ___ across multiple computations.

      • losvedir a day ago

        Ooooh, neat! That was very well explained, thank you.

      • Razengan a day ago

        > I see you have a gamedev background

        Thanks for the tailored response! ^^

      • fortyseven a day ago

        Dude, this was like that woosh of cool air on your brain when an axe splits your head in half. That really brought a lot of stuff into focus.

    • wickedsight a day ago

      For me, ChatGPT or any of the other current thinking models are very useful for this type of stuff. I just ask to explain it on my level and then I can ask questions for clarification.

    • pmarreck a day ago

      The following was generated by chatG5:

          Qwen3-Next — A family of large language models from Qwen (Alibaba).  
          DeepSeek R1 — Another large open-source language model from DeepSeek AI.  
          Linear attention — A type of transformer attention that scales linearly with sequence length, making long-context processing cheaper.  
          MTP (Multi-Token Prediction) — Training/inference trick where the model predicts multiple future tokens at once, speeding things up.  
          Embedding — Converts words/tokens into vectors (numbers) the model can work with.  
          Un-embedding — The reverse step: mapping the model’s internal vector back into tokens.  
          embed_tokens — The big lookup table of embeddings (token → vector).  
          shared_head.head tensors — Extra weight matrices used for prediction; they can be huge.  
          [129280, 7168] — The shape of such a tensor: ~129k rows (tokens in the vocab) × 7k columns (hidden dimension).  
          FP8 — Floating-point format using 8 bits (compact, faster, less precise).  
          Active parameters — The weights that actually need to be loaded in GPU memory to run the model.  
          Inference — Running the model to generate text (as opposed to training it).  
          GB savings — If you avoid duplicating giant matrices, you save GPU memory and speed things up.

Alifatisk a day ago

Alibaba keeps releasing gold content

I just tried Qwen3-Next-80B-A3B on Qwen chat, and it's fast! The quality seems to match Qwen3-235B-A22B. Quite impressive how they achieved this. Can't wait for the benchmarks at Artificial Analysis.

According to Qwen Chat, Qwen3-Next has the following limits:

Maximum context length: 262,144 tokens

Max summary generation length: 32,768 tokens

This is 2x higher on context length and 4x higher on summary generation compared to Qwen3-235B-A22B, damn

> Qwen3-Next [...] excels in ultra-long-context understanding and complex tasks

Even though their new hybrid architecture is fascinating, I think I'll continue to stick with Qwen2.5-Turbo because it's one of the few models that supports 1M tokens in context length. My use case is uploading large pdfs and asking questions across chapters.

  • gizmodo59 a day ago

    My take on long context for many frontier models is that it's not about support but accuracy, which drops drastically as you increase the context. Even if a model claims to support 10M context, the reality is it doesn't perform well when you saturate it. Curious to hear others' perspective on this.

    • kridsdale3 a day ago

      This is my experience with Gemini. Yes, I really can put an entire codebase and all the docs and pre-dev discussions and all the inter-engineer chat logs in there.

      I still see the model becoming more intoxicated as turn count gets high.

      • patates a day ago

        I use repomix to pack a full repository as an xml file and it works wonders. System prompt is very simple:

        please don't add any comments in the code unless explicitly asked to, including the ones that state what you changed. do not modify/remove any existing comments as long as they are valid. also output the full files that are changed (not the untouched ones), and no placeholders like "no change here" etc. do not output the xml parts in the output.xml file. focus on the individual files. before and after outputting code, write which file it would be and the path (not as a comment in the code but instead, before and after outputting code).

        Attached is a 400k token xml file, being the output of:

        https://pastebin.com/raw/SH6JHteg

        Main prompt is a general description of the feature needed and PDF exports from figma.

        All done for free in aistudio and I consistently get better results than the people using claude code.

    • vessenes a day ago

      Agreed. That said, in general a 1M context model has a larger usable window than a 260k context model.

  • pilotneko a day ago

    If you read the model card, Qwen3-Next can be extended to 1M context length with YaRN.

    > Qwen3-Next natively supports context lengths of up to 262,144 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 1 million tokens using the YaRN method.

    Source: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct#proc...

    • Alifatisk a day ago

      > If you read the model card, Qwen3-Next can be extended to 1M context length with YaRN.

      I read the article, but as I said Qwen chat only provides up to 262k tokens in context length, so I'll stick with Qwen2.5 Turbo which supports 1M tokens.

      I am not in a position where I can self-host yet

  • davidweatherall a day ago

    Their proprietary models are very good too and go under the radar; they never seem to appear on any benchmarks. Qwen3-coder-plus is significantly better than their open source qwen3, and Qwen3 Max also rivals the SOTA models.

    • ehsanu1 a day ago

      Are these actually different models vs just different names from the open weights releases?

      • Havoc a day ago

        They generally match, except I don't think the Max ones have releases

  • cpursley a day ago

    How are you prepping the PDF data before shoving it into Qwen?

    • Alifatisk a day ago

      I just compress the file size as low as possible without losing the quality, didn't even know there were more ways to prep it.

      I do sometimes chop up the PDF into smaller pdfs with their own individual chapters

      • amelius a day ago

        On Linux you can use pdftotext also if you are only concerned with the text.

    • navbaker a day ago

      Not OP, but we use the docling library to extract text and put it in markdown before storing for use with an LLM.

irthomasthomas 2 days ago

llm -m qwen3-next-80b-a3b-thinking "An ASCII of spongebob"

Here's a classic ASCII art representation of SpongeBob SquarePants:

     .------.
    /  o o  \
   |         |
   |  \___/  |
    \_______/


  llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  "An ASCII of spongebob"

  Here's an ASCII art of SpongeBob SquarePants:

      .--..--..--..--..--..--.
    .' \  (`._   (_)     _   \
  .'    |  '._)         (_)  |
  \ _.')\      .----..--.'  /
  |(_.'  |    /    .-\-.  \---.
  \     0|    |   ( O| O) |    |
   |  _  |  .--.____.'._.-.  |
   \ (_) | o         -` .-`  |
    |    \   |`-._ _ _ _ _\ /
    \    |   |  `. |_||_|   |
     | o |    \_       \    |     -.   .-.
     |.-.  \    `--..-'   O |     `.`-' .'
   _.'  .' |     `-.-'      /-.__   ' .-'
  .' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
  `-._  `.  |________/\_____|    `-.'
     .'   ).| '=' '='\/ '=' |
     `._.`  '---------------'
             //___\   //___\
               ||       ||
               ||_.-.   ||_.-.
              (_.--__) (_.--__)


Meta: I generated a few dozen spongebobs last night on the same model and NONE were as good as this. Most started well but collapsed into decoherence at the end - missing the legs. Then this morning the very same prompt to the same model API produced a perfect bob on the first attempt. Can utilization affect response quality, if all else remains constant? Or was it just random luck?

Edit: Ok, the very next attempt, a few minutes later, failed, so I guess it is just random, and you have about a 1 in 10 chance of getting a perfect spongebob from qwen3-coder, and ~0 chance with qwen3-next.

  • dev_hugepages 2 days ago

    • irthomasthomas 2 days ago

      Naturally. That's how LLMs work. During training you measure the loss, the difference between the model output and the ground-truth and try to minimize it. We prize models for their ability to learn. Here we can see that the large model does a great job at learning to draw bob, while the small model performs poorly.

      • ACCount37 a day ago

        We don't value LLMs for rote memorization though. Perfect memorization is a long solved task. We value LLMs for their generalization capabilities.

        A scuffed but fully original ASCII SpongeBob is usually more valuable than a perfect recall of an existing one.

        One major issue with highly sparse MoE is that it appears to advance memorization more than it advances generalization. Which might be what we're seeing here.

      • endymion-light 2 days ago

        I'd argue that actually, the smaller model is doing a better job at "learning" - in that it's including key characteristics of SpongeBob within an ascii image, even if a poor one.

        The larger model already has it in the training corpus, so it's not a particularly good measure though. I'd much rather see the capabilities of a model in trying to represent in ascii something that it's unlikely to have in its training.

        Maybe a pelican riding a bike as ascii for both?

      • mdp2021 a day ago

        > That's how LLMs work

        And that is also exactly how we want them not to work: we want them to be able to solve new problems. (Because Pandora's box is open, and they are not sold as a flexible query machine.)

        "Where was Napoleon born": easy. "How to resolve the conflict effectively": hard. Solved problems are interesting to students. Professionals have to deal with non trivial ones.

        • dingnuts a day ago

          > how we want them not to work

          speak for yourself, I like solving problems and I'd like to retire before physical labor becomes the only way to support yourself

          > they are not sold as a flexible query machine

          yeah, SamA is a big fucking liar

          • mdp2021 8 hours ago

            I get your fear, d., but I am afraid we urgently need them tools, and to work properly. At some point in time the gap between workforce and objectives forced us to adopt cranes; at this point in time I see that "the carbon" is not "competing" enough. An IQ boost in the toolbox, when we will finally reach it, will be an enabler: for doom in the hands of fools, for the best in the hands of the wise - proportions worrisome but the game is not decided.

            Meanwhile, there is no turning back and as the mockery of intelligence was invented, the Real Thing must be urgently found.

            Edit: I have just read the title "Amateurish plan exposed failing diplomacy". The giants' list includes McNamara, Kissinger, Brzezinski: if some say that their efforts have not been sufficient - and failures are very costly -, what do we need?

      • WhitneyLand a day ago

        Not really.

        Typically less than 1% of training data is memorized.

    • ricardobeat a day ago

      For the model to have memorized the entire sequence of characters precisely, this must appear hundreds of times in the training data?

    • ginko 2 days ago

      Conveniently removed the artist's signature though.

      • irthomasthomas 2 days ago

        Yes - they all do that. Actually, most attempts start well but unravel toward the end.

          llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
          "An ASCII of spongebob"
          Here's an ASCII art of SpongeBob SquarePants:
          
          ```
              .--..--..--..--..--..--.
            .' \  (`._   (_)     _   \
          .'    |  '._)         (_)  |
          \ _.')\      .----..--.   /
          |(_.'  |    /    .-\-.  \
          \     0|    |   ( O| O) |
           |  _  |  .--.____.'._.-.
           /.' )  | (_.' .-'"`-. _.-._.-.--.-.
          / .''.  |  .' `-. .-'-. .-'"`-.`-._)
           .'.' |  |   |  |  |  |  |  |  |  |
          .'.'   |  |   |  |  |  |  |  |  |  |
          .'.'   |  |   |  |  |  |  |  |  |  |
          .'.'   |  |   |  |  |  |  |  |  |  |
          .'.'   |  |   |  |  |  |  |  |  |  |
          .'.'   |  |   |  |  |  |  |  |  |  |
          ```
        • cbm-vic-20 a day ago

          Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.

          • eMPee584 a day ago

            Yes, Mylord! We'll find and destroy all of that damn shrubbery!

          • rwoerz a day ago

            He is dead??

        • mlvljr a day ago

          Going through shredder

      • eurekin 2 days ago

        Certainly not defending LLMs here, don't mistake with that.

        Humans do it too. I have given up on my country's non-local information sources, because I could recognize original sources that are being deliberately omitted. There's a satiric webpage that is basically a reddit scrape. Most of users don't notice and those who do, don't seem to care.

        • yorwba 2 days ago

          Yes, the most likely reason the model omitted the signature is that humans reposted more copies of this image omitting the signature than ones that preserve it.

  • matchcc a day ago

    I think there is some distillation relationship between Kimi K2 and Qwen Coder or other related models, or the same training data. I tried most LLMs; only Kimi K2 gave the exact same ASCII. Kimi K2: Here’s a classic ASCII art of SpongeBob SquarePants for you:

               .--..--..--..--..--..--.
            .' \  (`._   (_)     _   \
          .'    |  '._)         (_)  |
          \ _.')\      .----..---.   /
          |(_.'  |    /    .-\-.  \  |
          \     0|    |   ( O| O) | o|
           |  _  |  .--.____.'._.-.  |
           \ (_) | o         -` .-`  |
            |    \   |`-._ _ _ _ _\ /
            \    |   |  `. |_||_|   |
            | o  |    \_      \     |     -.   .-.
            |.-.  \     `--..-'   O |     `.`-' .'
          _.'  .' |     `-.-'      /-.__   ' .-'
        .' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
        `-._  `.  |________/\_____|    `-.'
           .'   ).| '=' '='\/ '=' |
           `._.`  '---------------'
                   //___\   //___\
                     ||       ||
                     ||_.-.   ||_.-.
                    (_.--__) (_.--__)
    Enjoy your SpongeBob ASCII!

    • nakamoto_damacy a day ago

      For ascii to look right, not messed up, the generator has to know the width of the div in ascii characters, e.g. 80, 240, etc, so it can make sure the lines don't wrap. So how does an LLM know anything about the UI it's serving? Is it just luck? What if you ask it to draw something that's like 16:9 in aspect ratio... would it know to scale it down so lines won't wrap? How about loss of detail if it does? Also, is it as good with Unicode art? So many questions.

      • Leynos a day ago

        They don't see runs of spaces very well, so most of them are terrible at ASCII art. (They'll often regurgitate something from their training data rather than try themselves.)

        And unless their terminal details are included in the context, they'll just have to guess.

        • kingstnap a day ago

          Runs of spaces of many different lengths are encoded as a single token. It's not actually inefficient.

          In fact everything from ' ' up to a run of 79 spaces (' ' * 79) has a single token assigned to it in the OpenAI GPT-4 tokenizer. Sometimes a run of spaces plus '\n' is also assigned a single token.

          You might ask why they do this, but it's to make programming work better by reducing token counts. All whitespace before the code gets jammed into a single token and entire empty lines also get turned into a single token.

          There are actually lots of interesting hand crafted token features added which don't get discussed much.
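
          If anyone wants to poke at this, OpenAI's tiktoken library makes it easy to check (assuming it's installed; cl100k_base is the GPT-4 tokenizer):

              import tiktoken

              enc = tiktoken.get_encoding("cl100k_base")
              for n in (1, 4, 16, 64):
                  print(n, "spaces ->", len(enc.encode(" " * n)), "token(s)")
              # leading indentation on a code line is also not one-token-per-space:
              print(enc.encode("        return x"))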

  • irthomasthomas a day ago

    I realize my SpongeBob post came off flippant, and that wasn't the intent. The Spongebob ASCII test (picked up from Qwen's own Twitter) is explicitly a rote-memorization probe; bigger dense models usually ace it because sheer parameter count can store the sequence.

    With Qwen3's sparse-MoE, though, the path to that memory is noisier: two extra stochastic draws, (a) which expert(s) fire, and (b) which token gets sampled from them. Add the new gated-attention and multi-token heads and you've got a pipeline where a single routing flake or a dud expert can break vertical alignment halfway down the picture.

    Anyway, I think qwen3-coder was uniquely trained on this - so it's not a fair comparison. Here are some other qwen3 models:

    Model: chutes/Qwen/Qwen3-235B-A22B

       /~\       
      (  *  *  )  
      (  o o o  )  
       \  -  /     
        \  /\  /   
         \  /      
          \/       
         /|||\      
        /|||||\     
       /||||||||\   
      ( o   o   o ) 
       \   W   /   
        \___/
        
    Model: chutes/Qwen/Qwen3-235B-A22B-Instruct-2507

        /\_/\  
       ( o.o ) 
        > ^ <  
       /     \ 
      |       |
      |       |
       \     /
        '-'-'
    
    Model: chutes/Qwen/Qwen3-235B-A22B-Thinking-2507

      .-----------.
      | []     [] |
      |           |
      |   __  __  |
      |  |  ||  | |
      |  |__||__| |
      |           |
      '-----------'
    
    
    Model: chutes/Qwen/Qwen3-Next-80B-A3B-Instruct

            __
           /  \
          /    \
         /      \
        /        \
       /__________\
      |  o      o  |
      |     __     |
      |  \____/    |
      |            |
      |  ________  |
      |  \      /  |
      |   \____/   |
      |            |
      |____________|
      
         ___________
        /           \
       /             \
      |   _______     |
      |  |       |    |
      |  |  ___  |    |
      |  | |   | |    |
      |  | |___| |    |
      |  |_______|    |
      |               |
      |_______________|
    
    
    Model: chutes/Qwen/Qwen3-Next-80B-A3B-Thinking

      .-.
       /   \
      |  o o|
      |  >  |
      |  ---|
       \___/
      
    
    Model: chutes/Qwen/Qwen3-30B-A3B-Instruct-2507

        _________________________
       /                         \
      |   _     _     _     _   |
      |  / \   / \   / \   / \  |
      | |   | |   | |   | |   | |
      |  \_/   \_/   \_/   \_/  |
      |                         |
      |   _     _     _     _   |
      |  / \   / \   / \   / \  |
      | |   | |   | |   | |   | |
      |  \_/   \_/   \_/   \_/  |
      |                         |
      |    SpongeBob SquarePants   |
      |_________________________|

syntaxing 2 days ago

The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we’ve had before and runs faster than a 14B model, depending on how you offload across VRAM and CPU. That’s insane.

  • moffkalast 2 days ago

    In retrospect it's actually funny that last year Meta spent so many resources training a dense 405B model that both underperforms compared to models a tenth its size and is impossible to run at a reasonable speed on any hardware in existence.

    • jychang a day ago

      Strong disagree.

      Llama 4's release in 2025 is (deservedly) panned, but Llama 3.1 405b does not deserve that slander.

      https://artificialanalysis.ai/#frontier-language-model-intel...

      Do not compare 2024 models to the current cutting edge. At the time, Llama 3.1 405b was the very first open source (open weights) model to come close to the closed source cutting edge. It was very very close in performance to GPT-4o and Claude 3.5 Sonnet.

      In essence, it was Deepseek R1 before Deepseek R1.

      • seunosewa a day ago

        He is definitely talking about Llama4.

        • lcnPylGDnU4H9OF a day ago

          > last year

          > dense

          > 405B model

          Llama4 does not match any of these details. Maybe the commenter thinks their comment is about Llama4 (I don't see a reason to believe so) but readers familiar with these details know they are referring to Llama3.1.

        • jug a day ago

          Llama 4 is neither from last year nor a dense model.

    • NitpickLawyer a day ago

      It's not that clear. Yes, it underperforms in recent benchmarks and use cases (i.e. agentic stuff), but it is still one of the strongest open models in terms of "knowledge". Dense does have that advantage over MoE, even if it's extremely expensive to run inference on.

      Check out this great exercise - https://open.substack.com/pub/outsidetext/p/how-does-a-blind...

      • moffkalast a day ago

        Ok wow that is incredibly interesting, what a test. I would've honestly expected just random noise (like if you gave this same task to a human, lol) but you can even see related models draw similar results. Maybe it is an indicator of overall knowledge, or how consistent the world model is. It also might not correlate at all with non-geographical knowledge.

  • halJordan a day ago

    Qwen isn't directing the forward progress of llms. SOTA llms have been moe since gpt-4. The og 4.

    Out of context, but i honestly hate how HN let itself get so far behind the times that this is the sort of inane commentary we get on AI.

    • refulgentis a day ago

      I would venture to suggest that to read it as "Qwen made MoEs in toto || first || better than anyone else" is reductive - merely, the # of experts and #s here are quite novel (70b...inferencing only 3b!?!) - I sometimes kick around the same take, but, thought I'd stand up for this. And I know what I'm talking about, I maintain a client that wraps llama.cpp x ~20 models on inference APIs

mmmllm a day ago

The same week Oracle is forecasting huge data center demand and the stock is rallying. If these 10x gains in efficiency hold true then this could lead to a lot less demand for Nvidia, Oracle, Coreweave etc

  • amelius a day ago

    • mmmllm a day ago

      Sure but where is the demand going to come from? LLMs are already in every google search, in Whatsapp/Messenger, throughout Google workspace, Notion, Slack, etc. ChatGPT already has a billion users.

      Plus penetration is already very high in the areas where they are objectively useful: programming, customer care etc. I just don't see where the 100-1000x demand comes from to offset this. Would be happy to hear other views.

      • philipp-gayret a day ago

        If LLMs were next to free and faster I would personally increase my consumption 100x or more, and I'm only in the "programming" category.

      • jjcm a day ago

        As plenty of others have mentioned here, if inference were 100x cheaper, I would run 200x inference.

        There are so many things you can do with long running, continuous inference.

        • sipjca a day ago

          but what if you don't need to run it in the cloud

          • ukuina a day ago

            You will ALWAYS want to use the absolute best model, because your time is more valuable than the machine's. If the machine gets faster or more capable, your value has jumped proportionally.

      • vessenes a day ago

        We are nearly infinitely far away from saturating compute demand for inference.

        Case in point; I'd like something that realtime assesses all the sensors and API endpoints of stuff in my home and as needed bubbles up summaries, diaries, and emergency alerts. Right now that's probably a single H200, and well out of my "value range". The number of people in the world that do this now at scale is almost certainly less than 50k.

        If that inference cost went to 1%, then a) I'd be willing to pay it, and b) there'd be enough of a market that a company could make money integrating a bunch of tech into a simple deployable stack, and therefore c) a lot more people would want it, likely enough to drive more than 50k H200s worth of inference demand.

        • mtone a day ago

          Do you really need a H200 for this? Seems like something a consumer GPU could do. Smaller models might be ideal [0] as they don't require extensive world knowledge and are much more cost efficient/faster.

          Why can't you build this today?

          [0]: https://arxiv.org/pdf/2506.02153 Small Language Models are the Future of Agentic AI (Nvidia)

        • OtherShrezzing a day ago

          Is all of that not achievable today with things like Google Home?

          It doesn’t sound like you need to run a H200 to bridge the gap between what currently exists and the outcome you want.

        • mmmllm a day ago

          Sure but if that inference cost went to 1%, then Oracle and Nvidia's business model would be bust. So you agree with me?

        • taminka a day ago

          absolutely nobody wants or needs a fucking thermostat diary lmao, and the few ppl that do will have zero noticeable impact on world's compute demands, i'm begging ppl in on hn to touch grass or speak to an average person every now and then lol

          • arscan a day ago

            it's pretty easy to dispute and dismiss a single use case for indiscriminate/excessive use of inference to achieve some goal, as you have done here, but it's hard to dispute every possible use case

          • pessimizer a day ago

            You wouldn't even know that it existed, or how it worked. It would just work. Everybody wants hands off control that they don't have to think or learn about.

            edit: this reminds me of a state agency I once worked for who fired their only IT guy after they moved offices, because the servers were running just fine without him. It was a Kafkaesque trauma for him for a moment, but a massive raise a week later when they were renegotiating for him to come back.

      • idopmstuff a day ago

        > Plus penetration is already very high in the areas where they are objectively useful: programming, customer care etc.

        Is that true? BLS estimates of customer service reps in the US is 2.8M (https://www.bls.gov/oes/2023/may/oes434051.htm), and while I'll grant that's from 2023, I would wager a lot that the number is still above 2M. Similarly, the overwhelming majority of software developers haven't lost their jobs to AI.

        A sufficiently advanced LLM will be able to replace most, if not all of those people. Penetration into those areas is very low right now relative to where it could be.

        • mmmllm a day ago

          Fair point - although there are already so many customer facing chatbots using LLMs rolled out already. Zendesk, Intercom, Hubspot, Salesforce service cloud all have AI features built into their workflows. I wouldn't say penetration is near the peak but it's also not early stage at this point.

          In any case, AI is not capable of fully replacing customer care. It will make it more efficient, but the non-deterministic nature of LLMs means that they need to be supervised for complex cases.

          Besides, I still think even the inference demand for customer care or programming will be small in the grand scheme of things. EVERY Google search (and probably every gmail email) is already passed through an LLM - the demand for that alone is immense.

          I'm not saying demand won't increase, I just don't see how demand increases so much that it offsets the efficiency gains to such an extent that Oracle etc are planning tens or hundreds of times the need for compute in the next couple of years. Or at least I am skeptical of it to say the least.

      • mirekrusin a day ago

        We've seen several orders of magnitude improvements in CPUs over the years, yet you try to do anything now and interaction is often slower than on a ZX Spectrum. We can easily fill an order of magnitude improvement, and that's only going to create more demand. We can/will have models thinking for us all the time, in parallel, and bothering us only with findings/final solutions. There is no limit here really.

      • theptip a day ago

        I’m already throughput-capped on my output via Claude. If you gave me 10x the token/s I’d ship at least twice as much value (at good-enough for the business quality, to be clear).

        There are plenty of usecases where the models are not smart enough to solve the problem yet, but there is very obviously a lot of value available to be harvested from maturing and scaling out just the models we already have.

        Concretely, the $200/mo and $2k/ mo offerings will be adopted by more prosumer and professional users as the product experience becomes more mature.

      • lanza a day ago

        The difference in usefulness between ChatGPT free and ChatGPT Pro is significant. Turning up compute for each embedded usage of LLM inference will be a valid path forward for years.

      • adgjlsfhk1 a day ago

        The problem is that unless you have efficiency improvements that radically alter the shape of the compute vs smartness curve, more efficient compute translates to much smarter compute at worse efficiency.

      • amelius a day ago

        If you can make an LLM solve a problem but from 100 different angles at the same time, that's worth something.

        • mmmllm a day ago

          Isn't that essentially how the MoE models already work? Besides, if that were infinitely scalable, wouldn't we have a subset of super-smart models already at very high cost?

          Besides, this would only apply for very few use cases. For a lot of basic customer care work, programming, quick research, I would say LLMs are already quite good without running it 100X.

          • mcrutcher a day ago

              MoE models are pretty poorly named since all the "experts" are "the same". They're probably better described as "sparse activation" models. MoE implies some sort of "heterogeneous experts" that a "thalamus router" is trained to use, but that's not how they work.
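
              A toy numpy version of that routing, just to make "sparse activation" concrete (sizes are made up):

                  import numpy as np

                  d, n_experts, k = 64, 16, 2
                  rng = np.random.default_rng(0)
                  x = rng.standard_normal(d)                        # one token's hidden state
                  gate = rng.standard_normal((n_experts, d))        # router ("gate") weights
                  experts = rng.standard_normal((n_experts, d, d))  # identically-shaped expert FFNs (toy)

                  scores = gate @ x
                  top = np.argsort(scores)[-k:]                     # pick top-k of n_experts
                  w = np.exp(scores[top]); w /= w.sum()             # softmax over the chosen experts

                  y = sum(wi * (experts[i] @ x) for wi, i in zip(w, top))
                  # only k/n_experts of the expert weights were touched for this token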

          • amelius a day ago

            > if that were infinitely scalable, wouldn't we have a subset of super-smart models already at very high cost

            The compute/intelligence curve is not a straight line. It's probably more a curve that saturates, at like 70% of human intelligence. More compute still means more intelligence. But you'll never reach 100% human intelligence. It saturates way below that.

            • eMPee584 a day ago

              how would you know it converges on human limits, why wouldn't it be able to go beyond, especially if it gets its own world sim sandbox?

              • amelius a day ago

                I didn't say that. It converges well below human limits. That's what we see.

                Thinking it will go beyond human limits is just wishful thinking at this point. There is no reason to believe it.

          • mirekrusin a day ago

            MoE is something different - it's a technique to activate just a small subset of parameters during inference.

            Whatever is good enough now, can be much better for the same cost (time, computation, actual cost). People will always choose better over worse.

            • mmmllm a day ago

              Thanks, I wasn't aware of that. Still - why isn't there a super expensive OpenAI model that uses 1,000 experts and comes up with way better answers? Technically that would be possible to build today. I imagine it just doesn't deliver dramatically better results.

              • Leynos a day ago

                That's what GPT-5 Pro and Grok 4 Heavy do. Those are the ones you pay triple digit USD a month for.

      • takinola a day ago

        I mean 640KB should be enough for anyone too but here we are. Assuming LLMs fulfill the expected vision, they will be in everything and everywhere. Think about how much the internet has permeated everyday life. Even my freaking toothbrush has WiFi now! 1000x demand is likely several orders of magnitude too low in terms of the potential demand (again, assuming LLMs deliver on the promise).

      • sauwan a day ago

        Long running agents?

  • ls65536 a day ago

    I'm not going to speculate about what might be ahead in regards to Oracle's forecasting of data center demand, but regarding the idea of efficiency gains leading to lower demand, don't you think something like Jevons paradox might apply here?

  • Voloskaya a day ago

    People said the same thing for deepseek-r1, and nothing changed.

    If you come up with a way to make the current generation of models 10x more efficient, then everyone just moves to train a 10x bigger model. There isn’t a size of model where the players are going to be satisfied at and not go 10x bigger. Not as long as scaling still pays off (and it does today).

  • stingraycharles a day ago

    Absolutely not; the trends have proven that people will just pay for the best quality they can get, and keep paying roughly the same money.

    Every time a new model is released, people abandon the old, lower quality model (even when it’s priced less), and instead prefer to pay the same for a better model.

    The same will happen with this.

    • mmmllm a day ago

      Sure but the money people are paying right now isn't that much in the grand scheme of things. OpenAI is expecting 13bn in revenue this year. AWS made over 100bn last year. So unless they pay a lot more, or they find customers outside of programmers, designers, etc who are willing to pay for the best quality, I don't see how it grows as fast as it needs to (I'm not saying it won't increase, just not at the rate expected by the data center providers)

    • clvx a day ago

      For early adopters yes but many systems have been running as good enough without any kind of updates for a long time. For many use cases it needs to get to a point where accuracy is good enough and then it will be set and forget. I disagree with the approach but that's what you find in the wild.

    • Zambyte a day ago

      The best quality you can get is at odds with the best speed you can get. There are lots of people (especially with specific use cases) who will pay for the best speed they can get that is high enough quality.

  • thinkingemote a day ago

    If someone had to bet on an AI crash, which I imagine would lead to unused datacentres and cheap GPUs, how would they invest their winnings to exploit these resources?

    • CuriouslyC a day ago

      If the price of inference drops through the floor all the AI wrapper companies become instantly more valuable. Cursor is living on borrowed time because their agents suck and they're coasting on first mover advantage with weak products in general, but their position would get much better with cheap inference.

    • sunir a day ago

      Buy the application layer near winners. When computing costs shrink, usage expands.

    • kridsdale3 a day ago

      Assuming your question isn't rhetorical, massive Oracle Crypto Farm.

  • ACCount37 a day ago

    No. The gains in inference and training efficiency are going to be absorbed by frontier LLM labs being more willing to push more demanding and capable models to the end users, increase reasoning token budgets, etc.

  • jstummbillig a day ago

    For the last 2 years, despite all efficiency gains, I am literally watching characters appear on my screen, as if this was a hacker movie. Lately, I am also waiting for at least 60s for anything to appear at all.

    If that happened at 10x the speed, it would still be slow in computer terms, and that increasingly matters, because I will not be the one reading the stuff – it will be other computers. I think looking back a few years from now, every single piece of silicon that is planned right now will look like a laudable but laughable drop in the ocean.

  • mdp2021 a day ago

    The real quality that demand needs is not there yet, so more processing is very probably needed, and efficiency gains may allow that extra processing.

    (A strong example, read today, of real quality demand needs: the administration of Albania wants some sort of automated Cabinet Minister. Not just an impartial and incorruptible algorithm (what we normally try to do with deterministic computation): a "minister". Good luck with that.)

lend000 a day ago

Prediction: AI will become commoditized ~15 IQ points higher than the state of the art models today, and with larger context, within 4 years as the incremental improvements in training from synthetic data plateaus (we've already used all the "real" data out there) and open source models are cheaply trained on the outputs of the big money models. Then AI development stagnates until someone invents an effective way to use competitive reinforcement learning to train generalized intelligence (similar to how AlphaGo was trained), removing the need for vast quantities of training data. Then, we get real AGI.

  • rmuratov a day ago

    How did we use "all the data"? New knowledge appears on the internet every day, new scientific articles and videos are published.

    • lend000 18 hours ago

      At the speeds AI is moving, we've effectively used it all; the high quality data you need to make smarter models is coming in at a trickle. We're not getting 10^5 Principia Mathematicas published every day. Maybe I just don't have the vision to understand it, but it seems like AI-generated synthetic data for training shouldn't be able to make a smarter model than whatever produced that data. I can imagine synthetic data would be useful for making models more efficient (that's what quantized models are, after all), but not pushing the frontier.

  • HardCodedBias a day ago

    If that's true and if today's frontier models are around 120 IQ (who knows if that is true, but let's run with it, source: https://www.trackingai.org/home) then we'll have an enormous number of ~135 IQ bots with nearly unlimited conscientiousness.

    I can't even begin to understand what that would mean.

    • lend000 18 hours ago

      Very interesting time to be alive.

jbellis a day ago

Added Qwen3 Next to the Brokk Power Ranking Open Round (coding benchmark). It's roughly GPT-OSS-20b strength.

Full set of open weight model results: https://brokk.ai/power-ranking?version=openround&models=ds-r...

  • noahbp a day ago

    Is that the updated Kimi K2, or the old Kimi k2?

    • jbellis 9 hours ago

      It's the original. I'll update the label to clarify.

  • SparkyMcUnicorn a day ago

    This would be a valuable benchmark if it included languages other than Java, and let me see which models are best at the languages I work with.

    My real-world usage does not line up with these results, but I'm not working with Java.

Jgoauh 2 days ago

Seems impressive, I believe better architectures are really the path forward; I don't think you need more than 100B params, given what this model and GPT-OSS 120B can achieve.

  • CuriouslyC a day ago

    We definitely need more parameters; low-param models are hallucination machines, though low active counts are probably fine assuming the routing is good.

  • NitpickLawyer 2 days ago

    New arch seems cool, and it's amazing that we have these published in the open.

    That being said, qwen models are extremely overfit. They can do some things well, but they are very limited in generalisation compared to closed models. I don't know if it's simply scale, or training recipes, or regimes. But if you test them out of distribution (OOD) the models utterly fail to deliver, where the closed models still provide value.

    • vintermann 2 days ago

      Could you give some practical examples? I don't know what Qwen's 36T-token training set is like, so I don't know what it's overfitting to...

      • NitpickLawyer 2 days ago

        Take math and coding for example:

        - in math, if they can solve a problem, or a class of problems, they'll solve it. If you use a "thinking" model + maj@x, you'll get strong results. But if you try, for example, to have the model consider a particular way or method of exploring a problem, it'll default to "solving" mode. It's near impossible to have it do something else with a math problem, other than solving it. Say "explore this part, in this way, using this method". Can't do it. It'll maybe play a bit, but then enter "solving" mode and continue to solve it as it was trained.

        In practice, this means that "massive parallel" test time compute becomes harder to do with these models, because you can't "guide" them towards certain aspects of a problem. They are extremely "stubborn".

        - in coding it's even more obvious. Ask them to produce any 0-shot, often-tested and often-shown thing (SPA, game, visualisation, etc) - and they do it. Convincingly.

        But ask them to look at a piece of code and extract meaning, and they fail. Or ask them to reverse an implementation. Figure out what a function does and reverse its use, or make it do something else, and they fail.

        • vintermann 2 days ago

          Oof, that sounds frustrating. Yeah, I can relate to this failure mode, it's basically "did you mean (more likely query)" up to 11.

          It does sound like an artifact of the dialog/thinking tuning though.

        • CuriouslyC a day ago

          That's the thing people miss that's so good about GPT5. It's incredibly steerable in a way a lot of models aren't.

        • elbear 2 days ago

          It sounds like some people.

jwr 2 days ago

Hmm. 80B. These days I am on the lookout for new models in the 32B range, since that is what fits and runs comfortably on my MacBook Pro (M4, 64GB).

I use ollama every day for spam filtering: gemma3:27b works great, but I use gpt-oss:20b on a daily basis because it's so much faster and comparable in performance.

  • jabart a day ago

    Can you talk more about how you are using ollama for spam filtering?

    • jwr 15 hours ago

      I wrote a little thing that connects to my IMAP server (I run my own E-mail), goes through the unread E-mails in the inbox, processes them (parses MIME multipart, extracts HTML, describes images and links, etc.) and feeds them to an LLM with a prompt. The LLM decides if the message is spam or not.

      It's amazingly accurate.

      The interesting thing is that after experimentation I found that it's best if the prompt doesn't describe what is spam. The LLMs are somewhat "intelligent", so the prompt now describes me — who I am, what I do, my interests, etc. It's much more effective and generalizes better to fight new kinds of spam.

      And a nice side observation is that this kind of system requires no training (so I no longer collect samples of spam) and can't be gamed, because it describes me instead of describing specific kinds of spam.

      I have to write it up in a blog post.
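
      The skeleton is roughly the following (a simplified sketch, not my actual code: the IMAP server, model name, folder names and the "about me" prompt are placeholders, and it assumes a local OpenAI-compatible endpoint such as the one Ollama exposes):

        import imaplib, email, json, urllib.request

        # Placeholders, not a real setup: local OpenAI-compatible endpoint,
        # model name, and the "about me" profile prompt.
        LLM_URL = "http://localhost:11434/v1/chat/completions"
        MODEL = "gpt-oss:20b"
        PROFILE = ("You filter mail for me. About me: <who I am, what I do, "
                   "my interests>. Reply with exactly one word: SPAM or HAM.")

        def classify(text):
            payload = {"model": MODEL, "stream": False,
                       "messages": [{"role": "system", "content": PROFILE},
                                    {"role": "user", "content": text[:8000]}]}
            req = urllib.request.Request(LLM_URL, json.dumps(payload).encode(),
                                         {"Content-Type": "application/json"})
            with urllib.request.urlopen(req) as resp:
                out = json.load(resp)
            return out["choices"][0]["message"]["content"].strip().upper()

        imap = imaplib.IMAP4_SSL("imap.example.com")
        imap.login("me@example.com", "app-password")
        imap.select("INBOX")
        _, unseen = imap.search(None, "UNSEEN")
        for num in unseen[0].split():
            _, data = imap.fetch(num, "(BODY.PEEK[])")   # PEEK keeps it unread
            msg = email.message_from_bytes(data[0][1])
            body = ""
            for part in msg.walk():                      # simplified: plain-text parts only
                if part.get_content_type() == "text/plain":
                    body += part.get_payload(decode=True).decode(errors="replace")
            if classify(f"Subject: {msg['Subject']}\n\n{body}").startswith("SPAM"):
                imap.copy(num, "Junk")
                imap.store(num, "+FLAGS", "\\Deleted")
        imap.expunge()
        imap.logout()

      The real version does more MIME handling (HTML, images, links), but the shape is the same: fetch unread mail, build a text summary, ask the local model for a one-word verdict.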

  • bigyabai a day ago

    The model is 80b parameters, but only 3b are activated during inference. I'm running the old 2507 Qwen3 30B model on my 8gb Nvidia card and get very usable performance.

    • jwr 15 hours ago

      I understand that, but whether it's usable depends on whether ollama can load parts of it into memory on my Mac, and how quickly.

      • bigyabai 6 hours ago

        I really do not suggest ollama. It is slow, missing tons of llama.cpp features and doesn't expose many settings to the user. Koboldcpp is a much better inference provider and even has an ollama-compatible API endpoint.

    • coolspot a day ago

      Yes, but you don't know which 3B parameters you will need, so you have to keep all 80B in your VRAM, or wait until the correct 3B are loaded from NVMe->RAM->VRAM. And of course it could be a different 3B for each next token.

      • drozycki a day ago

        The latest SSDs benchmark at 3GB/s and up. The marginal latency would be trivial compared to the inference time.

KronisLV a day ago

> The Qwen3-Next-80B-A3B-Instruct performs comparably to our flagship model Qwen3-235B-A22B-Instruct-2507, and shows clear advantages in tasks requiring ultra-long context (up to 256K tokens).

This is pretty impressive and a bit like how the GPT-OSS-120B came out and scored pretty well on the benchmarks despite its somewhat limited size.

That said, using LLMs for software dev use cases, I wouldn't call 256K tokens "ultra-long" context, I regularly go over 100K when working on tasks with bigger scope, e.g.:

  Look at the existing code related to this functionality and the existing design patterns in the code as well as the guidelines.
  Then plan out the implementation in detail and ask me a few questions along the way to figure the details out better.
  Finally, based on everything so far, do the actual implementation.
  Then look it over and tell me if anything has been missed from the plan, then refactor the code in any number of ways.
It could be split up into multiple separate tasks, but I find that the context being more complete (unless the model starts looping garbage, which poisons the context) leads to better results.

My current setup of running Qwen3 Coder 480B on Cerebras bumps into the 131K token limit. If not for the inference speed there (seriously great) and good enough model quality, I'd probably look more in the direction of Gemini or Claude again.

slimebot80 2 days ago

Complete newbie here - some questions, if I may!

This stuff can run on a local machine without internet access, correct?

And it can pretty much match Nano Banana? https://github.com/PicoTrex/Awesome-Nano-Banana-images/blob/...

Also -- what are the specs for a machine to run it (even if slowly!)

  • NitpickLawyer 2 days ago

    This model can be run completely offline, yes. You'll need anywhere from 60-200 GB of RAM (either VRAM for high speeds, or a combination of VRAM and RAM, or just CPU+RAM). The active params are really low (3B) so it'll likely run fine even on CPU. Should get 10-15+ t/s even on old DDR4 systems. Offload some experts to a GPU (can be as low as 8-16 GB) and you'll see greater speeds.

    This has nothing to do with nano banana, or image generation. For that you want the qwen image edit[1] models.

    1 - https://huggingface.co/Qwen/Qwen-Image-Edit

  • prawel 2 days ago

    What you mean is Qwen Image and Qwen Image Edit; you can run those on a local machine, using the Draw Things application for example.

    The model discussed here is a text model, similar to ChatGPT. You will also be able to run it on your local machine, but not yet, as apps need to be updated with Qwen3-Next support (llama.cpp, Ollama, etc.).

  • dragonwriter 2 days ago

    > This stuff can run on a local machine without internet access, correct?

    Yes.

    > And it can pretty much match Nano Banana?

    No, Qwen3-Next is not a multimodal model, it has no image generation function.

  • Davidzheng 2 days ago

    Isn't this one a text model?

    • slimebot80 2 days ago

      Ah, maybe! I am lost reading this page with all the terminology

      • arcanemachiner 2 days ago

        You'll get used to it.

        Make sure to lurk on r/LocalLlama.

        • diggan a day ago

          > Make sure to lurk on r/LocalLlama.

          Please do take everything you read there with a bit of salt though, as the "hive-mind" effect is huge there, even when compared to other subreddits.

          I'm guessing the huge influx of money + reputations on the line + a high traffic community is ripe for both hive-minding + influence campaigns.

davidpfarrell a day ago

Hyped for the release, but bummed they fell for the ‘next’ naming convention.

What will the actual next advanced release be called:

* next-next

* next (2)

* actual-next-final

binary132 a day ago

I’ve been using gpt-oss-120B with CPU MoE offloading on a 24GB GPU and it’s very usable. Excited to see if I can get good results on this now!

kristopolous a day ago

I was getting a bunch of strange hallucinations and weird dialog. It sounds like some exasperated person on the verge of a mental breakdown

techsystems 2 days ago

How does the context length scaling at 256K tokens compare to Llama's 1M in terms of performance? How are the contexts treated differently?

jug a day ago

> The Qwen3-Next-80B-A3B-Instruct performs comparably to our flagship model Qwen3-235B-A22B-Instruct-2507

I'm skeptical about these claims. How can this be? Wouldn't there be massive loss of world knowledge? I'm particularly skeptical because a recent trend in Q2 2025 has been benchmaxxing.

  • dragonwriter a day ago

    > I'm skeptical about these claims. How can this be?

    More efficient architecture.

    > Wouldn't there be massive loss of world knowledge?

    If you assume equally efficient architecture and no other salient differences, yes, that’s what you’d expect from a smaller model.

    • jug a day ago

      Hmm. Let's just say if this is true, that this is actually better with such a much lower total parameter count, it's the greatest accomplishment in over a year of LLM development. With the backdrop of benchmaxxing in 2025, I'll believe in this when I see the results on closed benchmarks and SimpleBench. My concern is this might be a hallucination machine.

esafak a day ago

ICYMI qwen3-max was released last week.

cchance a day ago

Those RoPE tests are impressive AF

pveierland 2 days ago

> "The content loading failed."

It's amazing how far and how short we've come with software architectures.

yekanchi 2 days ago

How much VRAM does it require?

  • NitpickLawyer 2 days ago

    A good rule of thumb is to think that one param is one unit of storage. The "default" unit of storage these days is bf16 (i.e. 16 bits for 1 weight). So for an 80B model that'll be ~160GB of weights. Then you have quantisation, usually in 8bit and 4bit. That means each weight is "stored" in 8bits or 4bits. So for an 80B model that'll be ~80GB in fp8 and ~40GB in fp4/int4.

    But in practice you need a bit more than that. You also need some space for the KV cache (which grows with the context length you want to serve), potentially a model graph, etc.

    So you'll see in practice that you need 20-50% more RAM than this rule of thumb.

    For this model, you'll need anywhere from 50GB (tight) to 200GB (full) RAM. But it also depends how you run it. With MoE models, you can selectively load some experts (parts of the model) in VRAM, while offloading some in RAM. Or you could run it fully on CPU+RAM, since the active parameters are low - 3B. This should work pretty well even on older systems (DDR4).
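
    To make the rule of thumb concrete, here's the rough arithmetic (a sketch only; the KV-cache line uses a made-up standard-attention config for illustration, not Qwen3-Next's actual hybrid linear-attention layout, which needs far less cache):

      # Back-of-envelope memory estimate (sketch, not exact).
      def weights_gb(params_b, bits):
          return params_b * 1e9 * bits / 8 / 1e9   # params (billions) * bytes/weight

      def kv_cache_gb(layers, kv_heads, head_dim, ctx, bytes_per=2):
          return 2 * layers * kv_heads * head_dim * ctx * bytes_per / 1e9  # K and V

      for bits in (16, 8, 4):
          print(f"80B weights @ {bits}-bit: ~{weights_gb(80, bits):.0f} GB")

      # Illustrative: 48 layers, 8 KV heads of dim 128, fp16 cache, 256K context
      print(f"KV cache @ 256K ctx: ~{kv_cache_gb(48, 8, 128, 256_000):.0f} GB")

    That prints ~160/80/40 GB for the weights and tens of GB for a full-length standard-attention KV cache, which is why large context windows add so much on top of the weights themselves.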

    • johntash a day ago

      Can you explain how context fits into this picture, by any chance? I sort of understand the VRAM requirement for the model itself, but it seems like larger context windows increase the RAM requirement by a lot more?

    • theanonymousone 2 days ago

      But the RAM+VRAM can never be less than the size of the total (not active) model, right?

      • NitpickLawyer 2 days ago

        Correct. You want everything loaded, but for each forward pass just some experts get activated so the computation is less than in a dense model.

        That being said, there are libraries that can load a model layer by layer (say from an ssd) and technically perform inference with ~8gb of RAM, but it'd be really really slow.

  • DiabloD3 2 days ago

    That's not a meaningful question. Models can be quantized to fit into much smaller memory requirements, and not all MoE layers (in MoE models) have to be offloaded to VRAM to maintain performance.

    • yekanchi 2 days ago

      I mean 4-bit quantized. I can roughly calculate VRAM for dense models from model size, but I don't know how to do it for MoE models.

      • DiabloD3 a day ago

        Same calculation, basically. Any given ~30B model is going to be the same size and use the same VRAM (assuming you load it all into VRAM, which MoEs do not need to do).

      • EnPissant 2 days ago

        MoE models need just as much VRAM as dense models because every token may use a different set of experts. They just run faster.
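
        To illustrate why (a toy numpy sketch of generic top-k routing, not any particular model's actual implementation): each token's hidden state picks its own top-k experts, so you can't know ahead of time which expert weights will be needed.

          import numpy as np

          rng = np.random.default_rng(0)
          d, n_experts, top_k = 64, 128, 4
          router = rng.standard_normal((d, n_experts)) * 0.02
          experts = [(rng.standard_normal((d, 4 * d)) * 0.02,
                      rng.standard_normal((4 * d, d)) * 0.02)
                     for _ in range(n_experts)]       # tiny stand-in expert MLPs

          def moe(x):
              scores = x @ router                     # router logits per expert
              idx = np.argsort(scores)[-top_k:]       # top-k experts for this token
              g = np.exp(scores[idx] - scores[idx].max())
              g /= g.sum()                            # softmax gate over the chosen k
              out = np.zeros_like(x)
              for gate, i in zip(g, idx):
                  w1, w2 = experts[i]
                  out += gate * (np.maximum(x @ w1, 0) @ w2)  # weighted sum of experts
              return out, sorted(idx.tolist())

          for t in range(3):                          # different tokens, different experts
              _, chosen = moe(rng.standard_normal(d))
              print(f"token {t}: routed to experts {chosen}")

        Only the selected experts are computed per token, which is where the speedup comes from, but all of them have to stay loaded somewhere.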

        • regularfry 2 days ago

          This isn't quite right: it'll run with the full model loaded to RAM, swapping in the experts as it needs. It has turned out in the past that experts can be stable across more than one token so you're not swapping as much as you'd think. I don't know if that's been confirmed to still be true on recent MoEs, but I wouldn't be surprised.

          • mcrutcher a day ago

            Also, though nobody has put the work in yet, the GH200 and GB200 (the NVIDIA "superchips") support exposing their full LPDDR5X and HBM3 as UVM (unified virtual memory), with much more memory bandwidth between LPDDR5X and HBM3 than a typical "instance" gets over PCIe. UVM can handle "movement" in the background and would be absolutely killer for these MoE architectures, but none of the popular inference engines allocate memory the right way for this (cudaMallocManaged()), or let UVM (CUDA) handle data movement for them (automatic page migration and dynamic data movement), or are architected to avoid the pitfalls of this environment (being aware of the implications of CUDA graphs when using UVM).

            It's really not that much code, though, and all the actual capabilities are there as of about mid this year. I think someone will make this work and it will be a huge efficiency for the right model/workflow combinations (effectively, being able to run 1T parameter MoE models on GB200 NVL4 at "full speed" if your workload has the right characteristics).

          • EnPissant 2 days ago

            What you are describing would be uselessly slow and nobody does that.

            • DiabloD3 a day ago

              I don't load all the MoE layers onto my GPU, and I see only about a 15% reduction in token generation speed while running a model 2-3 times larger than my VRAM alone could hold.

              • EnPissant a day ago

                The slowdown is far more than 15% for token generation. Token generation is mostly bottlenecked by memory bandwidth: dual-channel DDR5-6000 has 96GB/s and an RTX 5090 has 1.8TB/s. See my other comment where I show a 5x slowdown in token generation from moving just the experts to the CPU.
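
                Rough ceiling math (a sketch that treats decode as purely bandwidth-bound; offloading only the experts keeps the attention weights on the GPU, so the real-world gap, like the 5x in my other comment, is smaller than the raw bandwidth ratio):

                  # Upper bound on decode speed if it were purely bandwidth-bound.
                  def max_tok_s(active_params_b, bytes_per_weight, bw_gb_s):
                      per_token_gb = active_params_b * bytes_per_weight
                      return bw_gb_s / per_token_gb

                  ACTIVE, BPW = 3.0, 0.5   # ~3B active params, ~4-bit weights
                  print(max_tok_s(ACTIVE, BPW, 1800))  # RTX 5090, 1.8 TB/s -> ~1200 tok/s
                  print(max_tok_s(ACTIVE, BPW, 96))    # dual-ch DDR5-6000   -> ~64 tok/s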

                • DiabloD3 16 hours ago

                  I suggest figuring out what your configuration problem is.

                  Which llama.cpp flags are you using, because I am absolutely not having the same bug you are.

                  • EnPissant 16 hours ago

                    It's not a bug. It's the reality of token generation. It's bottlenecked by memory bandwidth.

                    Please publish your own benchmarks proving me wrong.

            • furyofantares a day ago

              I do it with gpt-oss-120B on 24 GB VRAM.

              • EnPissant a day ago

                You don't. You run some of the layers on the CPU.

                • furyofantares a day ago

                  You're right that I was confused about that.

                  LM Studio defaults to 12/36 layers on the GPU for that model on my machine, but you can crank it to all 36 on the GPU. That does slow it down but I'm not finding it unusable and it seems like it has some advantages - but I doubt I'm going to run it this way.

                  • EnPissant 21 hours ago

                    FWIW, that's an 80GB model and you also need the KV cache. You'd need 96GB-ish to run it all on the GPU.

                    • furyofantares 21 hours ago

                      Do you know if it's doing what was described earlier, when I run it with all layers on GPU - paging an expert in every time the expert changes? Each expert is only 5.1B parameters.

                      • furyofantares 18 hours ago

                        ^ Er, misspoke: each expert is at most ~0.9B parameters, and there are 128 experts. 5.1B is the number of active parameters (4 experts + some other parameters).

                      • EnPissant 20 hours ago

                        It makes absolutely no sense to do what OP described. The decode stage is bottlenecked on memory bandwidth. Once you pull the weights from system RAM, your work is almost done. To then send gigabytes of weights PER TOKEN over PCIe to do some trivial computation on the GPU is crazy.

                        What actually happens is you run some or all of the MoE layers on the CPU from system RAM. This can be tolerable for smaller MoE models, but keeping it all on the GPU will still be 5-10x faster.

                        I'm guessing lmstudio gracefully falls back to running _something_ on the CPU. Hopefully you are running only the MoE layers on the CPU. I've only ever used llama.cpp.

                        • furyofantares 18 hours ago

                          I tried a few things and checked CPU usage in Task Manager to see how much work the CPU is doing.

                          KV Cache in GPU and 36/36 layers in GPU: CPU usage under 3%.

                          KV Cache in GPU and 35/36 layers in GPU: CPU usage at 35%.

                          KV Cache moved to CPU and 36/36 layers in GPU: CPU usage at 34%.

                          I believe you that it doesn't make sense to do it this way, it is slower, but it doesn't appear to be doing much of anything on the CPU.

                          You say gigabytes of weights PER TOKEN, is that true? I think an expert is about 2 GB, so a new expert is 2 GB, sure - but I might have all the experts for the token already in memory, no?

                          • EnPissant 17 hours ago

                            gpt-oss-120b chooses 4 experts per token and combines them.

                            I don't know how lmstudio works, I only know the fundamentals. There is no way it's sending experts to the GPU per token. Also, the CPU doesn't have much work to do. It's mostly waiting on memory.

                            • furyofantares 15 hours ago

                              > There is not way it's sending experts to the GPU per token.

                              Right, it seems like either experts are stable across sequential tokens fairly often, or there's more than 4 experts in memory and it's stable within the in-memory experts for sequential tokens fairly often, like the poster said.

            • zettabomb a day ago

              llama.cpp has built-in support for doing this, and it works quite well. Lots of people running LLMs on limited local hardware use it.

              • EnPissant a day ago

                llama.cpp has support for running some of or all of the layers on the CPU. It does not swap them into the GPU as needed.

            • regularfry a day ago

              It's neither hypothetical nor rare.

              • EnPissant a day ago

                You are confusing that with running some of the layers on the CPU.

            • bigyabai a day ago

              I run the 30B Qwen3 on my 8GB Nvidia GPU and get a shockingly high tok/s.

              • EnPissant a day ago

                For contrast, I get the following for an RTX 5090 and Qwen3 Coder 30B quantized to ~4 bits:

                - Prompt processing 65k tokens: 4818 tokens/s

                - Token generation 8k tokens: 221 tokens/s

                If I offload just the experts to run on the CPU I get:

                - Prompt processing 65k tokens: 3039 tokens/s

                - Token generation 8k tokens: 42.85 tokens/s

                As you can see, token generation is over 5x slower. This is only using ~5.5GB VRAM, so the token generation could be sped up a small amount by moving a few of the experts onto the GPU.

            • littlestymaar a day ago

              AFAIK many people on /r/localLlama do pretty much that.

pzo a day ago

It would be interesting to see how it compares to gpt-oss-120b. The latter also runs very fast, and its pricing is currently much better than qwen3-next's on many providers. I would expect that if this model is this fast, pricing should be similar or even lower.

keyle 2 days ago

For a model that can run offline, they've nailed how the website can too.

And it appears like it's thinking about it! /s

croemer 2 days ago

ERR_NAME_NOT_RESOLVED

siliconc0w a day ago

All these new datacenters are going to be a huge sunk cost. Why would you pay OpenAI when you can host your own hyper-efficient Chinese model for like 90% less cost at 90% of the performance? And that is compared to today's subsidized pricing, which they can't keep up forever.

  • hadlock a day ago

    Eventually Nvidia or a shrewd competitor will release 64/128GB consumer cards; locally hosted GPT-3.5+ is right around the corner, and we're just waiting for consumer hardware to catch up at this point.

    • mft_ 13 hours ago

      I think we're still at least an order of magnitude away (in terms of affordable local inference, or model improvements to squeeze more from less, or a combination of the two) from local solutions being seriously competitive for general purpose tasks, sadly.

      I recently bought a second-hand 64GB Mac to experiment with. Even with the biggest recent local model it can run (llama3.3:70b just about runs acceptably; I've also tried an array of Qwen3 30B variants), the quality is lacking for coding support. They can sometimes write and iterate on a simple Python script, but sometimes fail, and the general-purpose models often fail to answer questions accurately (not surprisingly, considering the model is a compression of knowledge, and these are comparatively small models). They are far, far away from the quality and ability of the currently available Claude/Gemini/ChatGPT models. And even with a good eBay deal, the Mac cost the current equivalent of ~6 years of a monthly subscription to one of these.

      Based on the current state of play, once we can access relatively affordable systems with 512-1024GB fast (v)ram and sufficient FLOPs to match, we might have a meaningfully powerful local solution. Until then, I fear local only is for enthusiasts/hobbyists and niche non-general tasks.

  • GaggiX a day ago

    >to today's subsidized pricing, which they can't keep up forever.

    The APIs are not subsidized, they probably have quite the large margin actually: https://lmsys.org/blog/2025-05-05-large-scale-ep/

    >Why would you pay OpenAI when you can host your own hyper efficient Chinese model

    The 48GB of VRAM or unified memory required to run this model at 4bits is not free either.

    • siliconc0w a day ago

      I didn't say it's free, but it is about 90% cheaper. Sonnet is $15 per million output tokens; this just dropped and is available on OpenRouter at $1.40. Even compared to Gemini Flash ($2.50), which is probably the best price-to-performance API yet is generally ranked lower than Qwen's models, this is still 44% cheaper.