An 80% size reduction is no joke, and the fact that the 1.58-bit version runs on dual H100s at 140 tokens/s is kind of mind-blowing. That said, I’m still skeptical about how practical this really is for most people. Like, yeah, you can run it on 24GB VRAM or even with just 20GB RAM, but "slow" is an understatement—those speeds would make even the most patient person throw their hands up.
And then there’s the whole repetition issue. Infinite loops with "Pygame’s Pygame’s Pygame’s" kind of defeats the point of quantization if you ask me. Sure, the authors have fixes like adjusting the KV cache or using min_p, but doesn’t that just patch a symptom rather than solve the actual problem? A fried model is still fried, even if it stops repeating itself.
On the flip side, I love that they’re making this accessible on Hugging Face... and the dynamic quantization approach is pretty brilliant. Using 1.58-bit for MoEs and leaving sensitive layers like down_proj at higher precision—super clever. Feels like they’re squeezing every last drop of juice out of the architecture, which is awesome for smaller teams who can’t afford OpenAI-scale hardware.
"accessible" still comes with an asterisk. Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup? For that price, I’d rather build a rig with used 3090s and get way more bang for my buck (though, yeah, it’d be a power hog). Cool tech—no doubt—but the practicality is still up for debate. Guess we'll see if the next-gen models can address some of these trade-offs.
Oh, the repetition issue is only on the non-dynamic quants :) If you use the 1.58-bit dynamic quantized model, the repetition issue fully disappears!
min_p = 0.05 was a way I found to counteract the 1.58-bit model generating the occasional incorrect token, which happens around once per 8,000 tokens!
I think most of the model creators share their model usage examples so high at 0.6-0.7 simply because it's what a lot of the client apps use. IMO this is WAY too high unless you're doing creative writing.
Generally I set temp to 0-0.4 at absolute most.
min_p actually needs a little temperature to work effectively, so with min_p I almost always use a temperature of 0.2.
Yeah, lower temp is also good :) Tbh it's all trial and error - I found temp=1.5, min_p=0.1 to be very useful for pass@k type workloads, i.e. calling the LLM multiple times and aggregating.
temp=0 is also good for singular outputs. For classification tasks, it's better to actually inspect the logits.
But my go-to setting is to always set min_p to at least 0.01 or 0.05! It vastly suppresses incorrect rare random tokens from being created, and it helps massively!
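As a rough sketch of how those sampler settings might be passed in practice (parameter names follow llama-cpp-python, assuming a recent build that exposes min_p; the GGUF filename is just a placeholder):

    from llama_cpp import Llama

    # Placeholder filename; point this at whichever GGUF quant you downloaded.
    llm = Llama(model_path="DeepSeek-R1-UD-IQ1_S.gguf", n_ctx=8192)

    out = llm(
        "Explain the difference between a mutex and a semaphore.",
        temperature=0.2,   # low temp for single, mostly deterministic answers
        min_p=0.05,        # suppress the rare "incorrect" tokens (~1 in 8000)
        max_tokens=512,
    )
    print(out["choices"][0]["text"])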
>Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup?
AMD's Strix Halo APU will have quad-channel memory and will launch soon, so expect these kinds of setups to be available for much less. Apple is charging an arm and a leg for memory upgrades; hopefully we get competition soon. From what I saw at CES, OEMs are paying attention to this use case as well - hopefully not following suit on RAM markups.
Keep in mind the strix halo APU has a 256 bit wide memory bus and the Mac Ultra has a 1024 bit wide memory bus.
Here's hoping the Nvidia Digit (GB10 chip) has a 512 bit or 1024 bit wide interface, otherwise the Strix Halo will be the best you can do if you don't get the Mac Ultra.
I mean, it remains to be seen whether it will be compute or bandwidth bound; I am sure the Mac Ultra will also have double or triple the compute.
But in either case it's going to do much better than currently available CPUs with easily upgradeable RAM. I would not be surprised to see 128GB configurations for around 3k (going off the announced ASUS G13 pricing of around 2k for the 32GB version, and them saying it will go up to 128GB).
At that point, sure, it might not compete with the Mac, but it's at a much more acceptable price point; it will not be a device you get just for the AI, but a mobile workstation that you can also run some local models on for normal money. Will need to wait and see. I know I am not buying anything from ASUS either way.
Highly depends on how accessible AMD makes these boards; if a lot of OEMs get it, there will be good deals for sure. DDR5 prices are nowhere near Apple markups.
> That said, I’m still skeptical about how practical this really is for most people.
I've been running Open WebUI for months now for me and some friends as a front-end to one of the API providers (deepinfra in my case, but there are many others, see https://artificialanalysis.ai/).
Having 1.58-bit is very practical for me. I'm very much looking forward to the API provider adding this model to their system. They also added a Llama turbo (also quantized) a few months back, so I have good hopes.
At my work, we self-host some models and have found that for anything remotely similar to RAG, or for use cases that are very specific, the quantized models have proven to be more than sufficient. This helps us keep them running on smaller infra and at generally lower costs.
Personally I've noticed major changes in performance between different quantisations of the same model.
Mistral's large 123B model works well (but slowly) at 4-bit quantisation, but if I knock it down to 2.5-bit quantisation for speed, performance drops to the point where I'm better off with a 70B 4-bit model.
This makes me reluctant to evaluate new models in heavily quantised forms, as you're measuring the quantisation more than the actual model.
That's a fair point - the trick with dynamic quants is we selectively choose not to quantize many components - i.e. attention is left at 4 or 6-bit, and just the MoE parts are 1.58-bit (-1, 0, 1).
There are distilled versions (Qwen 1.5B, 7B, 14B, 32B and Llama 8B, 70B), but those are distilled - if you want to run the original R1, then the quants are currently the only way.
But I agree quants do affect perf - hence the trick for MoEs is to not quantize specific areas!
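A toy sketch of that selective-quantization idea (illustrative only, not Unsloth's actual code; the tensor-name matching is made up):

    def choose_bits(tensor_name: str) -> float:
        """Illustrative per-tensor bit choice, not the real policy."""
        if "experts" in tensor_name:
            return 1.58   # MoE expert weights (the bulk of 671B params) go ternary: -1, 0, +1
        if "attn" in tensor_name or "down_proj" in tensor_name:
            return 6.0    # attention and other sensitive tensors stay at higher precision
        return 4.0        # embeddings, norms, everything else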
I just ran it on 48GB (2x 3090) + overflow into CPU RAM and it runs at around 4 tok/s (only an 8k context size though), which, while absolutely not something I'd personally use daily, is actually usable.
Not everyone needs the largest model. There are variations of R1 with fewer parameters that can easily run on consumer hardware. With an 80% size reduction, you could run a 70B 8-bit model on an RTX 3090.
Other than that, if you really need the big one, you can get six 3090s and you're good to go. It's not cheap, but you're running a ChatGPT-equivalent model from your basement. A year ago this was a wet dream for most enthusiasts.
There's a huge difference, both in capabilities and in meaning, between "variations of R1" and "R1 distill". Ollama is intentionally misleading people on this, but the distills are much, much worse.
I ran whatever version Ollama downloaded on a 3070ti (laptop version). It's reasonably fast. Generative stuff can get weird if you do prompts like "in the style of" or "a new episode of" because it doesn't seem to have much pop culture in its training data. It knows the Stargate movie, for example, and seems to have the IMDB info for the series, but goes absolutely ham trying to summarize the series.
This line in the stuff inside the <think> section suggests it's also been trained on YouTube clips:
>> "I'm not entirely sure if I got all the details right, but this is what I remember from watching clips and summaries
online."
An excerpt from the generated summary:
>> "Set in the 23rd century during a Z-Corp invasion, the series features action sequences, strategic thinking, and humor. It explores themes of international espionage, space warfare, and humanity's role in the cosmos. The show incorporates musical numbers and catchy theme songs for an engaging viewing experience. The plot involves investigating alien warships and their secret base on Kessari planet while addressing personal conflicts and philosophical questions about
space."
"It explores themes of international espionage, space warfare, and humanity's role in the cosmos" is the closest to correct line in the whole output.
Ollama has been deliberately misrepresenting R1 distill models as "R1" for marketing purposes. A lot of "AI" influencers on social media are unabashedly doing the same. Ollama's default "R1" model is a 4-bit RTN quantized 7B model, which is nowhere close to the real R1 (a 671B parameter fp8 MoE).
No, they are not - they intentionally removed every reference to this not being R1 from the CLI and changed the names from the ones both DeepSeek and Hugging Face used.
> DeepSeek's first-generation of reasoning models with comparable performance to OpenAI-o1, including six dense models distilled from DeepSeek-R1 based on Llama and Qwen.
Well I guess if you are in the Enterprise Java naming model you would expect something like "VisitorModelUtilsListGetterAdapterInterceptorMessageManagerDrivenObserverPool"
If you look at their API docs you will see:
model: name of the model to push in the form of <namespace>/<model>:<tag>
I don't think there is any reason to jump to the conclusion that there is some type of conspiracy here; it's just naming things based on an API that probably didn't account for distillation when it was created.
Yeah, they're so clear in fact that they call the distilled models "R1" in the url and everywhere on the page[1], instead of using the "DeepSeek-R1-Distill-" prefix, as DeepSeek themselves do[2].
It's fairly clear that R1-Llama or R1-Qwen is a distill, and they're all coming directly from DeepSeek.
As an aside, at least the larger distilled models (I'm mostly running r1-llama-distill-70b) are definitely not the same thing as the base llama/qwen models. I'm getting better results locally, admittedly with the slower inference time as it does the whole "<think>" section.
Surprisingly - The content in the <think> section is actually quite useful on its own. If you're using the model to spitball or brainstorm, getting to see it do that process is just flat out useful. Sometimes more-so than the actual answer it finally produces.
It's a model called Qwen, trained by Alibaba, which the DeepSeek team has used to "distill" knowledge from their own (100x bigger) model.
Think of it as forcing a junior Qwen to listen in while the smarter, PhD-level model was asked thousands of tough problems. It will acquire some of that knowledge and learn a lot of the reasoning process.
It cannot become exactly as smart, for the same reason a dog can learn lots of tricks from a human but not become human-level itself: it doesn't have enough neurons/capacity. Here, Qwen is a 7B model so it can't cram within 7 billion parameters as much data as you can cram into 671 billion. It can literally only learn 1% as much, BUT the distillation process is cleverly built and allows to focus on the "right" 1%.
Then this now-smarter Qwen is quantized. This means that we take its parameters (16-bit floats, super precise numbers) and truncate them to make them use less memory space. This also makes it less precise.
Think of it as taking a super high resolution movie picture and compressing it into a small GIF. You lose some information, but the gist of it is preserved.
As a result of both of these transformations, you get something that can run on your local machine — but is a bit dumber than the original — because it's about 400 times smaller than the real deal.
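A back-of-the-envelope check of that "about 400 times smaller" figure (assuming 16-bit weights for the full model and 4-bit weights for the quantized distill):

    r1_bytes      = 671e9 * 2      # 671B params at 2 bytes each, ~1.34 TB
    distill_bytes = 7e9 * 0.5      # 7B params at 0.5 bytes each, ~3.5 GB
    print(r1_bytes / distill_bytes)  # ~383x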
"Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud."
And I think they, the DeepSeek team, fine-tuned Qwen 7B on DeepSeek-R1 outputs. That is how I understood it.
Which apparently makes it quite good for a 7B model. But, again, if I understood it correctly, it is still just Qwen, without the reasoning of DeepSeek.
In my application, code generation, the distilled DeepSeek models (7B to 70B) perform poorly. They imitate the reasoning of the r1 model, but their conclusions are not correct.
The real r1 model is great, better than o1, but the distilled models are not even as good as the base models that they were distilled from.
It is hilariously bad at writing erotica when I've used jailbreaks on it. Its knowledge is the equivalent of a 1980s college kid with no access to pornography who watched an R-rated movie once.
That's like trying to assemble an Ikea bookshelf with a bulldozer. All that extra power is doing nothing for the task you're asking of it, and there are plenty of lightweight alternatives.
I have MLC Chat on my old Note 9 phone. It is actually still a great phone, but it has 5GB of RAM. Running an on-device model is the first and only use case where the RAM actually matters.
And it has a headphone jack, OK? I just hate Bluetooth earbuds. And yeah, it's a problem, but I digress.
When I run a 2.5B model, I get respectable output. Takes a minute or two to process the context, then output begins at somewhere on the order of 4 to 10 tokens per sec.
So, I just make a query and give it a few and I have my response.
Here is how I see it:
That little model - which is Gemma 2 2B, sorry - knows a lot of stuff. It has knowledge I don't, and it gives it to me in a reasonable, though predictable, way. Answers always have a certain "teacher reminding a student how it all goes" quality.
I don't care. Better is nice, but if I were stuck somewhere with no network, being able to query that model is amazing!
First aid, how to make fires, materials and uses. Fixing stuff, theories of operation, what things mean and more are in that thing ready for me to take advantage of.
I consider what I have fast. And it will get one or two orders of magnitude faster over the next few years too.
I did it on a lark (ask the model what that means) and was surprised to see I gained a nice tool.
> First aid, how to make fires, materials and uses
This scares me more than it should...
Please do not trust an AI in actual life and death situations... Sure if it is literally your only option, but this implies you have a device on you that could make a phone call to an emergency number where a real human with real training and actually correct knowledge can assist you.
Even as an avid hiker, the number of times I've been out of cell service is minuscule, and I absolutely refresh my knowledge of first aid regularly, along with any potential threats, before a hike somewhere new.
I fired the model up at 8:39, ready to chat just a bit before 8:40, during which I input the following prompt:
Please define "lark" for me, as in "I did it on a lark."
Output began a bit before 8:41, and was complete before 8:43:
---Begin Model Output
"On a lark" means without much planning or reason, impulsively or unexpectedly.
Here's how it works:
"I did it on a lark." This means the person acted without a specific plan or reason.
"A lark" refers to the lark bird, a small songbird known for its playful nature.
So, "on a lark" suggests something spontaneous, casual, and often with a touch of playful whimsy.
Here are some other ways to understand it:
Without much thought: The person was likely to do it quickly and without much consideration.
Based on a whim: It's something they decided to do without a solid plan.
You know that's not the point. You can't even ask the dictionary app a question. You more or less just input and it'll give you what it has.
I can ask this model some pretty intense things and I get meaningful output back and it doesn't take very long and I don't need an internet connection. Like I said it's spiffy.
I have used it, will use it again sans network connection, and will be happy I had the option.
Better will come. That is just fine. I have something very useful today, right now.
Layman here — but I am hopeful for 1.58 bit plus custom silicon to be the Holy Grail. I suppose I am setting high expectations on Apple to integrate said in their next "A" chip.
I do want a 192GB Mac Ultra, I'm hoping the Nvidia Digit achieves similar at $3,000. Sadly no specifications or benchmarks, so tokens/sec is just a guess at this point.
Random observation 1: I was running DeepSeek yesterday on my Linux box with an RTX 4090, and I noticed that the models really need to fit into VRAM (24GB here), or they are simply slow. So the Apple shared memory architecture has an advantage here. A 192GB Mx Ultra can load and process large models efficiently.
Random observation 2: It's time to cancel the OpenAI subscription.
I canceled my OpenAI subscription last night, as did many, many others. There were some threads on Reddit with everyone chiming in that they had just canceled too. IMO OpenAI is done, and will go through massive cuts and probably be acquired by the end of the year for a very tiny fraction of its current value.
You want to bet?
The panic around deepseek is getting completely disconnected from reality.
Don't get me wrong, what DS did is great, but anyone thinking this reshapes the fundamental trend of scaling laws and makes compute irrelevant is dead wrong.
I'm sure OpenAI doesn't really enjoy the PR right now, but guess what OpenAI/Google/Meta/Anthropic can do if you give them a recipe for 11x more efficient training? They can scale it to their 100k-GPU clusters and still blow everything else away.
This will be textbook Jevons paradox.
Compute is still king and OpenAI has worked on their training platform longer than anyone.
Of course as soon as the next best model is released, we can train on its output and catch up at a fraction of the cost, and thus the infinite bunny hopping will continue.
> The panic around deepseek is getting completely disconnected from reality.
This entire hype cycle has long been completely disconnected from reality. I've watched a lot of hype waves, and I've never seen one that oscillates so wildly.
I think you're right that OpenAI isn't as hurt by DeepSeek as the mass panic would lead one to believe, but it's also true that DeepSeek exposes how blown out of proportion the initial hype waves were and how inflated the valuations are for this tech.
Meta has been demonstrating for a while that models are a commodity, not a product you can build a business on. DeepSeek proves that conclusively. OpenAI isn't finished, but they need to continue down the path they've already started and give up the idea that "getting to AGI" is a business model that doesn't require them to think about product.
In a sense it doesn't, in that if DeepSeek can do this, making OpenAI-type capabilities available for Llama-type infrastructure costs, then if you apply OpenAI scale infrastructure again to a much more efficient training/evaluation system, everything multiplies back up. I think that's where they'll have to head: using their infrastructure moat (such as it is) to apply these efficiency learnings to allow much more capable models at the top end. Yes, they can't sleep-walk into it, but I don't think that was ever the game.
> The panic around deepseek is getting completely disconnected from reality.
Couldn’t agree more! Nobody here read the manual. The last paragraph of DeepSeek’s R1 paper:
> Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.
Just based on my evaluations so far, R1 is not even an improvement on V3 in terms of real world coding problems because it gets stuck in stupid reasoning loops like whether “write C++ code to …” means it can use a C library or has to find a C++ wrapper which doesn’t exist.
OpenAI's issue might be that it is extremely inefficient with money (high salaries, high compute costs, high expenses, etc.). This is fine when you have an absolute monopoly, as investors will throw money your way (OpenAI is burning cash), but once an alternative is clear, you can no longer do that.
OpenAI doesn't have a bigger advantage in compute than Google, Microsoft, or anyone else with a few billion dollars.
Oh wow. I have been using Kagi premium for months and never noticed that their AI assistant now has all the good AIs too. I was using Kagi exclusively for search, and Perplexity for AI stuff. I guess I can cut down on my subscriptions too. Thanks for the hint. (Also, I noticed that Kagi has a PWA for their AI assistant, which is also cool.)
Compute is not king; DeepSeek just demonstrated otherwise. And yes, OpenAI will have to reinvent itself to copy DS, but this means they'll have to throw away a lot of their investment in existing tech. They might recover, but it is not a minor hiccup as you suggest.
I just don't see how this is true. OpenAI has a massive cash & hardware pile -- they'll adapt and learn from what DeepSeek has done and be in a position to build and train 10x-50x-100x (or however) faster and better. They are getting a wake-up call for sure but I don't think much is going to be thrown away.
In my experience with DeepSeek and o1, OpenAI's big talk about (and investment in) hallucination avoidance might save their hides here. DeepSeek may be smarter, and understand complex problems better, but it also seems to make mistakes more often. (It's as if its comprehension is better, but it's worse at memorization/recall.)
Need an LLM to one-shot some complex network scripting? As of last night, o1 is still where it's at.
My experience gels with yours. Given the same code sample, DeepSeek has better, more creative suggestions about how to improve it, but it can't implement them without breaking the code. o1, generally, can implement DeepSeek's suggestions successfully. I think chaining them together might have quite interesting results.
That's OK if all you want to know is which model to use today, but a test like that is totally dependent on training data, and there is no reason to expect that either DeepSeek-V3 (the base model for R1) or the additional training data for R1 is the same as what OpenAI used for o1 and whatever base model it was built on.
The benchmark comparisons are perhaps, for now, the best way to compare reasoning prowess of R1 vs O1, since it seems pretty certain they both trained for those cases.
I think the real significance of R1 isn't the released model/weights itself, but more the paper detailing (sans training data) how to replicate it, and how effective "distillation" (i.e. generate synthetic reasoning data for SFT) can be to enhance reasoning even without using RL.
The big deal here isn't that R1 makes any other models obsolete in terms of performance, but how cheap it is: $2 vs $60 per million output tokens compared to o1 (which it matches in benchmark performance).
O1 vs R1 performance on specific non-benchmark problems is also not that relevant until people have replicated R1 and/or tried fine-tuning it with additional data. What would be interesting to see is whether (given the different usage of RL) there is any difference in how well R1 vs O1 generalize to reasoning capability over domains they were not specifically trained for. I'd expect that neither do that well, but not knowing details of what they were trained on makes it hard to test.
1. You can get all the models by buying Kagi subscription (excluding o1). Includes DeepSeek models. You can also feed the assistant with search data that you can filter.
2. If you have GitHub Copilot, you get o1 chat also there.
I haven't seen much value with OpenAI subscription for ages.
I have Kagi Ultimate and it is nice for this. But a cheaper suggestion would be to use OpenRouter and then use these models via Fireworks or TogetherAI. It also integrates into much more applications. AFAIK Kagi doesn't document a user facing API for the assistant feature.
Sure. I meant moreso that this would be cheaper than Kagi while providing the same selection of models.
As for deepseek, I couldn't even sign up because my email domain is not on their whitelist. To just try it out for now I don't mind the increased cost.
I disagree, I don't really need "conversational chat responses", I need multimodal
ChatGPT is the king of the multimodal experience still. Anthropic is a distant second, only because it lets you upload images from the clipboard and responds to them, but it can't do anything else like generate images - sometimes it will do a flowchart, which is kind of cool, and GPT won't do that - but will it speak to you, have tones, listen to you? No.
And in the open source side, this area has been stagnant for like 18 months. There is no cohesive multimodal experience yet. Just a couple vision models with chat capabilities and pretty pathetic GUIs to support them. You have to still do everything yourself there.
There is huge utility for me, and many others who don't know it yet, if we could just load a couple of models at once that work together seamlessly in a single GUI, like how ChatGPT works.
The real insult here is graphics card vendors refusing to make ones with more than 24GB for several years now. They do this so you'll have to buy several cards for your AI workstation. Hopefully Apple eating their lunch fixes this.
> They do this so you'll have to buy several cards for your AI workstation.
AFAIK you can't do that with newer consumer cards, which is why this became an annoyance. Even a RTX 4070 Ti with its 12 GB would be fine, if you could easily stack a bunch of them like you used to be able with older cards.
It's "easy" if you have a place to build an open frame rig with riser cables and whatnot. I can't do that, so I'm going the single slot waterblock route, which unfortunately rules out 3090s due to the memory on the back side of the PCB. It's very frustrating.
I think parents point is that NVLink no longer ships with consumer cards. Before you could buy two cards + a cable between them, and software can treat them as one card. Today you need software support for splitting between the cards, unless you go for "professional" cards or whatever they call them.
Maybe that's what they meant, and it'd be cool if nvidia still offered that on consumer cards, but thankfully you don't need it for LLM inference. The traffic between cards is very small.
Isn't the issue that the software needs to explicitly add support for it now, compared to yester-yesterday when you could just treat them as one in software?
There was a rumor that 5090 or 5090D for China may or may not come with multi-GPU software locked. I think GP's referring to that. It's not clear if it is the case with retail cards.
I honestly don’t know why people aren’t more upset by this and still get on their knees for Nvidia. They made the decision specifically to cripple consumer card memory because they didn’t like data centers were using them instead of buying their overpriced enterprise cards that were less performant. They removed NVLink because people were getting better performance out of their two $400 cards than the $1,500 cards Nvidia was trying to peddle. They willfully screw consumers and people love them for it.
It buys you approximately two days (with reservation discount) of a single p5.48xlarge instance, which has 2TB of RAM, and 640GB of VRAM in 8x H100 cards. In fact that is the pricing example they use: https://aws.amazon.com/ec2/capacityblocks/pricing/
MI300X (RunPod), 192GB RAM
Hourly Rate: $2.49/hr.
Break-even Point: You can rent for 2,410 hours (~100 days of continuous use) before reaching the cost of the $6,000 Mac. Macs top out at 192GB, not 2TB ;)
Consideration: If your AI training requires sporadic use (e.g., a few hours daily or weekly), renting is significantly cheaper.
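The break-even arithmetic spelled out, using the numbers quoted above:

    mac_price   = 6000.00   # USD, 192GB Mac config
    hourly_rate = 2.49      # USD/hr, MI300X on RunPod
    hours = mac_price / hourly_rate
    print(round(hours), round(hours / 24))   # ~2410 hours, ~100 days of continuous use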
MI300X will also get you results many times faster too, so you could probably multiply that 100 days!
I disagree with cancelling the OpenAI subscription. I've been getting some help from o1 for both Python and PHP recently, and o1 was doing massively better for the Python stuff (its code ran; DeepSeek's didn't, and won't even with prompt refinement).
Also, for some philosophical stuff DeepSeek just won't do it. I'm working on an essay about spirituality, and sometimes it just responds that it doesn't know how to work on those types of problems and we should do something fun like math or games. Claude tends to reply with something more like "I have to be honest with you, reincarnation is not real," and ChatGPT doesn't seem to care about that kind of thing at all.
Just don’t ask it about anything related to Tiananmen square or president Pooh..
I'd guess they did quite a bit of fine-tuning to censor some more sensitive topics, which probably impacts the output quality for other non-technical subjects.
Why even bother decensoring it (except academic curiosity ig)? There are a million other ways you can learn about those subjects.
The people making the model probably don't really give a shit about politics and just did the minimum to avoid being embarrassed, but if people start jailbreaking it they will be forced to care.
IIRC thezvi's summary post on R1 mentioned that R1 is amazing for general reasoning and is very clearly a successful proof of concept/capability but a lot of effort seems to have been put into making o1 Good At Code as a practical matter, whereas R1 seems to have been more a research project which proved out the approaches and then was released without sanding the rough edges off because that wasn't the point.
While 192GB of RAM is appealing, it's also quite expensive at $6,000. For that price I'd rather buy a system with 5 used 3090s, which, while being "only" 120GB of VRAM, benefits from much faster tokens/s and prompt processing speed (the Macs are notoriously slow at consuming large contexts).
I think just getting Nvidia's Project Digits might be the best option. A lot of people were underwhelmed when it was announced, but I think now it could be just the thing for people building their own AI home servers.
Honestly, if you have a residence of some kind and an Internet connection, you don't need to bring your beefy computer with you everywhere. It is cool to be able to have ridiculously powerful mobile computers, but I don't think I would ever be willing to take a $6,000 laptop anywhere it has a decent chance of being stolen.
Laptops get stolen on a train? An enclosed, single-direction space that only occasionally allows you to exit between infrequent, long-distance stops? A thing that contains ticket inspectors and a literal guard?
How many laptops have you personally seen be stolen on a train?
You mean a tight, enclosed, single-direction space, crowded with people who are tired, and/or trying to relax, and/or thinking about the destination, and/or otherwise not particularly focused after hours of travel; a thing that contains ticket inspectors who show up every now and then to check tickets, and from which passengers embark and disembark at dozens of points along the line, simultaneously, with no supervision or security checks.
Depending on the train type and configuration, many actually seem like pickpocket paradise.
Pickpocketing is a very different proposition. They rely on a lack of awareness, taking your wallet and being long gone before you've even noticed. If someone steals your laptop from in front of you without you even noticing, I'd suggest that one is on you.
FWIW I’ve used my laptop on the train plenty, I’ve never had anything stolen nor felt in any danger of it.
...why would I ever do that? You leave something worth several thousand dollars anywhere in public you're risking losing it. What are we even debating here?
You might have seen some laptops have screens that fold down, I know MacBooks do. This "clam shell" effect protects the keyboard, trackpad, and even the screen from bumps and jostles. Many laptops when so closed can even fit in a backpack.
So a little trick I figured out is to close my laptop lid and then slide it into a pocket of my backpack. I can then carry it with me when I get up and move around.
So then I can take it with me to eat lunch or an extended toilet break. Maybe some day all laptops will have that feature.
Yes, all the time. It's happened to two people I know, in France and in the US.
People get up to use the bathroom or the cafe car, the laptop is left behind for ten minutes, one of the train stops is while they're away from their seat, and someone sees an opportunity, snags it, and gets off at the stop.
This is an actual thing. And if it's worth a thousand bucks then it's very much worth getting off at an earlier stop than you'd planned, and continuing your journey on the next train.
Ticket inspectors or guards are irrelevant. There isn't one in your car 99% of the time.
I don't know why you're trying to argue that laptop theft on trains in first-world countries isn't a thing. It absolutely is.
Different regions of the world see different degrees of responsibility regarding theft. I would consider it absurd to leave something valuable unattended in a public space, considering the effort required to avoid that (that is: taking it with you).
So, yes, theft on trains is a thing for people who think they are 100% safe, but applying the same idea (assuming something is 100% safe and not being cautious), I wonder how such people use the internet...
My coworker was having coffee and using his work laptop at an outdoor coffeeshop in Mountain View, CA. Someone on a bike rode by and attempted to grab his phone and bike off with it.
The attempted thief didn't succeed in taking the phone, but did knock the laptop onto the ground, damaging it.
The discussion was about leaving unattended valuable objects in public places. Sure, a theft can happen even if attended, or using violence, but I personally avoid increasing the chance of having something stolen by leaving it unattended.
If I were to compile statistics on the primary cause of ending up without a laptop among people I know, the biggest danger is liquids in glasses (which end up on the laptops)...
You're going to take your laptop with you into the toilet on the train...?
I don't think I've ever seen a human being do that before on a train. Not to go to the toilet, nor to grab a coffee in another car.
You can't be paranoid about everything. My friend in France had put his laptop back into his bag where it wasn't visible and assumed that was good enough, but someone must have seen him do it and just took the whole bag.
You are applying a totally unreasonable standard, to suppose that the thefts were due to unreasonable carelessness. What, do you think someone should take their large luggage into the bathroom too, every time they need to pee?
Yes, if I go to the toilet I take my backpack/small bag with me, because I usually have valuable stuff in it and it's easy to carry. This does not apply to a large bag (in which I don't put valuable stuff).
The standard is mine and I follow it. The same way I find absurd not to do it, you find it unreasonable to do it.
I find the expectation that things won't be stolen (if left unsupervised in public places) strange, considering the huge wealth inequalities around even in civilized countries. I do not agree with the idea of stealing, and thieves should be punished, but expecting everybody "to behave" given the situation seems unrealistic to me.
That does not mean that I think that things are stolen 100% of the time. I have a friend that forgot a laptop on a bus (Netherlands) and the driver found it at the end of the line and gave it to lost objects so my friend got it back.
I mean, that's great for you, but it's not just what 99% of people do. You don't usually see people take their backpack into a train bathroom. I've taken a lot of trains and sat near the bathroom often enough (unfortunately). But like I said, it applies to the cafe car too.
If you find it absurd how 99% of people act on long-distance trains, I don't know what to tell you.
Ok - that's really poor opsec. If I'm going to the bathroom in a train with my laptop (whether it's expensive or not - it has access to all my stuff - which is arguably more valuable), I'll sleep it, put it in my backpack and take the backpack to the bathroom with me.
My work policies state you simply cannot leave your laptop out of sight for any period unless it's in a secure location (work|home). I feel the same way for my personal laptop as well.
You don't hear much about laptop thefts these days because phones are more valuable, more numerous, and much easier to steal.
Obviously, nobody steals things while the train is in motion. They wait until the train is about to leave the station, snatch a phone or handbag and jump out just as the door is closing. The train leaves, the thief blends in with other passenger leaving the station, and by the time news of the theft has made it from the passengers to the driver to the station staff the thief is long gone.
Of course people drive around $6,000+ cars all the time, so....
> Obviously, nobody steals things while the train is in motion.
Something interesting: I live near a train line where the doors are not automatic (they have to be opened manually on each stop), and there have been incidents where people get pickpocketed while the train is still in motion, and the thief jumps out right before the station, when the train has slowed down significantly but is still in motion. Many people have been hurt doing this.
Only on Hacker News would I have someone arguing with me that laptop theft is not a concern. You know what, you win. It's your $6,000 laptop, not mine.
A $6000 laptop doesn’t look much different than a $1000 laptop. I don’t think it’s a bigger theft risk than any other laptop.
Make sure the laptop is insured and that full disk encryption is enabled. If it’s a Mac, make sure you have it in Find My so you can wipe it remotely if that’s something you worry about.
Honestly, I didn't bother making a better case for why I wouldn't want a $6,000 laptop in large part because the nerve people have to argue that theft isn't a concern at all made me stubborn. Theft is one reason, but a laptop is also a hell of a lot easier to simply break or lose than a desktop that is permanently installed somewhere, and a desktop is more upgradable and repairable, with typically much more I/O.
Today's baseline laptops are really good as it is. 32-64 GiB of RAM is plenty, and at least on PC laptops you can do it fairly cheaply. Apple has been a consistent year or two ahead in mobile CPU performance but it fell out of my consideration ever since I realized the M1 and 7040 were both very sufficient for any local computation I cared about. (I'm not going to say I'd specifically go for less efficiency or performance, but it has become significantly lower priority over other things like repairability.)
Not really specifically hating on Apple, here. If I was going to get another Mac it'd be a Mac Mini or Mac Studio probably, ideally with a third-party SSD upgrade to both save on costs and get a slight bit of extra drive performance too. I've definitely considered it, even though I am very far from an Apple fan, just due to the superior value and efficiency they have in many categories.
For what it's worth, I never once insinuated that a laptop would get stolen on a train, only that I wouldn't want to bring such a laptop into the public in the first place. (Presumably, the laptop doesn't come into and exit existence upon entering and exiting the train, so this remains somewhat of a concern even if trains are involved.)
But yes, you're right. I've never personally seen a laptop get stolen. In fact, most people who have their laptop get stolen never see their laptop get stolen either.
I have, however, had coworkers who've had their laptops stolen. Multiple times.
You can use a desktop computer on a train if it's one with power outlets. Might get some funny looks, but I've seen it happen (or at least pictures). :)
>> While 192GB of ram is appealing, it's also quite expensive at $6000.
That's because it's Apple. It's time to start moving to AMD systems with shared memory. My Zen 3 APU system has 64GB these days, and it's a mini-ITX board.
For personal usage, does it matter though? In most places residential electricity is cheap compared to everything else. In a DC context I feel it matters a lot more compared to the capex.
When running inference workloads via something like llama.cpp, only 1 GPU is ever used at a time, so you would have 1 active GPU and 4 idle GPUs. That should make the power usage less insane in practice than you expect.
I think the last time any of my computers had a case was back when I realized the pair of 9800 GX2 cards I was running was turning my computer into an Easy-Bake oven.
The good thing is that since MoEs are mainly memory bound, we just need (VRAM + RAM) to be in the range of 80GB or so - in my tests that gets at least 5 or so tokens/s.
It's better to get (VRAM + RAM) >= 140GB for at least 30 to 40 tokens/s, and if VRAM >= 140GB, then it can approach 140 tokens/s!
Another trick is to use more than 8 experts per pass - it'll be slower, but might be more accurate. You could even try reducing the number of experts to, say, 6 or 7 for low-FLOP machines!
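A rough size check behind those memory thresholds (treating the whole 671B model as ~1.58 bits per weight, which is only an approximation since some layers are kept at higher precision):

    params = 671e9
    bits_per_weight = 1.58
    print(params * bits_per_weight / 8 / 1e9)   # ~132 GB, in line with the ~131GB 1.58-bit GGUF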
Yes, shared memory is a pretty big leg up since it lets the GPU process the whole model even if the bandwidth is slower which still has some benefits.
Apple's M chips, AMD's Strix Point/Halo chips, Intel's Arc iGPUs, Nvidia's Jetsons. The main issue with all of these though is the lack of raw compute to complement the ability to load insanely large models.
So I'm thinking, inference seems mostly memory bound. With a fast CPU (for example a 7950X with 16 cores) and 256GB of RAM (which seems to be the max), shouldn't that give you plenty of ability to run the largest models (albeit a bit slowly)?
It seems that AMD EPYC CPUs support terabytes of RAM, and some are as cheap as 1000 EUR. Why not just run the full R1 model on that? It seems it would be much cheaper than a bunch of those insane Nvidia cards.
The bottleneck is mainly memory bandwidth. AMD EPYC hw is appealing for local inference because it has a higher memory bandwidth than desktop gear (because 8-12 memory channels vs 2 on almost everything else), but not as fast as the Apple architectures and nowhere near VRAM speeds. If you want to drastically exceed ~3-5 tokens/s on 70b-q4 models, you usually still need GPUs.
This was beautifully illustrated in the recent Phoronix 5090 LLM benchmark[1], which I noted here[2]. The tested GPUs had an almost perfect linear relationship between generated token/s and GB/s memory bandwidth, except the 5090 where it dipped slightly.
I guess the 5090 either started ever so slightly to become compute limited as well, or hit some overhead limitation.
It’s more expensive, but Zen4 Threadripper Pro is probably the way to go on that front. 8 memory channels, with DIMMs available up to DDR5-7200 for 8x32GB (256GB), or DDR5-6800 for 8x48GB (384GB). It’ll set you back ~$3k for the RAM and ~$6k for a CPU with 8 CCDs (the 7985WX, at least), and then ~$1k for motherboard and however much you want to spend on NVME. Basically ~$10k for a 384GB DDR5 system with ~435GB/s actual bandwidth. Not quite as fast as the 192GB Apple machines, but twice as much memory and more compute for “only” a few thousand more.
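Where that ~435GB/s figure comes from (theoretical peak for 8 channels of DDR5-6800, 8 bytes per transfer per 64-bit channel):

    channels = 8
    transfers_per_s = 6800e6    # DDR5-6800
    bytes_per_transfer = 8      # 64-bit channel
    print(channels * transfers_per_s * bytes_per_transfer / 1e9)   # ~435 GB/s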
Is it confirmed that you can get 256GB of VRAM for that amount? Because my understanding is that Digits pricing will start at $3k for some basic config.
Ok, but it is not clear what kind of RAM that is, how many memory channels, etc. If the goal is to have just 128GB of some RAM, then it could be achieved by paying a few hundred dollars.
Fine, but at that point you're arguing about the concept of the product. It's billed as a computer for AI and you're saying that it might not be more suitable for AI than a regular PC.
On Zen5 you also get AVX512 which llamafile takes advantage of for drastically improved speeds during prompt processing, at least. And the 12 channel Epycs actually seem to have more memory bandwidth available than the Apple M series. Especially considering it's all available to the CPU as opposed to just some portion of it.
Maybe EPYC can make better use of the available bandwidth, but for comparison I have a water cooled Xeon W5-3435X running at 4.7GHz all-core with 8 channels of DDR5-6400, and CPU inference is still dog slow. With a 70B Q8 model I get 1 tok/s, which is a lot less than I thought I would get with 410GB/s max RAM bandwidth. If I run on 5x A4000s I get 6.1 tok/s, which makes sense... 448GB/s / 70GB = 6.4 tok/s max.
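The rule of thumb behind those estimates: each generated token has to stream roughly all the weights once, so peak tokens/s is about memory bandwidth divided by model size (a sketch that ignores KV cache and other overhead):

    def peak_tok_per_s(bandwidth_gb_s, model_size_gb):
        # Bandwidth-bound upper limit for dense-model decoding.
        return bandwidth_gb_s / model_size_gb

    print(peak_tok_per_s(448, 70))   # 5x A4000 with a 70GB Q8 model: ~6.4 tok/s
    print(peak_tok_per_s(410, 70))   # 8-channel DDR5 Xeon: ~5.9 tok/s theoretical vs 1 tok/s observed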
FWIW Threadrippers go up to 1TB and Threadripper Pro up to 2TB. That's even in the lowest model of each series. (I know this because it happens to be the chip I have. Not saying you shouldn't go for Epyc if it works out better.)
Have you tried running the full R1 model with that? People in sibling comments mention high-end EPYCs for a 10K machine, but I'm curious whether it's possible to make a 1-2K machine that could still run those big models simply because they fit in RAM.
Wow, an 80% reduction in size for DeepSeek-R1 is just amazing! It's fantastic to see such large models becoming more accessible to those of us who don't have access to top-tier hardware. This kind of optimization opens up so many possibilities for experimenting at home.
I'm impressed by the 140 tokens per second speed with the 1.58-bit quantization running on dual H100s. That kind of performance makes the model practical for small or mid sized shops to use it for local applications. This is a huge win for people working on agents that require low latency that only local models could support.
Btw completely off topic, but your comment triggered the internal classification in my brain, and it looks like AI-generated.
Not accusing you anything. Could be that you happen to write in a way similar to LLMs. Could be that we are influenced by LLM writing styles and are writing more and more like LLMs. Could be that the difference between LLM generated content and human-generated content is getting smaller and harder to tell.
Haha, you got me. I'm a real person using an LLM to proofread the stuff I write. English is not my native language and I'm trying to improve my written vocabulary a little bit. Sorry if it reads a little too off.
> Unfortunately if you naively quantize all layers to 1.58bit, you will get infinite repetitions in seed 3407: “Colours with dark Colours with dark Colours with dark Colours with dark Colours with dark” or in seed 3408: “Set up the Pygame's Pygame display with a Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's”.
This is really interesting insight (although other works cover this as well). I am particularly amused by the process by which the authors of this blog post arrived at these particular seeds. Good work nonetheless!
Would be great to have dynamic quants of the V3 (non-R1) version, as for some tasks it is good enough. It would also be very interesting to see the degradation with dynamic quants on small/medium-size MoEs, such as older DeepSeek models, Mixtrals, or IBM's tiny Granite MoE. Would be fun if the Granite 1B MoE were still functional at 1.58-bit.
Oh yes, one could provide a repetition penalty for example - the problem is it's not just repetition. I find it rather forgets what it already saw, and so it repeats stuff - it's probably best to backtrack, then delete the last few rows in the KV cache.
Another option is to employ min_p = 0.05 to force the model not to generate low-probability tokens - it can help especially since the 1.58-bit model generates an "incorrect" token (e.g. `score := 0`) around once every 8,000 tokens or so.
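For reference, a simplified sketch of what a min_p sampler does (the real implementations live in llama.cpp and friends; this only shows the filtering rule):

    import numpy as np

    def min_p_filter(probs: np.ndarray, min_p: float = 0.05) -> np.ndarray:
        # Drop tokens whose probability is below min_p * (top token's probability),
        # then renormalize what's left.
        threshold = min_p * probs.max()
        kept = np.where(probs >= threshold, probs, 0.0)
        return kept / kept.sum()

    probs = np.array([0.70, 0.25, 0.04, 0.01])   # toy next-token distribution
    print(min_p_filter(probs))                   # the 0.01 token falls below 0.035 and is zeroed out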
You likely mean sampler, not decoder. And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy. If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.
> And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy.
Of course you can't recover any accuracy, but LLMs are in fact prone to this kind of repetition no matter what; this is a known failure mode, which is why samplers aimed at avoiding it have been designed over the past few years.
> If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.
But it won't necessarily be gibberish! Even a highly quantized R1 still has much more embedded information than a 14B or even 32B model, so I don't see why it should output more gibberish than smaller models.
As someone who is out of the loop, what’s the verdict on R1? Was anyone able to reproduce the results yet? Is the claim that it only took $5M to train generally accepted?
It’s a very bold claim which is really shaking up the markets, so I can’t help but wonder if it was even verified at this point.
No, because the market is an aggregate of opinions, so it’s entirely fair to say it’s “generally accepted.” That has nothing to do with whether something happens to be true or not.
It may provide a financial opportunity for someone who disagrees with that aggregated opinion though.
My argument isn't fallacious - it is logical: we can learn from or use evidence from something without presuming it is all-knowing. You are putting words in others' mouths that they did not say.
I'm sorry, I thought you introduced the "all-knowing" out of nowhere, but this was indeed mentioned by willsmith72. I'd missed that.
Still, his implied assertion that markets can often behave irrationally, and can't be used as evidence on technical matters, seems pretty valid to me.
But I suppose you could see it as a sign that something is at least temporarily "generally accepted" among investors. That doesn't mean it's generally accepted among AI researchers, though.
Although I thought it was $6M rather than $5M, and that that was only the last step, not the total investment. What does seem to be generally accepted among investors is that this isn't good news for NVidia's profits, but that still doesn't mean that all the specific facts are generally accepted.
Based on the information and background they gave when releasing their research, it's pretty easy to put together that it did take them significantly fewer resources to train this model. Only having specific parameters active at a time instead of activating everything all at once is pretty ingenious.
That, and they just happened to be undergoing a large-scale "cyber attack".
I'm not sure I see the bear argument for NVidia here. Huge AI models certainly drive NVidia sales, but huge AI models are also widely thought to be untrainable and nearly un-runnable save for large datacenters.
To me, this is ripe for an application of the Jevons paradox. If architectural improvements make similar models cheaper, I would expect to see more of them trained and deployed, not fewer, ultimately increasing the market for GPU-like hardware.
They claimed that it only took $5 million of GPUs to train Deepseek v3, which was the base model. They did not claim that the total costs were $5 million. They omitted the costs of additional hardware, electricity, personnel, training dataset acquisition, etcetera. They likely spent tens of times more on this at a minimum.
That said, what they did with $5 million of GPUs is impressive. Reportedly, they resorted to using PTX assembly to make it possible:
I think the jury is still out. With folks trying to replicate the process, we will see whether the low budget claim is true or not. I am still on the fence; there were comments from Scale's CEO that they have a huge number of H100s they used. On the market side, I think regardless of whether this was true or not, it gave people the opportunity to sell what are perhaps overinflated valuations.
That's likely only the marginal cost of training this model, and doesn't include a lot of other costs, like the datacenters and GPUs themselves which they already had and also the staff.
If they aren't lying because they have hardware they're not supposed to have, which is also a possibility.
These claims are getting more wrong every time I see them - a weird game of telephone going around tech circles.
The cost absolutely includes the cost of GPUs and data centers: they quoted a standard price for renting H800s, which has all of this built in. But yes, as very explicitly noted in the paper, it does not include the cost of test iterations.
Ryzen 5500 + 7x 3060 + cooling ~= 1.6 kW off the wall, at 360 GB/s memory bandwidth, and considering your lane budget, most of it will be wasted on single PCIe lanes. The after-market unit price of a 3060 is 200 EUR, so 1600 is not a good-faith cost estimate.
From the looks of it, your setup is neither low-power nor low-cost. You'd be better served with a refurbished Mac Studio (2022) at 400GB/s of bandwidth fully utilised over 96GB of memory. Yes, it will cost you 50% more (considering the real cost of such a system is closer to 2000 EUR), however it would run at a fraction of the power use (10x less, more or less).
I get that hobbyists like to build PCs, but claiming that sticking seven five-year-old, low-bandwidth GPUs in a box is "low power/low cost" is a silly proposition.
The issue is that you are taking max GPU power draw as a given. Running an LLM does not tax a GPU the same way a game does. There is a rather well-known YouTuber who ran LLMs on a 4090, and the actual power draw was only 130W on the GPU.
Now add that this guy has 7x 3060s = almost certainly an ex-mining setup. So you know that he is running an optimized (underclocked) profile.
FYI, my gaming 6800 draws 230W, but with a bit of undervolting and sacrificing 7% performance, it runs at 110W for the exact same load. And that is 100% taxed. This is just a simple example to show that a lot of PC hardware runs very much overclocked/unoptimized out of the box.
Somebody getting down to 520W sounds perfectly normal for undervolted cards that give up maybe 10% performance for big gains in power draw.
And no, old hardware can be extremely useful in the right hands. Add to this that the main factor influencing LLM speed tends to be memory (how much you can fit, and the interconnects) rather than raw processing performance.
Being able to run a large model for 1600 sounds like a bargain to me. Also, remember, when you're not querying the models, the power draw is mostly the memory wakes + power regulators. Coming back to that YouTuber, he was not constantly drawing that 130W; it only spiked when he ran prompts or did other activity.
Yes, running from home will be more expensive than a $10 Copilot plan, but... nobody is looking at your data ;)
Thanks for the clarification. Sure, if I run a hashcat benchmark the power consumption goes to nearly 1400 watts, but I also limited the max power consumption of each card to 100 watts, which worked out better than limiting the max GPU frequency. To be fair, most of the speed comes from the RAM frequency - as long as that is not limited, it works out great.
I took a fair amount of time to get everything to a reduced power level and measured several LLM models (and hashcat for the extreme case) to find the best speed per watt, which is usually around 1700-1900 MHz, or limiting the 3060s to 100-115 watts.
If I had planned it all from the start, I might have gotten away with a used Mac Studio, that's right. However, I incrementally added more cards as I moved further into exploration.
I didn't want to confront anyone, but it looks like you either show off 4x 4090s or you keep silent.
I am amazed these days at people lacking knowledge about hardware, and about the massive benefits of undervolting/power-limiting it. It's like people do not realize that what is sold is often overclocked/running too high a vcore. The number of people I see buying insanely overspec'd PSUs makes me go O_o...
How is your performance with the different models on your setup?
"Undervolting" is a thing for 3090s where they get them down from 350 to 300W at 5% perf drop but for your case it's irrelevant because your lane budget is far too little!
> know Youtuber, that ran LLMs on a 4090, and the actual power draw was only 130W on the GPU.
Well, let's see his video. He must be using some really inefficient backend implementation if the GPU wasn't utilised more than that.
I'm not running e-waste. My cards are L40S and even in basic inference, no batching with ggml cuda kernels they get to 70% util immediately.
Would be great if the next generation of base models were designed to run inference within 128GB of VRAM when 8-bit quantized (which would fit in the consumer hardware class).
For example, I imagine a strong MoE base with 16 billion active parameters and 6 or 7 experts would keep a good performance while being possible to run on 128GB RAM macbooks.
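Rough sizing for that hypothetical MoE (every number here is an assumption, and treating total parameters as active-params times experts is a crude upper bound, since attention and shared layers aren't duplicated per expert):

    active_params = 16e9
    n_experts     = 7
    total_params  = active_params * n_experts        # crude upper bound
    print(total_params * 1.0 / 1e9, "GB at 8 bits")  # ~112 GB, inside a 128GB machine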
So I remember DeepSeek used float8 for training, and Character AI also used int8 for training. It is indeed possible, but training can sometimes be unstable. DeepSeek is, to my knowledge, the first lab to use float8 at a large scale without causing loss spikes: they used the FP8 tensor cores, then every 4th matrix multiply accumulated into an FP32 accumulator, since it seems the Hopper tensor cores' accumulation mechanism might not be true FP32 accumulation. I wrote more here: https://x.com/danielhanchen/status/1872719599029850391
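For intuition, here's a tiny sketch of that periodic-promotion idea (not DeepSeek's kernel; float16 stands in for FP8 since numpy has no FP8 dtype):

    import numpy as np

    def chunked_matmul_fp32_accum(A, B, chunk=128, promote_every=4):
        """Multiply A @ B by splitting the K dimension into chunks.
        Partial sums stay in low precision (float16 here, standing in for FP8)
        and are flushed into a float32 accumulator every `promote_every` chunks,
        which limits how much rounding error can build up."""
        K = A.shape[1]
        acc32 = np.zeros((A.shape[0], B.shape[1]), dtype=np.float32)
        partial16 = np.zeros_like(acc32, dtype=np.float16)
        for i, k0 in enumerate(range(0, K, chunk)):
            a = A[:, k0:k0 + chunk].astype(np.float16)
            b = B[k0:k0 + chunk, :].astype(np.float16)
            partial16 += a @ b                        # low-precision accumulation
            if (i + 1) % promote_every == 0:
                acc32 += partial16.astype(np.float32) # flush into FP32
                partial16[:] = 0
        acc32 += partial16.astype(np.float32)         # flush the tail
        return acc32

    A = np.random.randn(64, 1024).astype(np.float32)
    B = np.random.randn(1024, 64).astype(np.float32)
    # Modest error vs. a full FP32 matmul, despite the low-precision partials.
    print(np.max(np.abs(chunked_matmul_fp32_accum(A, B) - A @ B)))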
Would be great, but unfortunately I think intelligence at that compute scale will be limited by hardware, not the model. Though at the hardware limit I would expect it to be roughly human level, especially if optimized for a particular domain.
I remember that Llama 3 was trained on data curated by Llama 2 and it resulted in a model with a significant performance boost (even though it was trained by a previous generation model of the same size).
Maybe using a strong reasoning model such as R1 the next generation, even more performance can be extracted from smaller models.
That's already happening, and is in fact even part of the R1 training pipeline. An intermediate small reasoning model churns out training data for RL on a larger model, rinse and repeat. DeepSeek also showed that model distillation with synthetic reasoning data works quite well.
It’s a pretty neat paradigm and I see an abstract connection to how brains dream and produce their own synthetic training data while sleeping that supplements their real data used while awake.
Danielhanchen, your work is continually impressive. Unsloth is great, and I'm repeatedly amazed at your ability to get up to speed on a new model within hours of its release, and often fix bugs in the default implementation. At this point, I think serious labs should give you a few hours' head start just to iron out their kinks!
The size reduction while keeping the model coherent is incredible. But I'm skeptical of how much effectiveness was retained. Flappy Bird is well known and the kind of thing a non-reasoning model could get right. A better test would be something off the beaten path that R1 and o1 get right that other models don't.
The size reduction is impressive but unless I missed it, they don't list any standard benchmarks for comparison so we have no way to tell how it compares to the full-size model.
> DeepSeek-R1 has been making waves recently by rivaling OpenAI's O1 reasoning model while being fully open-source.
Do we finally have a model with access to the training architecture and training data set, or are we still calling non-reproducible binary blobs without source form open-source?
I assume when people say "open source model" they mean "open weights model". The "open source" term doesn't really make sense here, since machine learning models are not compilations of source code. (Though DeepSeek has published several papers with details on their training process. It's more than just open weights.)
If I invested in a 100x machine because I needed 100 of x to run, and somebody shows how 10x can work, why have I not just become the holder of 10 10x machines, and therefore have already achieved capex to exploit this new market?
I cannot understand why "openai is dead" has legs: repurpose the hardware and data and it can be multiple instances of the more efficient model.
you invest in a 100x machine expecting a revenue of X, but now you can only charge X/100 because R1 shows that AI inference can be done much more efficiently. see the price decrease of ChatGPT and addition of free O3 etc.
this reduction of future cash flows, ceteris paribus, implies that the present value of these cash flows decrease. this then results in massive repricing to the downside as market participants update their forecasts.
what you are missing is that to assume as you do, you must make the additional assumption that demand for additional compute is infinite. Which may very well be the case, but it is not guaranteed compared to the present realized fact that R1 means lower revenues for AI inference providers -> changes the capex justification for even more hardware -> NVDA receives less revenue.
:) It's my goto test :) I did amp it up by adding 10 conditions and made a scoring card - I found the original R1 to sometimes forget "import os" or miss some lines as well, so I thought it was at least a good check!
I also like to ask the models to create a simple basic Minecraft type game where you can break pieces and store them in your inventory, but disallow building stuff
I feel any AI can fix those problems once it can finally act. The problem is that AIs cannot run or debug code, or even book a hotel for me. When that is solved and an AI can interact with the code like a human does, it can fix its problems like a human does.
My understanding is with MoE (Mixture of Experts), you can and should shard it horizontally. The whole model is 600GB, but only 37GB is active during the evaluation of any single output token.
So you can load a different active subset of the MoE into each 80GB GPU, sharding it across something like 32 different GPUs (or can you get away with fewer? I wouldn't be surprised if they can infer on 8x H800 GPUs). Some parameters are common, others are independent. Queries can be dynamically routed between GPUs, potentially bouncing between GPUs as often as once per output token, depending on which experts they need to activate.
Though, I suspect it's normal to stick on one MoE subset for several output tokens.
This has a secondary benefit that as long as the routing distribution is random, queries should be roughly load balanced across all GPUs.
Each MoE layer has its own router, and it activates 8 (out of 256) experts at a time. There's no reason to expect all of them to stay on the same GPU, so you're pretty much guaranteed to have to do all-to-all communication between the GPUs in your cluster after every layer for every token.
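For intuition, here's a toy sketch of what such a per-layer top-k router does (plain softmax gating with illustrative sizes; the real model also has shared experts and load-balancing details):

    import numpy as np

    # Toy per-layer MoE router: score all experts, keep the top k, renormalize
    # their gate weights, and combine only those experts' outputs.
    def route(hidden, router_weights, k=8):
        logits = hidden @ router_weights               # one score per expert
        topk = np.argsort(logits)[-k:]                 # indices of the k best experts
        gates = np.exp(logits[topk] - logits[topk].max())
        gates /= gates.sum()                           # softmax over the chosen k
        return topk, gates

    rng = np.random.default_rng(0)
    hidden = rng.standard_normal(7168)                 # one token's hidden state (size illustrative)
    router_weights = rng.standard_normal((7168, 256))  # 256 routed experts
    experts, gates = route(hidden, router_weights)
    print(experts, gates.round(3))                     # which 8 experts fire, and their mix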
I had assumed the performance advantage for MoE came from minimising traffic between GPUs. But if it's per layer routing, then it's going to massively increase inter-gpu traffic compared to vertical slicing.
I guess that means the performance advantage actually comes when batching thousands of queries? The MoE routing would mean that on each MoE layer, each GPU shard gets a batch of queries that will all hit roughly the same subset of experts (and read the same weights from memory). The batches then shuffle between each MoE layer to re-optimise.
It's kind of like GPU raytracing where you get large performance gains by running coherency sorting on rays and batching similar rays together.
The performance gain comes mostly from the fraction of memory bandwidth needed, as LLMs are mostly memory constrained. Compute matters too, but usually far less than memory.
There are a few ways - the most basic is per layer sharding - DeepSeek uses 3 dense layers, so that can stay on GPU0 (with the embedding layer). There's 58 MoE layers (256 experts, 8 activated) and 1 shared expert per layer. GPU1 would house layers 3 to 9, and so on.
Then by using pipeline parallelism, if a new request comes, we simply stick them in a queue - GPUs 0, 1, 2, ..., 8. Request A is at GPU 2, Request B at GPU 1, Request C at GPU 0 and so on.
The other option is tensor parallelism, where we split the weights evenly. You could combine pipeline and tensor parallelism as well!
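A rough sketch of that per-layer (pipeline-parallel) assignment, using the layer counts above (the 8-GPU split and the exact boundaries are just for illustration):

    # Hypothetical layer-to-GPU layout: 3 dense layers + 58 MoE layers.
    def assign_layers_to_gpus(n_dense=3, n_moe=58, n_gpus=8):
        assignment = {}
        # Dense layers (and the embedding) stay on GPU 0.
        for layer in range(n_dense):
            assignment[layer] = 0
        # Spread the MoE layers roughly evenly over the remaining GPUs.
        moe_per_gpu = -(-n_moe // (n_gpus - 1))  # ceiling division
        for i in range(n_moe):
            assignment[n_dense + i] = 1 + i // moe_per_gpu
        return assignment

    layout = assign_layers_to_gpus()
    for gpu in range(8):
        layers = [l for l, g in layout.items() if g == gpu]
        print(f"GPU {gpu}: layers {layers[0]}-{layers[-1]}")

With that layout, a request flows GPU 0 -> 1 -> ... -> 7, so several requests can be in flight at once, one per pipeline stage.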
I love the original DeepSeek model, but the distilled versions are too dumb usually.
Apart from being dumber, they also don't know as much as R1. I can see how fine-tuning can improve reasoning capability (by showing examples of good CoT) but there's no reason that would improve the knowledge of facts (relative to the Qwen or Llama model on which the finetuning was based).
Using LM Studio, trying to load the model throws an error of "insufficient system resources."
I disabled this error, set the context length to 1024 and was able to get 0.24 tokens per second. Comparatively, the 32B distill model gets about 20 tokens per second.
And it became incredibly flaky, using up all available RAM and crashing the whole system a few times.
While the M4 Max 128GB handles the 32B well, it seems to choke on this. Here's to hoping someone works on something in-between (or works out what the ideal settings are because nothing I fiddled with helped much).
There's a terminal command to increase the maximum VRAM macOS can use; you can try that, as you're probably going over the limit and the system is resorting to treating it as system RAM. (I ran into this problem a couple of times using Ollama.)
Is this actually 1.58 bits (log base 2 of 3)? I heard of another "1.58-bit" model that actually used 2 bits instead. "1.6 bit" is easy enough: you can pack five 3-state values into a byte by using values 0-242. Then unpacking is easy; you divide and modulo by 3 up to five times (or use a lookup table).
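For what it's worth, the base-3 packing described above looks roughly like this (a generic sketch, not necessarily the scheme used in this release):

    # Pack five ternary values (0, 1, 2) into one byte as a base-3 number (0..242),
    # i.e. 1.6 bits per value; unpack with repeated divmod by 3.
    def pack5(trits):
        assert len(trits) == 5 and all(t in (0, 1, 2) for t in trits)
        value = 0
        for t in reversed(trits):
            value = value * 3 + t
        return value  # max is 3**5 - 1 = 242, so it fits in one byte

    def unpack5(byte):
        trits = []
        for _ in range(5):
            byte, t = divmod(byte, 3)
            trits.append(t)
        return trits

    assert unpack5(pack5([2, 0, 1, 1, 2])) == [2, 0, 1, 1, 2]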
Is this akin to the quants already being done to various models when you download a GGUF at 4 bits, for example, or is this variable layer compression something new that could also make existing smaller models smaller, so we can fit more into, say, 12 or 16 GB of VRAM?
Big fan of Unsloth, they have huge potential, but they could definitely use some experienced GTM people, IMO. The pricing page and the messages sent there are really not good.
Oh thanks :) Yes agreed we do need better GTM - temporarily it's still me and my brother running Unsloth, so for now we're just prioritizing many more engineering releases :)
This is an important step. Especially for beginners or people who are not in the loop, being able to type a few simple commands to download, install dependencies, compile, and run everything needed for an LLM gives a sci-fi feeling; it's almost like you can have a helping brain at home.
One thing I've been thinking about doing is to combine one of those LLM models running in llama.cpp, feed it with the output of whisper.cpp, and connect its output to some TTS model. I wonder how far we are from Wheels and Roadie from the Pole Position TV series.
Is there any good quick summary of what's special about DeepSeek? I know it's OSS and incredibly efficient, but news laymen are saying it's trained purely on AI info instead of using a corpus of tagged data... which, I assume, means it's somehow extracting weights or metadata or something from other AIs. Is that it?
It is going to be truly fucking revolutionary if open-source models are and continue to be able to challenge the state of the art. My big philosophical concern is that AI locks Capital into an absolutely supreme and insurmountable lead over Labour, and into the hands of oligarchs, and the possibility of a future where that's not the case feels amazing. It pleases me greatly that this has Trump riled up too, because I think it means he's much less likely to allow existing US model-makers to build moats, as I think he's -- even as a man who I don't think believes in very much -- absolutely unwilling to let the Chinese get the drop on him over this.
I have no doubt open source will catch up (it already has, eh?). At the end of the day, it's just creative new iterations on what is ultimately the transformer architecture... the amount of "secret" moat-like stuff that OpenAI was doing was bound to be figured out or exceeded eventually, like everything in tech...
Not to make fun of OpenAI and the great work they've done but it's kinda like if I went out in the 90s and said I'm going to found a company to have the best REST APIs. You can always found a successful tech company, but you can't found a successful tech company on a technological architecture or pattern alone.
Hi, small comment: please remember that in China many things are sponsored by or subsidized by the government. "We [China] can do it for less...", "it's cheaper in China..." only means the government gave us a pile of cash and help to get here.
And the United States subsidizes plenty of things too. For example the CHIPS act has $39 billion in subsidies for chip manufacturing on U.S. soil. There's nothing wrong with either country's subsidies. I personally don't believe in maximum free market. Government subsidy is more often than not a good thing and we need more of them both here and in China.
> Hi, small comment: please remember that in China many things are sponsored by or subsidized by the government. "We [China] can do it for less...", "it's cheaper in China..." only means the government gave us a pile of cash and help to get here.
And that's a really important strategic advantage China has versus America, which has such an insane fixation on pure(ish) free markets and free trade that it gives away its advantages in strategic industry after strategic industry.
Some people falsely infer from the experience with the Soviet Union that freer markets always win geopolitical competition, but that's false.
> And that's a really important strategic advantage China has versus America, which has such an insane fixation on pure(ish) free markets and free trade that it gives away its advantages in strategic industry after strategic industry.
> Some people falsely infer from the experience with the Soviet Union that freer markets always win geopolitical competition, but that's false.
The data we have is 500 years of free markets in the western world and the verdict is overwhelmingly: Yes, more freedom means more winning.
Just invite some incompetent bureaucrat over your house to dictate how you should cook and you'll quickly agree.
> The data we have is 500 years of free markets in the western world and the verdict is overwhelmingly: Yes, more freedom means more winning.
No, more freedom means more winning to a point. Past that point it does not, and I'd argue that's where the US is.
> Just invite some incompetent bureaucrat over your house to dictate how you should cook and you'll quickly agree.
That's supposed to be convincing, somehow? Just invite some "competent" capitalist over to your house, and he'll sell your fishing rod in exchange for a short-term discount on fish at the supermarket, and see how well you win.
An 80% size reduction is no joke, and the fact that the 1.58-bit version runs on dual H100s at 140 tokens/s is kind of mind-blowing. That said, I’m still skeptical about how practical this really is for most people. Like, yeah, you can run it on 24GB VRAM or even with just 20GB RAM, but "slow" is an understatement—those speeds would make even the most patient person throw their hands up.
And then there’s the whole repetition issue. Infinite loops with "Pygame’s Pygame’s Pygame’s" kind of defeats the point of quantization if you ask me. Sure, the authors have fixes like adjusting the KV cache or using min_p, but doesn’t that just patch a symptom rather than solve the actual problem? A fried model is still fried, even if it stops repeating itself.
On the flip side, I love that they’re making this accessible on Hugging Face... and the dynamic quantization approach is pretty brilliant. Using 1.58-bit for MoEs and leaving sensitive layers like down_proj at higher precision—super clever. Feels like they’re squeezing every last drop of juice out of the architecture, which is awesome for smaller teams who can’t afford OpenAI-scale hardware.
"accessible" still comes with an asterisk. Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup? For that price, I’d rather build a rig with used 3090s and get way more bang for my buck (though, yeah, it’d be a power hog). Cool tech—no doubt—but the practicality is still up for debate. Guess we'll see if the next-gen models can address some of these trade-offs.
Oh the repetition issue is only on the non dynamic quants :) If you do dynamic quantization and use the 1.58bit dynamic quantized model the repetition issue fully disappears!
Min_p = 0.05 was a way I found to counteract the 1.58bit model generating singular incorrect tokens which happen around 1 token per 8000!
min_p is great, do you apply a small amount of temperature as well?
The recommended temperature from DeepSeek is 0.6 so I leave it at that!
I think most of the model creators share their model usage examples so high at 0.6-0.7 simply because it's what a lot of the client apps use. IMO this is WAY too high unless you're doing creative writing.
Generally I set temp to 0-0.4 at absolute most.
min_p actually needs a little temperature to work effectively so with min_p I almost always use 0.2
Ye lower temp is also good :) Tbh its all trial and error - I found temp=1.5, min_p=0.1 to be very useful for pass@k type workloads - ie calling the LLM multiple times and aggregating.
temp=0 is also good for singular outputs. For classification tasks, it's better to actually inspect the logits.
But my goto setting is always setting min_p at least 0.01 or 0.05! It vastly suppresses incorrect rare random tokens from being created, and it helps massively!
Btw, min_p (the paper about the sampler) got accepted to ICLR! As 4th author it warms my heart to so it used so much in the wild.
Oh hi!! Congratulations on ICLR!!! min_p = 0.1 and temp = 1.5 is my default goto settings!!
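For anyone curious, this is roughly what the min_p filter does during sampling (a simplified sketch; real implementations chain it with other samplers and may order temperature differently):

    import numpy as np

    # Simplified min_p sampling: scale logits by temperature, drop any token whose
    # probability is below min_p times the top token's probability, renormalize,
    # and sample from what's left. This is what suppresses rare junk tokens.
    def sample_min_p(logits, temperature=1.5, min_p=0.1, rng=np.random.default_rng()):
        scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        keep = probs >= min_p * probs.max()   # the min_p cutoff
        probs = np.where(keep, probs, 0.0)
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)

    logits = [2.0, 1.9, 0.5, -3.0, -8.0]      # toy vocabulary of 5 tokens
    print(sample_min_p(logits))                # the -8.0 token is effectively never picked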
>Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup?
AMD strix halo APU will have quad channel memory and will launch soon so expect these kinds of setups available for much less. Apple is charging an arm and a leg for memory upgrades, hopefully we get competition soon. From what I saw at CES OEMs are paying attention to this use case as well - hopefully not following suite on RAM markups.
Keep in mind the strix halo APU has a 256 bit wide memory bus and the Mac Ultra has a 1024 bit wide memory bus.
Here's hoping the Nvidia Digit (GB10 chip) has a 512 bit or 1024 bit wide interface, otherwise the Strix Halo will be the best you can do if you don't get the Mac Ultra.
I mean it remains to be seen if it will be compute or bandwidth bound, I am sure mac ultra will also have double or triple compute as well.
But in either case it's going to do much better than currently available CPUs with easily upgradeable RAM. I would not be surprised to see 128GB configurations for around $3k (going off the announced ASUS g13 pricing of around $2k for the 32GB version, and them saying it will go up to 128GB).
At that point, sure, it might not compete with the Mac, but it's at a much more acceptable price point. It will not be a device you get just for the AI, but a mobile workstation that you can also run some local models on for normal money. Will need to wait and see. I know I am not buying anything from ASUS either way.
Unfortunately, Apple's RAM and storage upgrade prices are very much in line with other class-comparable OEMs.
I’m sure there’ll be some amount of undercutting but I don’t think it’ll be a huge difference on the RAM side itself.
Highly depends on how accessible AMD makes these boards, if a lot of OEMs get it there will be good deals for sure. DDR5 prices are nowhere near Apple markups.
The upgrade from 8GB to 16GB cost $200...in 2012!!!!
The newest Lenovo workstations that use LPDDR5x at 7467 MT/s get you a 16GB jump for the price Apple charges for an 8GB one.
2:1 isn't "class comparable" IMHO.
> That said, I’m still skeptical about how practical this really is for most people.
I'm running Open WebUI for months now for me and some friends as a front-end to one of the API providers (deepinfra in my case, but there are many others, see https://artificialanalysis.ai/).
Having 1.58-bit is very practical for me. I'm looking much forward to the API provider adding this model to their system. They also added a Llama turbo (also quantized) a few months back so I have good hopes.
Oh I love Open WebUI as well!! But glad to hear the 1.58bit version could be helpful to you!
At my work, we self-host some models and have found that for anything remotely similar to RAG or use cases that are very specific, the quantized models have proven to be more than sufficient. This helps us keep them running on smaller infra and generally lower costs
Personally I've noticed major changes in performance between different quantisations of the same model.
Mistral's large 123B model works well (but slowly) at 4-bit quantisation, but if I knock it down to 2.5-bit quantisation for speed, performance drops to the point where I'm better off with a 70B 4-bit model.
This makes me reluctant to evaluate new models in heavily quantised forms, as you're measuring the quantisation more than the actual model.
That's a fair point - the trick with dynamic quants is that we selectively choose not to quantize many components - i.e. attention is left at 4 or 6-bit, and just the MoE parts are 1.58-bit (-1, 0, 1)
There are distilled versions (Qwen 1.5B, 7B, 14B, 32B and Llama 8B, 70B), but those are distilled - if you want to run the original R1, then the quants are currently the only way.
But I agree quants do affect perf - hence the trick for MoEs is to not quantize specific areas!
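To make the idea concrete, a hypothetical per-tensor selection function might look like this (names and bit levels are illustrative, not Unsloth's actual code):

    # Hypothetical dynamic-quant selection: keep sensitive tensors (embeddings,
    # attention, down_proj) at higher precision and push only the MoE expert
    # weights down to ~1.58-bit ternary values.
    def choose_quant(tensor_name: str) -> str:
        if "embed" in tensor_name or "lm_head" in tensor_name:
            return "6-bit"
        if "down_proj" in tensor_name or "attn" in tensor_name:
            return "4-bit"
        if "experts" in tensor_name:       # the bulk of the parameters lives here
            return "1.58-bit"              # ternary weights: -1, 0, +1
        return "4-bit"                     # default for everything else

    for name in ["model.layers.10.self_attn.q_proj",
                 "model.layers.10.mlp.experts.3.up_proj",
                 "model.layers.10.mlp.down_proj"]:
        print(name, "->", choose_quant(name))

Because the MoE experts dominate the parameter count, quantizing just them aggressively is where most of the size reduction comes from.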
How are you doing your evals?
Being able to do semantic diffs of the output of the two models should tell you what you need to do.
I just ran it on 48GB (2x 3090) + overflow into CPU RAM and it runs at around 4 tok/s (only a small 8k context size, though), which, while absolutely not something I'd personally use daily, is actually usable.
I have a similar setup - can you help out with running it? Was it in Ollama?
EDIT: It seems that original authors provided a nice write-up:
https://unsloth.ai/blog/deepseekr1-dynamic#:~:text=%F0%9F%96...
Yep that's pretty much what I did, their calculation for the layers was slightly off though, I found I could offload an extra 1-2 layers to the GPUs
Oh yes I reduced it by 4 for just in case :) I found sometimes the formula doesn't work, so in the worst case -4 was used - glad at least it ran!
Which one did you run?
Not everyone needs the largest model. There are variations of R1 with fewer parameters that can easily run on consumer hardware. With an 80% size reduction you could run a 70B at 8-bit on an RTX 3090.
Other than that, if you really need the big one, you can get six 3090s and you're good to go. It's not cheap, but you're running a ChatGPT-equivalent model from your basement. A year ago this was a wet dream for most enthusiasts.
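Rough back-of-the-envelope math behind that claim (assumed sizes; ignores KV cache and runtime overhead):

    # Rough weight-memory estimate for a quantized model (illustrative only).
    def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    print(model_size_gb(70, 8))        # ~70 GB at 8-bit: too big for one 3090
    print(model_size_gb(70, 8) * 0.2)  # ~14 GB after an 80% reduction: fits in 24 GB
    print(model_size_gb(671, 1.58))    # ~132 GB: the ballpark for a 1.58-bit 671B MoE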
There’s a huge difference both in capabilities and in meaning between “variations of r1” and “r1 distill”. ollama is intentionally misleading people on this but the distills are much much worse
They're really not? Both subjectively and in benchmarks there is no world in which the delta between the models deserves a "much much".
I ran whatever version Ollama downloaded on a 3070ti (laptop version). It's reasonably fast. Generative stuff can get weird if you do prompts like "in the style of" or "a new episode of" because it doesn't seem to have much pop culture in its training data. It knows the Stargate movie, for example, and seems to have the IMDB info for the series, but goes absolutely ham trying to summarize the series.
This line in the stuff inside the <think> section suggests it's also been trained on YouTube clips:
>> "I'm not entirely sure if I got all the details right, but this is what I remember from watching clips and summaries online."
An excerpt from the generated summary:
>> "Set in the 23rd century during a Z-Corp invasion, the series features action sequences, strategic thinking, and humor. It explores themes of international espionage, space warfare, and humanity's role in the cosmos. The show incorporates musical numbers and catchy theme songs for an engaging viewing experience. The plot involves investigating alien warships and their secret base on Kessari planet while addressing personal conflicts and philosophical questions about space."
"It explores themes of international espionage, space warfare, and humanity's role in the cosmos" is the closest to correct line in the whole output.
> ran whatever version Ollama downloaded on a 3070ti (laptop version). It's reasonably fast.
Probably was not r1, but one of the other models that got trained on r1, which apparently might still be quite good.
Ollama has been deliberately misrepresenting R1 distill models as "R1" for marketing purposes. A lot of "AI" influencers on social media are unabashedly doing the same. Ollama's default "R1" model is a 4-bit RTN quantized 7B model, which is nowhere close to the real R1 (a 671B parameter fp8 MoE).
https://www.reddit.com/r/LocalLLaMA/comments/1i8ifxd/ollama_...
Ollama is pretty clear about it, it's not like they are trying to deceive. You can also download the 671B model with Ollama, if you like.
No they are not; they intentionally removed every reference to this not being R1 from the CLI, and changed the names from the ones both DeepSeek and Hugging Face used.
They used short strings for the names, which is very different from deception.
https://ollama.com/search
> DeepSeek's first-generation of reasoning models with comparable performance to OpenAI-o1, including six dense models distilled from DeepSeek-R1 based on Llama and Qwen.
Well I guess if you are in the Enterprise Java naming model you would expect something like "VisitorModelUtilsListGetterAdapterInterceptorMessageManagerDrivenObserverPool"
If you look at their API docs you will see:
I don't think there is any reason to jump to the conclusion that it is some type of conspiracy here, just naming things based on an API that probably didn't have distillation in mind when it was created.
Yeah, they're so clear in fact that they call the distilled models "R1" in the URL and everywhere on the page[1], instead of using the "DeepSeek-R1-Distill-" prefix, as DeepSeek themselves do[2].
[1]: https://ollama.com/library/deepseek-r1
[2]: https://github.com/deepseek-ai/DeepSeek-R1#deepseek-r1-disti...
I mean... yes. The DeepSeek announcement puts R1 right there in the name for those models. https://api-docs.deepseek.com/news/news250120
It's fairly clear that R1-Llama or R1-Qwen is a distill, and they're all coming directly from DeepSeek.
As an aside, at least the larger distilled models (I'm mostly running r1-llama-distill-70b) are definitely not the same thing as the base llama/qwen models. I'm getting better results locally, admittedly with the slower inference time as it does the whole "<think>" section.
Surprisingly, the content in the <think> section is actually quite useful on its own. If you're using the model to spitball or brainstorm, getting to see it do that process is just flat out useful. Sometimes more so than the actual answer it finally produces.
I'm not too hip to all the LLM terminology, so maybe someone can make sense of this and see if it's r1 or something based on r1:
>>> /show info
So... it's not R1 itself.
It's a model called Qwen, trained by Alibaba, which the DeepSeek team has used to "distill" knowledge from their own (100x bigger) model.
Think of it as forcing a junior Qwen to listen in while the smarter, PhD-level model was asked thousands of tough problems. It will acquire some of that knowledge and learn a lot of the reasoning process.
It cannot become exactly as smart, for the same reason a dog can learn lots of tricks from a human but not become human-level itself: it doesn't have enough neurons/capacity. Here, Qwen is a 7B model so it can't cram within 7 billion parameters as much data as you can cram into 671 billion. It can literally only learn 1% as much, BUT the distillation process is cleverly built and allows to focus on the "right" 1%.
Then this now-smarter Qwen is quantized. This means that we take its parameters (16-bit floats, super precise numbers) and truncate them to make them use less memory space. This also makes it less precise. Think of it as taking a super high resolution movie picture and compressing it into a small GIF. You lose some information, but the gist of it is preserved.
As a result of both of these transformations, you get something that can run on your local machine — but is a bit dumber than the original — because it's about 400 times smaller than the real deal.
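A toy illustration of that truncation step (an int8 round-trip, purely for intuition; real schemes like GGUF k-quants or the 1.58-bit dynamic quant are more sophisticated):

    import numpy as np

    # Toy symmetric int8 quantization: store weights as small integers plus one
    # scale factor, then reconstruct approximately.
    weights = np.random.randn(8).astype(np.float32)
    scale = np.abs(weights).max() / 127
    quantized = np.round(weights / scale).astype(np.int8)   # 4x smaller than fp32
    restored = quantized.astype(np.float32) * scale

    print(weights)
    print(restored)
    print("max error:", np.abs(weights - restored).max())   # small, but not zero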
"Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud."
And I think they, the DeepSeek team, fine-tuned Qwen 7B on DeepSeek's output. That is how I understood it.
Which apparently makes it quite good for a 7B model. But, again, if I understood it correctly, it is still just Qwen, without the reasoning of DeepSeek.
In my application, code generation, the distilled DeepSeek models (7B to 70B) perform poorly. They imitate the reasoning of the r1 model, but their conclusions are not correct.
The real r1 model is great, better than o1, but the distilled models are not even as good as the base models that they were distilled from.
it’s a distill, it’s going to be much much worse than r1
It is hilariously bad at writing erotica when I've used jailbreaks on it. Its knowledge is the equivalent of a 1980s college kid with no access to pornography who watched an R-rated movie once.
That's like trying to assemble an Ikea bookshelf with a bulldozer. All that extra power is doing nothing for the task you're asking of it, and there are plenty of lightweight alternatives.
Or if you want a large model but don’t need high performance, get a Mac with 128GB UMA.
How many tokens/s would you get in such a setup?
This Reddit thread says an m3 max 128GB gets 23 tokens/sec with deepseek r1 32B, and 4 tokens / sec with 70b: https://www.reddit.com/r/LocalLLaMA/comments/1i69dhz/deepsee...
People would only be 'throwing their hands up' because commercial LLMs have set unreasonable expectations for folks.
Anyone who has a need for, or understands the value of, a local LLM would be OK with this kind of output.
Everyone would have a need for an on-device LLM if the response rate were fast!
I have MLC Chat on my old Note 9 phone. It is actually still a great phone, but has 5GB RAM. Running an on-device model is the first and only use case where the RAM actually matters.
And it has a headphone jack, OK? I just hate Bluetooth earbuds. And yeah, it is a problem, but I digress.
When I run a 2.5B model, I get respectable output. Takes a minute or two to process the context, then output begins at somewhere on the order of 4 to 10 tokens per sec.
So, I just make a query and give it a few and I have my response.
Here is how I see it:
That little model, which is Gemma 2 2B (sorry, not 2.5B), knows a lot of stuff. It has knowledge I don't, and it gives it to me in a reasonable, though predictable, way. Answers always read like a certain kind of teacher reminding a student how it all goes.
I don't care. Better is nice, but if I were stuck somewhere with no network, being able to query that model is amazing!
First aid, how to make fires, materials and uses. Fixing stuff, theories of operation, what things mean and more are in that thing ready for me to take advantage of.
I consider what I have fast. And it will get one or two orders faster over the next few years too.
I did it on a lark (ask the model what that means) and was surprised to see I gained a nice tool.
> First aid, how to make fires, materials and uses
This scares me more than it should...
Please do not trust an AI in actual life and death situations... Sure if it is literally your only option, but this implies you have a device on you that could make a phone call to an emergency number where a real human with real training and actually correct knowledge can assist you.
Even as an avid hiker, the number of times I've been out of cell service is minuscule, and I absolutely refresh my knowledge of first aid regularly, along with any potential threats, before a hike somewhere new.
I asked the question. It did a respectable job.
For choking, as an example, it outlines all the basics, affirmed it was educational info only, and urged the help of a professional.
All in all, reasonable.
Your point stands, but this little 2.5B model has clear bounds, it seems.
I plan to poke at it a few more times asking for higher risk info, curious about what it may do.
Of course! I do the same. However, I won't deny that being able to get some information in a pinch, even if I must validate it with care, is a great thing.
It's just a tool in the toolbox. Like any tool, one must respect it and use it with care.
I fired the model up at 8:39, ready to chat just a bit before 8:40, during which I input the following prompt:
Please define "lark" for me, as in "I did it on a lark."
Output began a bit before 8:41, and was complete before 8:43:
---Begin Model Output
"On a lark" means without much planning or reason, impulsively or unexpectedly.
Here's how it works:
"I did it on a lark." This means the person acted without a specific plan or reason. "A lark" refers to the lark bird, a small songbird known for its playful nature.
So, "on a lark" suggests something spontaneous, casual, and often with a touch of playful whimsy.
Here are some other ways to understand it:
Without much thought: The person was likely to do it quickly and without much consideration. Based on a whim: It's something they decided to do without a solid plan.
Let me know if you'd like more examples
---End Model Output
To me, that is pretty damn spiffy!
I opened Dictionary.app and typed in lark and got the definition with etymology in under 30 seconds.
> British English informal used to suggest that an activity is foolish or a waste of time: he's serious about this music lark.
You know that's not the point. You can't even ask the dictionary app a question. You more or less just input and it'll give you what it has.
I can ask this model some pretty intense things and I get meaningful output back and it doesn't take very long and I don't need an internet connection. Like I said it's spiffy.
I will, have, and will use it again sans network connection and be happy I had the option.
Better will come. That is just fine. I have something very useful today, right now.
I use commercial LLMs every day. The best of them can still be infuriating at times to the point of being unproductive. So I'm not sure I agree here.
The repetition issue happens with simple quantization; what they are releasing is an approach that fixes that.
Yes exactly! I edited the blog post to make the wording a bit better!
Layman here — but I am hopeful for 1.58 bit plus custom silicon to be the Holy Grail. I suppose I am setting high expectations on Apple to integrate said in their next "A" chip.
Wishful thinking.
Ye a custom chip would be insane! 1.5 bit with a scaling factor seems to be actually usable for MoEs with shared experts!
I do want a 192GB Mac Ultra, I'm hoping the Nvidia Digit achieves similar at $3,000. Sadly no specifications or benchmarks, so tokens/sec is just a guess at this point.
> I’d rather build a rig with used 3090s and get way more bang for my buck
I'm curious, what would you use that rig for?
Random observation 1: I was running DeepSeek yesterday on my Linux box with an RTX 4090, and I noticed that the models really need to fit into VRAM (24GB here), or they are simply slow. So the Apple shared-memory architecture has an advantage here: a 192GB Mx Ultra can load and process large models efficiently.
Random observation 2: It's time to cancel the OpenAI subscription.
I canceled my OpenAI subscription last night, as did many many others. There were some threads in reddit with everyone chiming in they all just canceled too. imo OpenAI is done, and will go through massive cuts and probably acquired by the end of the year for a very tiny fraction of its current value.
You want to bet? The panic around deepseek is getting completely disconnected from reality.
Don't get me wrong, what DS did is great, but anyone thinking this reshapes the fundamental trend of scaling laws and makes compute irrelevant is dead wrong. I'm sure OpenAI doesn't really enjoy the PR right now, but guess what OpenAI/Google/Meta/Anthropic can do if you give them a recipe for 11x more efficient training? They can scale it to their 100k-GPU clusters and still blow everything else away. This will be textbook Jevons paradox.
Compute is still king and OpenAI has worked on their training platform longer than anyone.
Of course as soon as the next best model is released, we can train on its output and catch up at a fraction of the cost, and thus the infinite bunny hopping will continue.
But OpenAI is very much alive.
> The panic around deepseek is getting completely disconnected from reality.
This entire hype cycle has long been completely disconnected from reality. I've watched a lot of hype waves, and I've never seen one that oscillates so wildly.
I think you're right that OpenAI isn't as hurt by DeepSeek as the mass panic would lead one to believe, but it's also true that DeepSeek exposes how blown out of proportion the initial hype waves were and how inflated the valuations are for this tech.
Meta has been demonstrating for a while that models are a commodity, not a product you can build a business on. DeepSeek proves that conclusively. OpenAI isn't finished, but they need to continue down the path they've already started and give up the idea that "getting to AGI" is a business model that doesn't require them to think about product.
In a sense it doesn't, in that if DeepSeek can do this, making OpenAI-type capabilities available for Llama-type infrastructure costs, then if you apply OpenAI scale infrastructure again to a much more efficient training/evaluation system, everything multiplies back up. I think that's where they'll have to head: using their infrastructure moat (such as it is) to apply these efficiency learnings to allow much more capable models at the top end. Yes, they can't sleep-walk into it, but I don't think that was ever the game.
> The panic around deepseek is getting completely disconnected from reality.
Couldn’t agree more! Nobody here read the manual. The last paragraph of DeepSeek’s R1 paper:
> Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.
Just based on my evaluations so far, R1 is not even an improvement on V3 in terms of real world coding problems because it gets stuck in stupid reasoning loops like whether “write C++ code to …” means it can use a C library or has to find a C++ wrapper which doesn’t exist.
OpenAI's issue might be that it is extremely inefficient with money (high salaries, high compute costs, high expenses, etc.). This is fine when you have an absolute monopoly, as investors will throw money your way (OpenAI is burning cash), but once an alternative is clear, you can no longer do that.
OpenAI doesn't have an advantage in compute more than Google, Microsoft or someone with a few billions of $$.
Deepseek is not the only reason. I cancelled my OpenAI subscription because I've replaced it wholesale with Anthropic.
I replaced that with Kagi: unlimited access to multiple models including Claude, o1 and V3/R1, plus you also get Kagi search, which was already a good deal.
Oh wow. I have been using Kagi premium for months and never noticed that their AI assistant now has all the good AIs too. I was using Kagi exclusively for search and Perplexity for AI stuff. I guess I can cut down on my subscriptions too. Thanks for your hint. (Also, I noticed that Kagi has a PWA for their AI assistant, which is also cool.)
> You want to bet?
Why would anyone bet? They can just short the OpenAI / MS stocks, and see in a few months if they were right or not.
OpenAI isn't publicly traded and MSFT's stake is so minor compared to their other business that it will have a negligible impact on their stock price.
1) OpenAI isn't public, so not possible. 2) MS is one of the most well diversified tech companies, so, if anything, this will be a positive.
How is that any different from a bet?
Compute is not king; DeepSeek just demonstrated otherwise. And yes, OpenAI will have to reinvent itself to copy DS, but this means they'll have to throw away a lot of their investment in existing tech. They might recover, but it is not a minor hiccup as you suggest.
I just don't see how this is true. OpenAI has a massive cash & hardware pile -- they'll adapt and learn from what DeepSeek has done and be in a position to build and train 10x-50x-100x (or however) faster and better. They are getting a wake-up call for sure but I don't think much is going to be thrown away.
IMHO o1 is still comparable to, or a lot better than, DeepSeek for accomplishing actual stuff. At least for my use cases.
Of course the cost is incomparably higher, since Plus has a very low limit. Which of course is a huge deal.
In my experience with DeepSeek and o1, OpenAI's big talk about (and investment in) hallucination avoidance might save their hides here. DeepSeek may be smarter and understand complex problems better, but it also seems to make mistakes more often. (It's as if its comprehension is better, but it's worse at memorization/recall.)
Need an LLM to one-shot some complex network scripting? As of last night, o1 is still where it's at.
My experience gels with yours. Given the same code sample, DeepSeek has better, more creative suggestions about how to improve it, but it can't implement them without breaking the code. o1, generally, can implement DeepSeek's suggestions successfully. I think chaining them together might have quite interesting results.
Is there a tool that can automate chaining like that?
Aider has an architect mode where it asks one model to plan out the changes and another to actually write the code.
I've used it today, with R1 as architect and Sonnet as editor model. So far, this works great. There's no need to use a reasoning model as editor IMO.
Alex (https://alexcodes.app) also does this now btw.
That's OK if all you want to know is which model to use today, but a test like that is totally dependent on training data, and there is no reason to expect that either DeepSeek-V3 (the base model for R1) or the additional training data for R1 is the same as what OpenAI used for o1 and whatever base model it was built on.
The benchmark comparisons are perhaps, for now, the best way to compare reasoning prowess of R1 vs O1, since it seems pretty certain they both trained for those cases.
I think the real significance of R1 isn't the released model/weights itself, but more the paper detailing (sans training data) how to replicate it, and how effective "distillation" (i.e. generate synthetic reasoning data for SFT) can be to enhance reasoning even without using RL.
Why every time there is a new model all the other competitors are declared immediately dead?
The big deal here isn't that R1 makes any other models obsolete in terms of performance, but how cheap it is: $2 vs $60 per million output tokens compared to o1 (which it matches in benchmark performance).
O1 vs R1 performance on specific non-benchmark problems is also not that relevant until people have replicated R1 and/or tried fine-tuning it with additional data. What would be interesting to see is whether (given the different usage of RL) there is any difference in how well R1 vs O1 generalize to reasoning capability over domains they were not specifically trained for. I'd expect that neither do that well, but not knowing details of what they were trained on makes it hard to test.
Because we like drama.
1. You can get all the models by buying Kagi subscription (excluding o1). Includes DeepSeek models. You can also feed the assistant with search data that you can filter.
2. If you have GitHub Copilot, you get o1 chat also there.
I haven't seen much value with OpenAI subscription for ages.
I have Kagi Ultimate and it is nice for this. But a cheaper suggestion would be to use OpenRouter and then use these models via Fireworks or TogetherAI. It also integrates into much more applications. AFAIK Kagi doesn't document a user facing API for the assistant feature.
Unfortunately those are both 10-15x the cost of deepseek direct.
Deepinfra is pretty cheap though as a deepseek provider.
Sure. I meant moreso that this would be cheaper than Kagi while providing the same selection of models.
As for deepseek, I couldn't even sign up because my email domain is not on their whitelist. To just try it out for now I don't mind the increased cost.
That's because DeepSeek is subsidizing their API massively to get more training data.
Doesn't Microsoft own 49% of OpenAI? They'll end up with it all as a division of Microsoft.
I think they “own” 49% of OpenAI’s net income until a certain very high amount. Not a share of the actual company.
They "own" even 75% of profits until Microsoft has recouped its $13 billion investment. 49% comes after that.
I disagree, I don't really need "conversational chat responses", I need multimodal
ChatGPT is still the king of the multimodal experience. Anthropic is a distant second, only because it lets you upload images from the clipboard and responds to them, but it can't do anything else like generate images. It will sometimes do a flowchart, which is kind of cool and which GPT won't do, but will it speak to you, have tones, listen to you? No.
And on the open-source side, this area has been stagnant for like 18 months. There is no cohesive multimodal experience yet, just a couple of vision models with chat capabilities and pretty pathetic GUIs to support them. You still have to do everything yourself there.
There would be huge utility for me, and for many others who don't know it yet, if we could just load a couple of models at once that work together seamlessly in a single GUI, like how ChatGPT works.
The real insult here is graphics card vendors refusing to make ones with more than 24GB for several years now. They do this so you'll have to buy several cards for your AI workstation. Hopefully Apple eating their lunch fixes this.
The 5090 is 32GB out of the box. Not that that's anywhere near the top of what you can do on an Apple, but at least it's movement.
> They do this so you'll have to buy several cards for your AI workstation.
AFAIK you can't do that with newer consumer cards, which is why this became an annoyance. Even a RTX 4070 Ti with its 12 GB would be fine, if you could easily stack a bunch of them like you used to be able with older cards.
It's "easy" if you have a place to build an open frame rig with riser cables and whatnot. I can't do that, so I'm going the single slot waterblock route, which unfortunately rules out 3090s due to the memory on the back side of the PCB. It's very frustrating.
I think the parent's point is that NVLink no longer ships with consumer cards. Before, you could buy two cards plus a cable between them, and software could treat them as one card. Today you need software support for splitting between the cards, unless you go for "professional" cards or whatever they call them.
Maybe that's what they meant, and it'd be cool if nvidia still offered that on consumer cards, but thankfully you don't need it for LLM inference. The traffic between cards is very small.
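For what it's worth, a layer-wise split across two cards without NVLink looks roughly like this with Hugging Face transformers/accelerate (the model name is just an example; quantize or pick a size that fits your VRAM):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Example: ~28-30GB of bf16 weights, so it has to span two 24GB cards.
    # Requires `pip install transformers accelerate torch`.
    model_id = "Qwen/Qwen2.5-14B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",    # accelerate places layers on cuda:0, cuda:1, ... as they fit
        torch_dtype="auto",
    )

    # Only the activations at the split point cross the PCIe bus, so no NVLink needed.
    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output[0], skip_special_tokens=True))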
Isn't the issue that the software needs to explicitly add support for it now, compared to before, when you could just treat them as one card in software?
There was a rumor that 5090 or 5090D for China may or may not come with multi-GPU software locked. I think GP's referring to that. It's not clear if it is the case with retail cards.
I honestly don’t know why people aren’t more upset by this and still get on their knees for Nvidia. They made the decision specifically to cripple consumer card memory because they didn’t like data centers were using them instead of buying their overpriced enterprise cards that were less performant. They removed NVLink because people were getting better performance out of their two $400 cards than the $1,500 cards Nvidia was trying to peddle. They willfully screw consumers and people love them for it.
Because sensible people just use the cloud at this point, you can probably get several years of training for $6000
It buys you approximately two days (with reservation discount) of a single p5.48xlarge instance, which has 2TB of RAM, and 640GB of VRAM in 8x H100 cards. In fact that is the pricing example they use: https://aws.amazon.com/ec2/capacityblocks/pricing/
An MI300X (RunPod) with 192GB of RAM is $2.49/hr. Break-even point: you can rent for 2,410 hours (~100 days of non-stop continuous use) before reaching the cost of the $6,000 Mac. Macs top out at 192GB, not 2TB ;) Consideration: if your AI training is sporadic (e.g., a few hours daily or weekly), renting is significantly cheaper. The MI300X will also get you results many times faster, so you could probably multiply that 100 days!
Or buy 2 Nvidia digits for $6,000 to get 256GB vram.
I disagree with cancelling the OpenAI subscription. I've been getting some help from o1 for both Python and PHP recently, and o1 was doing massively better on the Python stuff (its code ran; DeepSeek's didn't, and wouldn't even with prompt refinement).
Also, for some philosophical stuff DeepSeek just won't do it. I'm working on an essay about spirituality, and sometimes it just responds that it doesn't know how to work on those types of problems and that we should do something fun like math or games. Claude tends to reply with something more like "I have to be honest with you, reincarnation is not real," and ChatGPT doesn't seem to care about that kind of thing at all.
Just don’t ask it about anything related to Tiananmen square or president Pooh..
I'd guess they did quite a bit of fine-tuning to censor some more sensitive topics, which probably impacts the output quality for other non-technical subjects.
Would fine-tuning by using a LoRA paper over the censorship to a large degree?
Why even bother decensoring it (except academic curiosity ig)? There are a million other ways you can learn about those subjects.
The people making the model probably don't really give a shit about politics and just did the minimum to avoid being embarrassed, but if people start jailbreaking it they will be forced to care.
Because the best way to learn is through interrogation.
I don't give a damn about ideology I just want everything ever thought or written searchable and interactive
IIRC thezvi's summary post on R1 mentioned that R1 is amazing for general reasoning and is very clearly a successful proof of concept/capability but a lot of effort seems to have been put into making o1 Good At Code as a practical matter, whereas R1 seems to have been more a research project which proved out the approaches and then was released without sanding the rough edges off because that wasn't the point.
Were you running a local model?
While 192GB of RAM is appealing, it's also quite expensive at $6,000. For that price I'd rather buy a system with 5 used 3090s, which, while "only" 120GB of VRAM, gives you much faster tokens/s and prompt processing speed (the Macs are notoriously slow at consuming large contexts).
I think just getting nvidia Project Digits might be the best option. A lot of people when it was announced were underwhelmed. But I think now it could be just the thing for people making their own ai home servers.
https://www.nvidia.com/en-us/project-digits/
Yep, I think the same. With 128GB fast memory one could run this.
Can I use that on the train though? I can with a 128GB MacBook, without it sounding like a helicopter taking off as well.
you don't need to take ai training quite so literally (:
Honestly, if you have a residence of some kind and an Internet connection, you don't need to bring your beefy computer with you everywhere. It is cool to be able to have ridiculously powerful mobile computers, but I don't think I would ever be willing to take a $6,000 laptop anywhere it has a decent chance of being stolen.
Do you live in a third world country? If so I might agree, but otherwise trains are perfectly safe.
I am very happy for you, but laptops get stolen in public in most countries.
Laptops get stolen on a train? An enclosed, single-direction space that only occasionally allows you to exit between infrequent, long-distance stops? A thing that contains ticket inspectors and a literal guard?
How many laptops have you personally seen be stolen on a train?
You mean a tight, enclosed, single-direction space, crowded with people who are tired, and/or trying to relax, and/or thinking about the destination, and/or otherwise not particularly focused after hours of travel; a thing that contains ticket inspectors that show up every now and then to check tickets, and from which passengers embark and disembark at dozens point along the length of the thing, simultaneously, with no supervision or security checks.
Depending on the train type and configuration, many actually seem like pickpocket paradise.
Pickpocketing is a very different proposition. They rely on a lack of awareness, taking your wallet and being long gone before you've even noticed. If someone steals your laptop from in front of you without you even noticing, I'd suggest that one is on you.
FWIW I’ve used my laptop on the train plenty, I’ve never had anything stolen nor felt in any danger of it.
But would you consider leaving it unattended on your seat and going for lunch to the restaurant car, or for an extended toilet break?
...why would I ever do that? You leave something worth several thousand dollars anywhere in public you're risking losing it. What are we even debating here?
You might have seen some laptops have screens that fold down, I know MacBooks do. This "clam shell" effect protects the keyboard, trackpad, and even the screen from bumps and jostles. Many laptops when so closed can even fit in a backpack.
So a little trick I figured out is to close my laptop lid and then slide it into a pocket of my backpack. I can then carry it with me when I get up and move around.
So then I can take it with me to eat lunch or an extended toilet break. Maybe some day all laptops will have that feature.
Is your laptop in your pocket?
Yes, all the time. It's happened to two people I know, in France and in the US.
People get up to use the bathroom or the cafe car, the laptop is left behind for ten minutes, one of the train stops is while they're away from their seat, and someone sees an opportunity, snags it, and gets off at the stop.
This is an actual thing. And if it's worth a thousand bucks, then it's very much worth getting off at an earlier stop than you'd planned and continuing your journey on the next train.
Ticket inspectors or guards are irrelevant. There isn't one in your car 99% of the time.
I don't know why you're trying to argue that laptop theft on trains in first-world countries isn't a thing. It absolutely is.
Different regions of the world see different degrees of responsibility regarding theft. I would consider it absurd to leave something valuable unattended in a public space, considering the effort required to avoid that (that is, taking it with you).
So, yes, thefts on trains from people who think they are 100% safe are a thing, but applying the same idea (assuming something is 100% safe and not being cautious), I wonder how such people use the internet...
My coworker was having coffee and using his work laptop at an outdoor coffee shop in Mountain View, CA. Someone on a bike rode by and attempted to grab his phone and bike off with it.
The attempted thief didn't succeed in taking the phone, but did knock the laptop onto the ground, damaging it.
The discussion was about leaving valuable objects unattended in public places. Sure, a theft can happen even when you're present, or with violence, but I personally avoid increasing the chance of having something stolen by leaving it unattended.
If I compiled statistics on the primary cause of ending up without a laptop among people I know, the biggest danger would be liquids in glasses (that end up on the laptops)...
A random person able to dart in and then make a getaway is not what "working on a train" is like and that was the original comment's point.
You're going to take your laptop with you into the toilet on the train...?
I don't think I've ever seen a human being do that before on a train. Not to go to the toilet, nor to grab a coffee in another car.
You can't be paranoid about everything. My friend in France had put his laptop back into his bag where it wasn't visible and assumed that was good enough, but someone must have seen him do it and just took the whole bag.
You are applying a totally unreasonable standard, to suppose that the thefts were due to unreasonable carelessness. What, do you think someone should take their large luggage into the bathroom too, every time they need to pee?
Talk about victim-blaming.
Yes, if I go to the toilet I take my backpack/small bag with me, because usually I have valuable stuff in it and it's easy to carry. This does not apply to a large bag (in which I don't put valuable stuff).
The standard is mine and I follow it. The same way I find absurd not to do it, you find it unreasonable to do it.
I find the expectation that things won't be stolen (if left unsupervised in public places) strange, considering the huge wealth inequality even in developed countries. I do not agree with the idea of stealing, and thieves should be punished, but expecting everybody "to behave" given the situation seems unrealistic to me.
That does not mean I think things are stolen 100% of the time. I have a friend who forgot a laptop on a bus (Netherlands); the driver found it at the end of the line and handed it in to lost and found, so my friend got it back.
I mean, that's great for you, but it's just not what 99% of people do. You don't usually see people take their backpack into a train bathroom. I've taken a lot of trains and sat near the bathroom often enough (unfortunately). But like I said, it applies to the cafe car too.
If you find it absurd how 99% of people act on long-distance trains, I don't know what to tell you.
Ok - that's really poor opsec. If I'm going to the bathroom in a train with my laptop (whether it's expensive or not - it has access to all my stuff - which is arguably more valuable), I'll sleep it, put it in my backpack and take the backpack to the bathroom with me.
My work policies state you simply cannot leave your laptop out of sight for any period unless it's in a secure location (work|home). I feel the same way for my personal laptop as well.
You don't hear much about laptop thefts these days because phones are more valuable, more numerous, and much easier to steal.
Obviously, nobody steals things while the train is in motion. They wait until the train is about to leave the station, snatch a phone or handbag and jump out just as the door is closing. The train leaves, the thief blends in with other passengers leaving the station, and by the time news of the theft has made it from the passengers to the driver to the station staff the thief is long gone.
Of course people drive around $6,000+ cars all the time, so....
> Obviously, nobody steals things while the train is in motion.
Something interesting: I live near a train line where the doors are not automatic (they have to be opened manually on each stop), and there have been incidents where people get pickpocketed while the train is still in motion, and the thief jumps out right before the station, when the train has slowed down significantly but is still in motion. Many people have been hurt doing this.
People stabbed maybe, but that tends to be more sports related than laptop related. (Yes on a national line(!))
Yeah, a long, enclosed space with no exits is more amenable to drunken violence than petty theft.
Only on Hacker News would I have someone arguing with me that laptop theft is not a concern. You know what, you win. It's your $6,000 laptop, not mine.
A $6000 laptop doesn’t look much different than a $1000 laptop. I don’t think it’s a bigger theft risk than any other laptop.
Make sure the laptop is insured and that full disk encryption is enabled. If it’s a Mac, make sure you have it in Find My so you can wipe it remotely if that’s something you worry about.
Honestly, I didn't bother making a better case for why I wouldn't want a $6,000 laptop, in large part because the nerve of people arguing that theft isn't a concern at all made me stubborn. Theft is one reason, but a laptop is also a hell of a lot easier to simply break or lose than a desktop that is permanently installed somewhere, and a desktop is more upgradable and repairable, with typically much more I/O.
Today's baseline laptops are really good as it is. 32-64 GiB of RAM is plenty, and at least on PC laptops you can do it fairly cheaply. Apple has been a consistent year or two ahead in mobile CPU performance but it fell out of my consideration ever since I realized the M1 and 7040 were both very sufficient for any local computation I cared about. (I'm not going to say I'd specifically go for less efficiency or performance, but it has become significantly lower priority over other things like repairability.)
Not really specifically hating on Apple, here. If I was going to get another Mac it'd be a Mac Mini or Mac Studio probably, ideally with a third-party SSD upgrade to both save on costs and get a slight bit of extra drive performance too. I've definitely considered it, even though I am very far from an Apple fan, just due to the superior value and efficiency they have in many categories.
Yes! This goes in my forthcoming blog post "Only on Hacker News..."
Yesterday's entry: "... kind of a mind flex that you noted you used Meta Stories glasses to take that photo."
So, zero times then. Ok!
For what it's worth, I never once insinuated that a laptop would get stolen on a train, only that I wouldn't want to bring such a laptop into the public in the first place. (Presumably, the laptop doesn't come into and exit existence upon entering and exiting the train, so this remains somewhat of a concern even if trains are involved.)
But yes, you're right. I've never personally seen a laptop get stolen. In fact, most people who have their laptop get stolen never see their laptop get stolen either.
I have, however, had coworkers who've had their laptops stolen. Multiple times.
it must be amazing to have so much faith in people like you seem to have
For real? Grab it before the door closes.
> Can I use that on the train though? I can with a 128GB MacBook, without it sounding like a helicopter taking off as well.
What kind of timescale do you expect to be able to train a useful LLM with that?
Well it’s about an hour to commute on the train so I guess that long :3
If you have an internet connection then sure you can?
You can use a desktop computer on a train if it's one with power outlets. Might get some funny looks, but I've seen it happen (or at least pictures). :)
Only time I've seen that done was with assistive tech and I do sympathise that those setups are difficult enough with desktops
>> While 192GB of ram is appealing, it's also quite expensive at $6000.
That's because it's Apple. It's time to start moving to AMD systems with shared memory. My Zen 3 APU system has 64GB these days and it's a mini-ITX board.
What is the performance in ML workloads like on AMD APUs compared to Apple Silicon?
The power requirement for 5x 5090s is 10x higher, so you'll spend far more than $6000 in electricity over time.
5x 3090 is also much more power hungry?
For personal usage, does it matter though? In most places residential electricity is cheap compared to everything else. In a DC context I feel it matters a lot more compared to the capex.
1x 3090 (350W power limit) already makes it feel like I'm running a fan heater under my desk, 5x would be nuts.
When running inference workloads via something like llama.cpp, only 1 GPU is ever used at a time, so you would have 1 active GPU and 4 idle GPUs. That should make the power usage less insane in practice than you expect.
Place and time your use right, and you'll save a bit on heating in winter and/or at night.
I think the last time any of my computers had a case was back when I realized the pair of 900gx2 cards I was running was turning my computer into an easy bake.
The good thing is that since MoEs are mainly memory bound, in my tests we just need (VRAM + RAM) to be in the range of 80GB or so for at least 5 tokens/s or so.
It's better to get (VRAM + RAM) >= 140GB for at least 30 to 40 tokens/s, and if VRAM >= 140GB, then it can approach 140 tokens/s!
Another trick is to activate more than 8 experts per pass - it'll be slower, but might be more accurate. You could even try reducing the # of experts to say 6 or 7 for low-FLOP machines!
Oh yes, 192GB machines should be able to run these quants (131GB for 1.58bit, 158GB for 1.73bit, 183GB for 2.22bit) well :)
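To make those numbers concrete, here's a rough sizing sketch (not an official Unsloth tool) that checks which of the dynamic quants would plausibly fit a given VRAM + RAM budget, using the quant sizes and rule-of-thumb speed tiers quoted above:

```python
# Rough sizing sketch: which dynamic quants fit in VRAM + RAM, and which of the
# speed tiers quoted above (approximately) applies. Purely illustrative.

QUANT_SIZES_GB = {"1.58bit": 131, "1.73bit": 158, "2.22bit": 183}

def which_quants_fit(vram_gb: float, ram_gb: float) -> None:
    total = vram_gb + ram_gb
    for name, size in QUANT_SIZES_GB.items():
        verdict = "fits" if total >= size else "does not fit"
        print(f"{name} ({size}GB): {verdict} in {total}GB total")
    # Very approximate speed tiers from the comments above:
    if vram_gb >= 140:
        print("VRAM >= 140GB: can approach ~140 tokens/s")
    elif total >= 140:
        print("VRAM + RAM >= 140GB: roughly 30-40 tokens/s")
    elif total >= 80:
        print("VRAM + RAM >= 80GB: roughly 5 tokens/s")
    else:
        print("Below ~80GB total: heavy offloading, very slow generation")

which_quants_fit(vram_gb=24, ram_gb=64)
```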
Great release Daniel. Applaud the consistency you have shown.
Can you release slightly bigger quant versions? Would enjoy something that runs well on 8x32 v100 and 8x80 A100.
Thanks! Oh I did release 4bit quants, 5bit, 6bit etc all at https://huggingface.co/unsloth/DeepSeek-R1-GGUF if that helps - they're not dynamic though but it should function fine :)
Yes, shared memory is a pretty big leg up since it lets the GPU process the whole model even if the bandwidth is slower which still has some benefits.
Apple's M chips, AMD's Strix Point/Halo chips, Intel's Arc iGPUs, Nvidia's Jetsons. The main issue with all of these though is the lack of raw compute to complement the ability to load insanely large models.
So I'm thinking: inference seems mostly memory bound. With a fast CPU (for example a 7950X with 16 cores) and 256GB of RAM (which seems to be the max), shouldn't that give you plenty of ability to run the largest models (albeit a bit slowly)?
It seems that AMD Epyc CPUs support terabytes of RAM, and some are as cheap as 1000 EUR. Why not just run the full R1 model on that? It seems it would be much cheaper than multiple of those insane Nvidia cards.
The bottleneck is mainly memory bandwidth. AMD EPYC hw is appealing for local inference because it has a higher memory bandwidth than desktop gear (because 8-12 memory channels vs 2 on almost everything else), but not as fast as the Apple architectures and nowhere near VRAM speeds. If you want to drastically exceed ~3-5 tokens/s on 70b-q4 models, you usually still need GPUs.
This was beautifully illustrated in the recent Phoronix 5090 LLM benchmark[1], which I noted here[2]. The tested GPUs had an almost perfect linear relationship between generated token/s and GB/s memory bandwidth, except the 5090 where it dipped slightly.
I guess the 5090 either started ever so slightly to become compute limited as well, or hit some overhead limitation.
[1]: https://www.phoronix.com/review/nvidia-rtx5090-llama-cpp
[2]: https://news.ycombinator.com/item?id=42847284
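The rule of thumb behind that linear relationship: in memory-bound decoding, every generated token has to stream the active weights once, so tokens/s is at most bandwidth divided by bytes of weights read per token. A rough sketch, with ballpark bandwidth figures that are assumptions rather than measurements:

```python
# Back-of-the-envelope decode-speed ceiling for a memory-bound model:
# tokens/s <= memory bandwidth / bytes of weights read per token.
# Ignores compute, KV-cache traffic and batching; treat results as upper bounds.

def max_tokens_per_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    return bandwidth_gb_s / active_weights_gb

# A dense 70B model at ~4-bit is roughly 40GB of weights, all read every token.
print(max_tokens_per_s(80, 40))    # dual-channel desktop DDR5  -> ~2 tok/s ceiling
print(max_tokens_per_s(460, 40))   # 12-channel EPYC DDR5       -> ~11 tok/s ceiling
print(max_tokens_per_s(800, 40))   # Apple M-Ultra class        -> ~20 tok/s ceiling

# An MoE like R1 only activates ~37B parameters per token, which is why it can
# be comparatively fast despite its huge total parameter count.
print(max_tokens_per_s(800, 37))   # ~37GB of active weights    -> ~21 tok/s ceiling
```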
It’s more expensive, but Zen4 Threadripper Pro is probably the way to go on that front. 8 memory channels, with DIMMs available up to DDR5-7200 for 8x32GB (256GB), or DDR5-6800 for 8x48GB (384GB). It’ll set you back ~$3k for the RAM and ~$6k for a CPU with 8 CCDs (the 7985WX, at least), and then ~$1k for motherboard and however much you want to spend on NVME. Basically ~$10k for a 384GB DDR5 system with ~435GB/s actual bandwidth. Not quite as fast as the 192GB Apple machines, but twice as much memory and more compute for “only” a few thousand more.
At these prices, I would just get 2xDigits for $6k and have 256gb.
I have a feeling that Digits will probably get sold out and pricing will get hiked WAY up.
is it confirmed that you can get 256gb of vram for that amount? Because my understanding is that digits pricing will start at $3k for some basic config.
What they meant is buying two whole separate computers.
I understand. It is still unclear if you can get 128GB vram for $3k.
Well, I mean, the press release is pretty unambiguous.
>Each Project DIGITS features 128GB of unified, coherent memory and up to 4TB of NVMe storage.
Even if $3k is only the starting price, it doesn't sound like spending more buys you more memory.
Ok, but it is not clear what kind of RAM that is, how many memory channels, etc. If the goal is just to have 128GB of some RAM, that could be achieved by paying a few hundred dollars.
Fine, but at that point you're arguing about the concept of the product. It's billed as a computer for AI and you're saying that it might not be more suitable for AI than a regular PC.
It is possible that one could build a better PC than Digits for AI. We will see once they release Digits.
On Zen5 you also get AVX512 which llamafile takes advantage of for drastically improved speeds during prompt processing, at least. And the 12 channel Epycs actually seem to have more memory bandwidth available than the Apple M series. Especially considering it's all available to the CPU as opposed to just some portion of it.
Maybe EPYC can make better use of the available bandwidth, but for comparison I have a water cooled Xeon W5-3435X running at 4.7GHz all-core with 8 channels of DDR5-6400, and CPU inference is still dog slow. With a 70B Q8 model I get 1 tok/s, which is a lot less than I thought I would get with 410GB/s max RAM bandwidth. If I run on 5x A4000s I get 6.1 tok/s, which makes sense... 448GB/s / 70GB = 6.4 tok/s max.
Very strange, as I get 2 tok/s with a 14B Q8 model on an old i5-12400 + DDR4.
FWIW Threadrippers go up to 1TB and Threadripper Pro up to 2TB. That's even in the lowest model of each series. (I know this because it happens to be the chip I have. Not saying you shouldn't go for Epyc if it works out better.)
Have you tried running the full R1 model with that? People in sibling comments mention high-end EPYCs for a $10K machine, but I'm curious whether it's possible to make a $1-2K machine that could still run those big models simply because they fit in RAM.
I spent about $3000 on my machine, have the cheapest Threadripper CPU and 256GB of RAM, so no, 600GB won't fit in RAM on a $2K machine.
But everyone is using the distilled models which are much smaller.
idk, in my daily work i still see o1 being more useful, did you observe both having the same reasoning power?
Wow, an 80% reduction in size for DeepSeek-R1 is just amazing! It's fantastic to see such large models becoming more accessible to those of us who don't have access to top-tier hardware. This kind of optimization opens up so many possibilities for experimenting at home.
I'm impressed by the 140 tokens per second speed with the 1.58-bit quantization running on dual H100s. That kind of performance makes the model practical for small or mid sized shops to use it for local applications. This is a huge win for people working on agents that require low latency that only local models could support.
Btw completely off topic, but your comment triggered the internal classification in my brain, and it looks like AI-generated.
Not accusing you of anything. Could be that you happen to write in a way similar to LLMs. Could be that we are influenced by LLM writing styles and are writing more and more like LLMs. Could be that the difference between LLM-generated content and human-generated content is getting smaller and harder to tell.
+1 my LLM spidey senses were tingling.
It’s the exclamation point in the first paragraph, the concise and consistent sentence structure, and the lack of colloquial tone.
OP, no worries if you’re real. I often read my own messages or writing and worry that people will think I’m an LLM too.
"This kind of optimization opens up so many possibilities" was what triggered me.
haha you got me. I'm a real person using an LLM to proofread the stuff I write. English is not my native language and I'm trying to improve my written vocabulary a little bit. Sorry if it reads a little off.
Haha no worries. This is a perfectly valid use case of LLM. I'm happy that the comment sounds very professional and to the point.
Very funny, I didn't mentally jump to LLM, but the language was so lifeless that I stopped reading.
Amazing that OP confirmed you're correct (and good use of LLM @OP).
I was pleasantly surprised by 140 tokens/s as well! I literally thought I did something wrong but it was real!
> Unfortunately if you naively quantize all layers to 1.58bit, you will get infinite repetitions in seed 3407: “Colours with dark Colours with dark Colours with dark Colours with dark Colours with dark” or in seed 3408: “Set up the Pygame's Pygame display with a Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's”.
This is really interesting insight (although other works cover this as well). I am particularly amused by the process by which the authors of this blog post arrived at these particular seeds. Good work nonetheless!
Hey! :) Coincidentally the seeds I always use are 3407, 3408 and 3409 :) 3407 because of https://arxiv.org/abs/2109.08203
I also tried not setting the seeds, but the results are still the same - quantizing all layers seems to make the model forget and repeat everything - I put all examples here: https://docs.unsloth.ai/basics/deepseek-r1-dynamic-1.58-bit#...
would be great to have dynamic quants of V3-non-R1 version, as for some tasks it is good enough. Also would be very interesting to see degradation with dynamic quants on small/medium size MoEs, such as older Deepseek models, Mixtrals, IBM tiny Granite MoE. Would be fun if Granite 1b MoE will still be functioning at 1.58bit.
Oh yes multiple people have asked me about this - I'll see what I can do :)
Can't this kind of repetition be dealt with at the ~~decoder~~ (edit: sampler) level, like for any models? (see DRY ~~decoder~~ sampler for instance: https://github.com/oobabooga/text-generation-webui/pull/5677)
Oh yes one could provide a repetition penalty for example - the issue is it's not just repetition that's the issue. I find it rather forgets what it already saw, and so hence it repeats stuff - it's probably best to backtrack, then delete the last few rows in the KV cache.
Another option is to employ min_p = 0.05 to force the model not to generate low-probability tokens - it can help especially since the 1.58bit model generates an "incorrect" token (for eg `score := 0`) roughly once every 8000 tokens.
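For anyone curious what min_p actually does, here's a minimal sketch (not llama.cpp's actual implementation): after the softmax, any token whose probability falls below min_p times the top token's probability is dropped before sampling.

```python
import numpy as np

# Minimal sketch of min_p filtering: drop every token whose probability is
# below min_p * (probability of the most likely token), renormalise, sample.
rng = np.random.default_rng(3407)

def min_p_sample(logits, min_p=0.05, temperature=0.6):
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()   # rare "glitch" tokens get cut here
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy vocabulary: a dominant token, a plausible alternative, and a rare glitch
# token of the kind the 1.58-bit model emits roughly once per 8000 tokens.
logits = [8.0, 6.5, 0.5]
print(min_p_sample(logits))   # index 2 is filtered out and never sampled
```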
You likely mean sampler, not decoder. And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy. If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.
> You likely mean sampler, not decoder.
Indeed, that's posting before being fully awake.
> And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy.
Of course you can't recover any accuracy, but LLMs are in fact prone to this kind of repetition no matter what; this is a known failure mode, which is why samplers aimed at avoiding it have been designed over the past few years.
> If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.
But it won't necessarily be gibberish! Even a highly quantized R1 still has much more embedded information than a 14B or even 32B model, so I don't see why it should output more gibberish than smaller models.
You can deal with this through various sampling methods, but it doesn't actually fix the fried model.
As someone who is out of the loop, what’s the verdict on R1? Was anyone able to reproduce the results yet? Is the claim that it only took $5M to train generally accepted?
It’s a very bold claim which is really shaking up the markets, so I can’t help but wonder if it was even verified at this point.
> Is the claim that it only took $5M to train generally accepted?
Based on Nvidia being down 18% yesterday I would say the claim is generally accepted.
Because the markets are rational, all-knowing, and have never been wrong?
No, because the market is an aggregate of opinions, so it’s entirely fair to say it’s “generally accepted.” That has nothing to do with whether something happens to be true or not.
It may provide a financial opportunity for someone who disagrees with that aggregated opinion though.
That was not the question.
It is if you're using market movements as evidence of anything factual. If markets aren't rational, you can't use them that way.
do you only take advice/learn from all-knowing people?
Do you know any?
But here's my advice: drop the fallacious arguments and try something more honest.
my argument isn’t fallacious - it is logical: we can learn/use evidence from something without presuming it is all-knowing. you are putting words in others' mouths that they did not say
I'm sorry, I thought you introduced the "all-knowing" out of nowhere, but this was indeed mentioned by willsmith72. I'd missed that.
Still, his implied assertion that markets can often behave irrationally, and can't be used as evidence of technical matters, seems pretty valid to me.
But I suppose you could see it as a sign that something is at least temporarily "generally accepted" among investors. That doesn't mean it's generally accepted among AI researchers, though.
Although I thought it was $6M rather than $5M, and that that was only the last step, not the total investment. What does seem to be generally accepted among investors is that this isn't good news for NVidia's profits, but that still doesn't mean that all the specific facts are generally accepted.
Don't worry, NVDA will bounce back and you will get a chance to get out.
Efficient market hypothesis is for nerds!
as opposed to HN comments??
It is still unconfirmed since no one outside of deepseek reproduced it.
If confirmed, Nvidia could go down even more
based on the information and background they thoroughly gave when releasing their research, it's pretty easy to put together that it did take them significantly fewer resources to train this model. only having specific parameters active at a time instead of activating everything all at once is pretty ingenious.
that and they just happened to be undergoing a large scale "cyber attack"
I'm not sure I see the bear argument for NVidia here. Huge AI models certainly drive NVidia sales, but huge AI models are also widely thought to be untrainable and nearly un-runnable save for large datacenters.
To me, this is ripe for an application of the Jevons paradox. If architectural improvements make similar models cheaper, I would expect to see more of them trained and deployed, not fewer, ultimately increasing the market for GPU-like hardware.
While Deepseek was an instigator in the price movements, I would not say it's accepted.
I don’t see them as related. The market moves when there is money to be made. It’s only tangentially related to any kind of general sentiment.
“I don’t believe this, but I know others will, so I’m selling”
> Nvidia being down 18%
The only part of DeepSeek-R1 I do not like. I hope it's over, but I am not holding my breath.
Nvidia is now up only 1906% over 5 years. What a disgrace
It crashed all the way back to June 2024 levels, eons of progress wiped out
Huggingface is working on reproducing it: https://github.com/huggingface/open-r1
They claimed that it only took $5 million of GPUs to train Deepseek v3, which was the base model. They did not claim that the total costs were $5 million. They omitted the costs of additional hardware, electricity, personnel, training dataset acquisition, etcetera. They likely spent tens of times more on this at a minimum.
That said, what they did with $5 million of GPUs is impressive. Reportedly, they resorted to using PTX assembly to make it possible:
https://www.tomshardware.com/tech-industry/artificial-intell...
I think the jury is out. With folks trying to replicate the process, we will see if the low budget is true or not. I am still on the fence; there were comments from the Scale CEO that they used a huge number of H100s. On the market side, I think regardless of whether it was true or not, this gave people the opportunity to sell what are perhaps overinflated valuations.
That's likely only the marginal cost of training this model, and doesn't include a lot of other costs, like the datacenters and GPUs themselves which they already had and also the staff.
If they aren't lying because they have hardware they're not supposed to have, which is also a possibility.
these claims are getting more wrong every time i see them, weird game of telephone going around tech circles.
the cost absolutely includes the cost of GPUs and data centers, they quoted a standard price for renting h800 which has all of this built in. but yes, as very explicitly noted in the paper, it does not include cost of test iterations
r1 probably cost way less to train, $5m is the alleged price tag for dsv3
>For optimal performance, we recommend the sum of VRAM + RAM to be at least 80GB+.
Oh nice! So I can try it in my local "low power/low cost" server at home.
My home system runs a Ryzen 5500 + 64GB RAM + 7x RTX 3060 12GB.
So 64GB RAM plus 84GB VRAM.
I don't want to brag, just point to solutions for us tinkerers with a small budget and high energy costs.
Such a system can be built for around 1600 euro. The power consumption is around 520 watts.
I started with an AM4 board (B450 chipset) and one used RTX 3060 12GB, which costs around 200 euro used if you are patient.
Every additional GPU is connected with a PCIe riser/extender to give the cards enough space.
After a while I replaced the PCIe risers with a single PCIe x4 to 6x PCIe x1 extender.
It runs pretty nicely. Awesome for learning and gaining experience.
How are you arriving at those numbers?
Ryzen 5500 + 7x 3060 + cooling ~= 1.6 kW off the wall, at 360 GB/s memory bandwidth, and considering your lane budget, most of it will be wasted on single PCIe lanes. The after-market unit price of 3060s is 200 eur, so 1600 is not a good-faith cost estimate.
From the looks of it, your setup is neither low-power nor low-cost. You'd be better served with a refurbished Mac Studio (2022) at 400GB/s bandwidth fully utilised over 96GB of memory. Yes, it will cost you 50% more (considering the real cost of such a system is closer to 2000 eur), however it would run at a fraction of the power (10x less, more or less).
I get it that hobbyists like to build PCs, but claiming that sticking seven five-year-old low-bandwidth GPUs in a box is "low power/low cost" is a silly proposition.
You're advocating for e-waste
The issue is that you are taking max GPU power draw as a given. Running an LLM does not tax a GPU the same way a game does. There is a rather well-known YouTuber who ran LLMs on a 4090, and the actual power draw was only 130W on the GPU.
Now add that this guy has 7x 3060 = almost certainly a miner setup. So you know that he is running an optimized profile (underclocked).
FYI, my gaming 6800 draws 230W, but with a bit of undervolting and sacrificing 7% performance, it runs at 110W for the exact same load. And that is 100% taxed. This is just a simple example to show that a lot of PC hardware runs very much overclocked/unoptimized out of the box.
Somebody getting down to 520W sounds perfectly normal for undervolted cards that give up maybe 10% performance for big gains in power draw.
And no, old hardware can be extremely useful in the right hands. Add to this that the main factor influencing speed tends to be memory (how much you can fit, and the interconnects) rather than raw processing performance when running an LLM.
Being able to run a large model for 1600 sounds like a bargain to me. Also, remember that when you're not querying the models, the power draw is mostly just the memory and power regulators. Coming back to that YouTuber: he was not constantly drawing that 130W; it only spiked when he ran prompts or did activity.
Yes, running from home will be more expensive than a $10 Copilot plan but ... nobody is looking at your data either ;)
Thanks for the clarification. Sure, if I run a hashcat benchmark the power consumption goes to nearly 1400 watts, but I also limited the max power consumption for each card to 100 watts, which worked out better than limiting the max GPU frequency. To be fair, most of the speed comes from the RAM frequency - as long as that is not limited, it works out great.
I took a fair amount of time to get everything to a reduced power level and measured several LLM models (and hashcat for the extreme) to find the best speed per watt, which is usually around 1700-1900 MHz, or limiting the 3060s to 100-115 watts.
If I had planned it from the start, I might have gone with a used Mac Studio, that's right. However, I incrementally added more cards as I moved further into exploration.
I didn't want to confront anyone, but it looks like you either show off 4x 4090s or you keep silent.
np ;)
I am amazed these days by how many people lack knowledge about hardware, and about the massive benefits of undervolting/power-limiting it. It's like people do not realize that what is sold is often overclocked with too high a vcore. The number of people I see buying insanely overspecced PSUs makes me go O_o ...
How is your performance with the different models on your setup?
"Undervolting" is a thing for 3090s where they get them down from 350 to 300W at 5% perf drop but for your case it's irrelevant because your lane budget is far too little!
> know Youtuber, that ran LLMs on a 4090, and the actual power draw was only 130W on the GPU.
Well, let's see his video. He must be using some really inefficient backend implementation if the GPU wasn't being utilised more than that.
I'm not running e-waste. My cards are L40S and even in basic inference, no batching with ggml cuda kernels they get to 70% util immediately.
For anyone wondering why "1.58" bits: 2^1.58496... = 3. The weights have one of the three states {-1, 0, 1}.
They say something else:
> We managed to selectively quantize certain layers to higher bits (like 4bit), and leave most MoE layers (like those used in GPT-4) to 1.5bit
That was just improper rounding from 1.58 to 1.5. They say 1.58 in other places and explicitly link to https://arxiv.org/abs/2402.17764
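As a purely hypothetical illustration of that selective scheme (not Unsloth's actual code): the idea boils down to a per-layer bit-width rule, where the bulk of the MoE expert weights go to 1.58-bit while a handful of sensitive layers stay at 4-bit or higher. The layer-name patterns below are invented for the example.

```python
# Hypothetical sketch of per-layer bit-width selection, assuming made-up
# layer-name patterns. Not Unsloth's real recipe.

def pick_bits(layer_name: str) -> float:
    sensitive = ("embed", "lm_head", "attn", "norm")   # assumed "sensitive" set
    if any(key in layer_name for key in sensitive):
        return 4.0          # keep these at 4-bit or higher
    if "experts" in layer_name:
        return 1.58         # the vast majority of the MoE parameters live here
    return 4.0              # everything else also stays at higher precision

for name in ["model.layers.10.mlp.experts.3.gate_proj",
             "model.layers.10.self_attn.q_proj",
             "model.embed_tokens"]:
    print(f"{name:45s} -> {pick_bits(name)} bits")
```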
Would be great if the next generation of base models was designed to be inferred with 128GB of VRAM while 8bit quantized (which would fit in the consumer hardware class).
For example, I imagine a strong MoE base with 16 billion active parameters and 6 or 7 experts would keep a good performance while being possible to run on 128GB RAM macbooks.
So I remember Deepseek used float8 for training - Character AI also used int8 for training - it is indeed possible, but sometimes training can be unstable - Deepseek to my knowledge is actually the first lab to use float8 at a large scale without causing loss spikes - they used FP8 tensor cores, then every 4th matrix multiply, they accumulated to a FP32 accumulator - it seems like the Hopper Tensor Cores accumulation mechanism might not be actual FP32 accumulation. I wrote more here: https://x.com/danielhanchen/status/1872719599029850391
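To get a feel for why that accumulation trick matters, here's a toy numpy simulation (nothing like the real Hopper tensor-core path): it emulates a low-precision accumulator with float16 and shows how periodically flushing partial sums into a float32 accumulator keeps a long dot product from drifting. The "every 4 products" interval just mirrors the comment above; it's not a claim about DeepSeek's actual kernels.

```python
import numpy as np

# Toy demo: low-precision accumulation drifts; periodic promotion to FP32 fixes it.
rng = np.random.default_rng(3407)
x = rng.random(40_000).astype(np.float32)
y = rng.random(40_000).astype(np.float32)
exact = float(np.dot(x.astype(np.float64), y.astype(np.float64)))

# (a) naive: accumulate every product directly in float16
acc16 = np.float16(0.0)
for a, b in zip(x, y):
    acc16 = np.float16(acc16 + np.float16(a) * np.float16(b))

# (b) split: small float16 partial sums, flushed into a float32 accumulator
acc32, partial = np.float32(0.0), np.float16(0.0)
for i, (a, b) in enumerate(zip(x, y), 1):
    partial = np.float16(partial + np.float16(a) * np.float16(b))
    if i % 4 == 0:                      # flush every 4 products
        acc32 = np.float32(acc32 + np.float32(partial))
        partial = np.float16(0.0)
acc32 = np.float32(acc32 + np.float32(partial))

print("exact     ", exact)
print("pure fp16 ", float(acc16), "rel err", abs(float(acc16) - exact) / exact)
print("fp16+fp32 ", float(acc32), "rel err", abs(float(acc32) - exact) / exact)
```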
Would be great, but unfortunately I think intelligence at that compute scale will be limited by hardware, not the model. Though at the hardware limit I would expect it to be roughly human level, especially if optimized for a particular domain.
I remember that Llama 3 was trained on data curated by Llama 2 and it resulted in a model with a significant performance boost (even though it was trained by a previous generation model of the same size).
Maybe by using a strong reasoning model such as R1 for the next generation, even more performance can be extracted from smaller models.
That's already happening, and is in fact even part of the R1 training pipeline. An intermediate small reasoning model churns out training data for RL on a larger model, rinse and repeat. Deepseek also showed model distillation with synthetic reasoning data to work quite well.
It’s a pretty neat paradigm and I see an abstract connection to how brains dream and produce their own synthetic training data while sleeping that supplements their real data used while awake.
Is your first claim in the R1 paper? I didn't see it when I looked
Danielhanchen, your work is continually impressive. Unsloth is great, and I’m repeatedly amazed at your ability to get up to speed on a new model within hours of its release, and often fix bugs in the default implementation. At this point, I think serious labs should give you a few hour head start just to iron out their kinks!
Oh thanks a lot! Appreciate it :) We're always open to collaborating with anyone!
The size reduction while keeping the model coherent is incredible. But I'm skeptical of how much effectiveness was retained. Flappy Bird is well known and the kind of thing a non-reasoning model could get right. A better test would be something off the beaten path that R1 and o1 get right that other models don't.
yeah it is pretty unclear how lobotomized it is without benchmarks.
i’ve gotten full fp8 running on 8xh100, probably going to keep doing that
The size reduction is impressive but unless I missed it, they don't list any standard benchmarks for comparison so we have no way to tell how it compares to the full-size model.
> DeepSeek-R1 has been making waves recently by rivaling OpenAI's O1 reasoning model while being fully open-source.
Do we finally have a model with access to the training architecture and training data set, or are we still calling non-reproducible binary blobs without source form open-source?
It sounds as if they owe you the training architecture and training data set.
It absolutely doesn't. It sounds like further diluting the term "open-source" isn't great.
I assume when people say "open source model" they mean "open weights model". The "open source" term doesn't really make sense here, since machine learning models are not compilations of source code. (Though DeepSeek has published several papers with details on their training process. It's more than just open weights.)
ML models do have a "source" though
If ML models have a source, brains have a source.
Brains don't have a source.
Therefore, ML models don't have a source.
If I invested in a 100x machine because I needed 100 of x to run, and somebody shows how 10x can work, why have I not just become the holder of 10 10x machines, and therefore have already achieved capex to exploit this new market?
I cannot understand why "openai is dead" has legs: repurpose the hardware and data and it can be multiple instances of the more efficient model.
because of discounted cash flow/valuation models.
you invest in a 100x machine expecting a revenue of X, but now you can only charge X/100 because R1 shows that AI inference can be done much more efficiently. see the price decrease of ChatGPT and addition of free O3 etc.
this reduction of future cash flows, ceteris paribus, implies that the present value of these cash flows decrease. this then results in massive repricing to the downside as market participants update their forecasts.
what you are missing is that to assume as you do, you must make the additional assumption that demand for additional compute is infinite. Which may very well be the case, but it is not guaranteed compared to the present realized fact that R1 means lower revenues for AI inference providers -> changes the capex justification for even more hardware -> NVDA receives less revenue.
Thanks. Always a mistake to assume the price of something is bound to your own cost of doing it: the price is bound to the other guy's cheaper price!
Flappy Bird in Python is the new Turing test
:) It's my goto test :) I did amp it up by adding 10 conditions and made a scoring card - I found the original R1 to sometimes forget "import os" or miss some lines as well, so I thought it was at least a good check!
I also like to ask the models to create a simple basic Minecraft type game where you can break pieces and store them in your inventory, but disallow building stuff
I feel any AI can fix those problems when it can finally act. The problem is that AIs cannot run or debug code, or even book a hotel for me. When that is solved and an AI can interact with the code like a human does, it can fix its problems like a human does.
Exactly! Why can’t LLMs run their own code?
they can, feel free to inference and give it an interpreter
Rampancy.
hopefully we eventually push them to make more classic games like motherlode
In general, how do you run these big models on cloud hardware? Do you cut them up layer-wise and run slices of layers on individual A100/H100s?
My understanding is with MoE (Mixture of Experts), you can and should shard it horizontally. The whole model is 600GB, but only 37GB is active during the evaluation of any single output token.
So you can load a different active subset of the MoE into each 80GB GPU, sharding it across something like 32 different GPUs (or can you get away with fewer? Wouldn't be surprised if they can infer on 8x H800 GPUs). Some parameters are common, others are independent. Queries can be dynamically routed between GPUs, potentially bouncing between GPUs as often as once per output token, depending on which experts they need to activate.
Though, I suspect it's normal to stick on one MoE subset for several output tokens.
This has a secondary benefit that as long as the routing distribution is random, queries should be roughly load balanced across all GPUs.
Each MoE layer has its own router, and it activates 8 (out of 256) experts at a time. There's no reason to expect all of them to stay on the same GPU, so you're pretty much guaranteed to have to do all-to-all communication between the GPUs in your cluster after every layer for every token.
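For a concrete picture of what that per-layer router does, here's a minimal numpy sketch of top-k expert routing with the numbers from this thread (256 experts, 8 active). It illustrates the general MoE routing pattern, not DeepSeek's actual router, which also has shared experts and load-balancing tweaks.

```python
import numpy as np

# Minimal per-layer top-k expert routing: score all experts, keep the top 8,
# softmax over the chosen 8 to get mixture weights. Illustrative only.
N_EXPERTS, TOP_K, D_MODEL = 256, 8, 64
rng = np.random.default_rng(0)
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)).astype(np.float32)

def route(token_hidden: np.ndarray):
    scores = token_hidden @ router_w            # one score per expert
    top = np.argsort(scores)[-TOP_K:]           # indices of the 8 winners
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                    # softmax over the chosen 8
    return top, weights

token = rng.standard_normal(D_MODEL).astype(np.float32)
experts, weights = route(token)
print("activated experts:", sorted(experts.tolist()))
print("mixture weights:  ", np.round(weights, 3))

# Because each layer routes independently, consecutive layers will rarely pick
# experts that live on the same GPU, hence the all-to-all traffic mentioned above.
```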
Interesting.
I had assumed the performance advantage for MoE came from minimising traffic between GPUs. But if it's per layer routing, then it's going to massively increase inter-gpu traffic compared to vertical slicing.
I guess that means the performance advantage actually comes when batching thousands of queries? The MoE routing would mean that on each MoE layer, each GPU shard gets a batch of queries that will all hit roughly the same subset of experts (and read the same weights from memory). The batches then shuffle between each MoE layer to re-optimise.
It's kind of like GPU raytracing where you get large performance gains by running coherency sorting on rays and batching similar rays together.
The performance advantage comes from doing 1/32 of the floating point operations compared to a dense layer with the same number of parameters.
The performance comes mostly from the fraction of memory bandwidth needed, as LLMs are mostly memory constrained. Compute matters too, but usually far less than memory.
There are a few ways - the most basic is per layer sharding - DeepSeek uses 3 dense layers, so that can stay on GPU0 (with the embedding layer). There's 58 MoE layers (256 experts, 8 activated) and 1 shared expert per layer. GPU1 would house layers 3 to 9, and so on.
Then by using pipeline parallelism, if a new request comes, we simply stick them in a queue - GPUs 0, 1, 2, ..., 8. Request A is at GPU 2, Request B at GPU 1, Request C at GPU 0 and so on.
The other option is tensor parallelism, where we split the weights evenly. You could combine pipeline and tensor parallelism as well!
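A tiny sketch of that per-layer pipeline sharding, with illustrative numbers only (3 dense layers plus embeddings pinned to GPU 0, the 58 MoE layers split contiguously across the remaining GPUs):

```python
# Illustrative layer-to-GPU mapping for pipeline sharding; the exact split in
# practice depends on per-layer memory, so treat this as a sketch.
N_DENSE, N_MOE, N_GPUS = 3, 58, 9

def layer_to_gpu(layer_idx: int) -> int:
    if layer_idx < N_DENSE:
        return 0                              # dense layers + embeddings
    per_gpu = -(-N_MOE // (N_GPUS - 1))       # ceil-divide MoE layers over 8 GPUs
    return 1 + (layer_idx - N_DENSE) // per_gpu

assignment: dict[int, list[int]] = {}
for layer in range(N_DENSE + N_MOE):
    assignment.setdefault(layer_to_gpu(layer), []).append(layer)

for gpu, layers in assignment.items():
    print(f"GPU {gpu}: layers {layers[0]}-{layers[-1]}")
```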
You could do that, and add pipelining to improve speed.
Was wondering the same, but for HPC clusters :)
Has it been tried on 128GB M4 MacBook Pro? I'm gonna try it, but I guess it will be too slow to be usable.
I love the original DeepSeek model, but the distilled versions are too dumb usually. I'm excited to try my own queries on it.
I'm downloading it now and will report back.
(I've been using the 32B and while it could always be better, I'm not unhappy with it)
How'd it go, and which client are you using? :)
Pretty rough.
Using LM Studio, trying to load the model throws an error of "insufficient system resources."
I disabled this error, set the context length to 1024 and was able to get 0.24 tokens per second. Comparatively, the 32B distill model gets about 20 tokens per second.
And it became incredibly flaky, using up all available ram, and crashing the whole system a few times.
While the M4 Max 128GB handles the 32B well, it seems to choke on this. Here's to hoping someone works on something in-between (or works out what the ideal settings are because nothing I fiddled with helped much).
There's a terminal command to increase the maximum VRAM macOS can use; you can try that, as you're probably going over the limit and the system is resorting to treating the excess as system RAM. (I ran into this problem a couple of times using ollama.)
Maybe VLLM is better at inferencing MoE (also you can set the number of experts to use).
In theory half of the model fits to RAM, so it should be GPU limited if memory management is smart.
Is this actually 1.58 bits? (Log base 2 of 3) I heard of another "1.58 bit" model that actually used 2 bits instead. "1.6 bit" is easy enough, you can pack five 3-state values into a byte by using values 0-242. Then unpacking is easy, you divide and modulo by 3 up to five times (or use a lookup table).
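The "five trits per byte" packing described there is easy to sketch: map each weight in {-1, 0, 1} to a base-3 digit and store five of them as one byte in the range 0-242 (3^5 - 1); unpacking is repeated divmod by 3. This illustrates the ~1.6 bits/weight idea, not any specific GGUF layout.

```python
# Pack five ternary weights {-1, 0, 1} into one byte (0..242), unpack via divmod.

def pack5(trits):
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    byte = 0
    for t in reversed(trits):
        byte = byte * 3 + (t + 1)      # shift to {0, 1, 2}, build a base-3 number
    return byte                        # always in 0..242

def unpack5(byte):
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)        # back to {-1, 0, 1}
    return trits

weights = [1, -1, 0, 0, 1]
b = pack5(weights)
print(b, unpack5(b) == weights)        # prints the packed byte and True
```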
Just ask it about Taiwan (not kidding). I'm not sure I can trust a model that has such a focused political agenda.
Is this akin to the quants already being done to various models when you download a GGUF at 4 bits for example, or is this variable layer compression something new that could also make existing smaller models smaller, so we can fit more into say 12 or 16GB of VRAM?
Big fan of unsloth, they have huge potential, could definitely need some experienced GTM people though, IMO. The pricing page and messages sent there are really not good.
Oh thanks :) Yes agreed we do need better GTM - temporarily it's still me and my brother running Unsloth, so for now we're just prioritizing many more engineering releases :)
It would be really useful to see these evaluated across some of the same evals that the original R1 and deepseek's distills were evaluated on.
Incredible work by the Unsloth brothers again. It’s really cool to see bitnet quantization implemented like this.
DeepSeek R1 in a nutshell
youtube.com/watch?v=Nl7aCUsWykg
Is there any small DS or qwen model that could run on say an M4 Mac Mini Standard (16G) ?
Thanks for the run instructions, unsloth. Deepseek is so new it's been breaking most of my builds.
This is an important step. Especially for beginners or people who are not in the loop, being able to easily type a few simple commands to download, install dependencies, compile, and run everything needed for an LLM gives a sci-fi feeling; it's almost like you can have a helping brain at home.
One thing I've been thinking about doing is combining one of those LLM models running in llama.cpp with the output of whisper.cpp, and connecting its output to some TTS model. I wonder how far we are from Wheels and Roadie from the Pole Position TV series.
Glad they were helpful! :)
How can you have a bit and a half exactly? It doesn't make sense.
It's not a bit and a half. It is 1.58, or really log(3) / log(2), since it allows for three values: -1, 0, and 1.
Is the new LLM benchmark to create flappy bird in pygame?
Can I run this on ollama?
Yes, the instructions are in the OP.
Is there any good quick summary of what's special about DeepSeek? I know it's OSS and incredibly efficient, but news laymen are saying it's trained purely on AI info instead of using a corpus of tagged data... which, I assume, means it's somehow extracting weights or metadata or something from other AIs. Is that it?
https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSee...
see also https://news.ycombinator.com/item?id=42846588
It is going to be truly fucking revolutionary if open-source models are and continue to be able to challenge the state of the art. My big philosophical concern is that AI locks Capital into an absolutely supreme and insurmountable lead over Labour, and into the hands of oligarchs, and the possibility of a future where that's not case feels amazing. It pleases me greatly that this has Trump riled up too, because I think it means he's much less likely to allow existing US model-makers to build moats, as I think he's -- even as a man who I don't think believes in very much -- absolutely unwilling to let the Chinese get the drop on him over this.
I have no doubt open source will catch up (it already has, eh?) at the end of the day, it's just creative / new iterations on what is ultimately the transformer architecture... the amount of "secret" moat-like stuff that OpenAI was doing was bound to be figured out or exceeded eventually, like everything in tech...
Not to make fun of OpenAI and the great work they've done but it's kinda like if I went out in the 90s and said I'm going to found a company to have the best REST APIs. You can always found a successful tech company, but you can't found a successful tech company on a technological architecture or pattern alone.
site is javascript walled
80%? On 2 H100 only? To get near chatgpt 4? Seriously? The 671B version??
they have not benchmarked the quantized model.
> site is javascript walled
I use Qubes OS to protect myself from the JS.
That site should work with a noscript/basic (x)html browser.
Hi, small comment: please remember that in China many things are sponsored by or subsidized by the government. "We [China] can do it for less...", "it's cheaper in China..." only means the government gave us a pile of cash and help to get here.
I 100% expect some downvotes from the ccp.
And the United States subsidizes plenty of things too. For example the CHIPS act has $39 billion in subsidies for chip manufacturing on U.S. soil. There's nothing wrong with either country's subsidies. I personally don't believe in maximum free market. Government subsidy is more often than not a good thing and we need more of them both here and in China.
> Hi, small comment: please remember that in China many things are sponsored by or subsidized by the government. "We [China] can do it for less...", "it's cheaper in China..." only means the government gave us a pile of cash and help to get here.
And that's a really important strategic advantage China has versus America, which has such an insane fixation on pure(ish) free markets and free trade that it gives away its advantages in strategic industry after strategic industry.
Some people falsely infer from the experience with the Soviet Union that freer markets always win geopolitical competition, but that's false.
It’s false except for every time that it has been true
> And that's a really important strategic advantage China has versus America, which has such an insane fixation on pure(ish) free markets and free trade that it gives away its advantages in strategic industry after strategic industry.
> Some people falsely infer from the experience with the Soviet Union that freer markets always win geopolitical competition, but that's false.
The data we have is 500 years of free markets in the western world and the verdict is overwhelmingly: Yes, more freedom means more winning.
Just invite some incompetent bureaucrat over your house to dictate how you should cook and you'll quickly agree.
> The data we have is 500 years of free markets in the western world and the verdict is overwhelmingly: Yes, more freedom means more winning.
No, more freedom means more winning to a point. Past that point it does not, and I'd argue that's where the US is.
> Just invite some incompetent bureaucrat over your house to dictate how you should cook and you'll quickly agree.
That's supposed to be convincing, somehow? Just invite some "competent" capitalist over to your house, and he'll sell your fishing rod in exchange for a short-term discount on fish at the supermarket, and see how well you win.
> I 100% expect some downvotes from the ccp.
Always happy to oblige when someone insinuates that any critics must be government agents
Naa just wanted to see who would bite.