Don’t let the “flash” name fool you: this is an amazing model.
I have been playing with it for the past few weeks and it’s genuinely my new favorite. It’s so fast, and its world knowledge is so vast, that it outperforms Claude Opus 4.5 or GPT 5.2 extra high for a fraction (basically an order of magnitude less!!) of the inference time and price.
Oh wow - I recently tried 3 Pro preview and it was too slow for me.
After reading your comment I ran my product benchmark against 2.5 flash, 2.5 pro and 3.0 flash.
The results are better AND the response times have stayed the same.
What an insane gain - especially considering the price compared to 2.5 Pro.
I'm about to get much better results for a third of the price. Not sure what magic Google did here, but I would love to hear a more technical deep dive comparing what they do differently in the Pro and Flash models to achieve such performance.
Also wondering, how did you get early access? I'm using the Gemini API quite a lot and have quite a nice internal benchmark suite for it, so I would love to toy with the new ones as they come out.
I periodically ask them questions about topics that are subtle or tricky, and somewhat niche, that I know a lot about, and find that they frequently provide extremely bad answers. There have been improvements on some topics, but there's one benchmark question that I have that just about every model I've tried has completely gotten wrong.
Tried it on LMArena recently, got a comparison between Gemini 2.5 flash and a codenamed model that people believe was a preview of Gemini 3 flash. Gemini 2.5 flash got it completely wrong. Gemini 3 flash actually gave a reasonable answer; not quite up to the best human description, but it's the first model I've found that actually seems to mostly correctly answer the question.
So, it's just one data point, but at least for my one fairly niche benchmark problem, Gemini 3 Flash has successfully answered a question that none of the others I've tried have (I haven't actually tried Gemini 3 Pro, but I'd compared various Claude and ChatGPT models, and a few different open weights models).
So, I guess I need to put together some more benchmark problems, to get a better sample than one, but it's at least now passing an "I can find the answer to this in the top 3 hits of a Google search for a niche topic" test better than any of the other models.
Still a lot of things I'm skeptical about in all the LLM hype, but at least they are making some progress in being able to accurately answer a wider range of questions.
I don't think tricky niche knowledge is the sweet spot for GenAI, and it likely won't be for some time. Instead, it's a great replacement for rote tasks where less-than-perfect performance is good enough: transcription, OCR, boilerplate code generation, etc.
The thing is, I see people use it for tricky niche knowledge all the time; using it as an alternative to doing a Google search.
So I want to have a general idea of how good it is at this.
I found something that was niche, but not super niche; I could easily find a good, human written answer in the top couple of results of a Google search.
But until now, all LLM answers I've gotten for it have been complete hallucinated gibberish.
Anyhow, this is a single data point, and I need to expand my set of benchmark questions a bit now, but this is the first time that I've actually seen progress on this particular personal benchmark.
Basically, making sense of unstructured data is super cool. I can get 20 people to write an answer however they feel like it, and the model can convert it to structured data. That's something I would otherwise have to spend time on, or I would have to make a form with mandatory fields that annoy the audience.
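If it helps, here's roughly what that step looks like in code: a minimal sketch assuming an OpenAI-style chat completions API with JSON output mode. The model name and the schema fields are placeholders, not anything specific.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical schema for illustration; adjust the keys to your survey.
SCHEMA_HINT = (
    "Return JSON with keys: name (string or null), "
    "satisfaction (integer 1-5 or null), complaints (list of strings)."
)

def extract(free_text_answer: str) -> dict:
    """Turn one free-form written answer into structured data."""
    resp = client.chat.completions.create(
        model="your-model-here",  # placeholder: any JSON-capable model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract structured data from the user's answer. " + SCHEMA_HINT},
            {"role": "user", "content": free_text_answer},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# e.g. extract("I'm Priya, mostly happy (4/5) but shipping was slow.")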
I am already building useful tools with the help of models. Asking tricky or trivia-style questions is fun and games; there are much more interesting ways to use AI.
So this is an interesting benchmark, because if the answer is actually in the top 3 google results, then my python script that runs a google search, scrapes the top n results and shoves them into a crappy LLM would pass your benchmark too!
Which also implies that (for most tasks), most of the weights in a LLM are unnecessary, since they are spent on memorizing the long tail of Common Crawl... but maybe memorizing infinite trivia is not a bug but actually required for the generalization to work? (Humans don't have far transfer though... do transformers have it?)
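For concreteness, that baseline script could be as simple as the sketch below; search_top_results is a hypothetical helper (plug in whatever search API you actually have access to), and the model name is a placeholder for the "crappy LLM".

```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()

def search_top_results(query: str, n: int = 3) -> list[str]:
    """Hypothetical helper: return the top-n result URLs for a query."""
    raise NotImplementedError

def answer_with_search(query: str) -> str:
    pages = []
    for url in search_top_results(query):
        html = requests.get(url, timeout=10).text
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        pages.append(text[:4000])  # crude truncation to fit the context window
    context = "\n\n---\n\n".join(pages)
    resp = client.chat.completions.create(
        model="any-cheap-model",  # placeholder
        messages=[
            {"role": "system", "content": "Answer using only the provided pages."},
            {"role": "user", "content": f"{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```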
I've tried doing this query with search enabled in LLMs before, which is supposed to effectively do that, and even then they didn't give very good answers. It's a very physical kind of thing, and it's easy to conflate with other similar descriptions, so they would frequently conflate various different things and give some horrible mash-up answer that wasn't about the specific thing I'd asked about.
Even the most magical wonderful auto-hammer is gonna be bad at driving in screws. And, in this analogy I can't fault you because there are people trying to sell this hammer as a screwdriver. My opinion is that it's important to not lose sight of the places where it is useful because of the places where it isn't.
The problem with publicly disclosing these is that if lots of people adopt them they will become targeted to be in the model and will no longer be a good benchmark.
This thought process is pretty baffling to me, and this is at least the second time I've encountered it on HN.
What's the value of a secret benchmark to anyone but the secret holder? Does your niche benchmark even influence which model you use for unrelated queries? If LLM authors care enough about your niche (they don't) and fake the response somehow, you will learn on the very next query that something is amiss. Now that query is your secret benchmark.
Even for niche topics it's rare that I need to provide more than 1 correction or knowledge update.
The point is that it's a litmus test for how well the models do with niche knowledge _in general_. The point isn't really to know how well the model works for that specific niche.
Ideally of course you would use a few of them and aggregate the results.
Obviously, the fact that I've done Google searches and tested the models on these means that their systems may have picked up on them; I'm sure that Google uses its huge dataset of Google searches and its search index as inputs to its training, so Google has an advantage here. But, well, that might be why Google's new models are so much better: they're actually taking advantage of this massive dataset they've had for years.
Don't the models typically train on their input too? I.e. submitting the question also carries a risk/chance of it getting picked up?
I guess they get such a large input of queries that they can only realistically check and therefore use a small fraction? Though maybe they've come up with some clever trick to make use of it anyway?
Yeah, probably asking on LMArena makes this an invalid benchmark going forward, especially since I think Google is particularly active in testing models on LMArena (as evidenced by the fact that I got their preview for this question).
I'll need to find a new one, or actually put together a set of questions to use instead of just a single benchmark.
OpenAI made a huge mistake neglecting fast inference models. Their strategy was GPT-5 for everything, which hasn't worked out at all. I'm really not sure which model OpenAI wants me to use for applications that require lower latency. If I follow the advice in their API docs about which models to use for faster responses, I'm told to either use GPT-5 with low thinking, replace GPT-5 with GPT-4.1, or switch to the mini model. So as a developer I'm now running evals on all three of these combinations. I'm running my evals on Gemini 3 Flash right now, and without thinking it's outperforming GPT-5 thinking. OpenAI should stop trying to come up with ads and make models that are useful.
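For what it's worth, the eval loop across those three combinations doesn't need to be fancy. A minimal sketch, assuming the OpenAI Python SDK, a naive substring check for scoring, and the model names from their docs as described above:

```python
from openai import OpenAI

client = OpenAI()

# The three combinations their docs suggest, per the comment above;
# the extra kwargs are assumptions about the current SDK.
CANDIDATES = [
    ("gpt-5", {"reasoning_effort": "low"}),
    ("gpt-4.1", {}),
    ("gpt-5-mini", {}),
]

# Stand-in eval set; replace with your own prompts and expected answers.
CASES = [
    ("What is the capital of France? Answer in one word.", "paris"),
]

for model, extra in CANDIDATES:
    hits = 0
    for prompt, expected in CASES:
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            **extra,
        ).choices[0].message.content
        hits += int(expected in out.lower())
    print(f"{model}: {hits}/{len(CASES)}")
```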
Hard to find info, but I think the -chat versions of 5.1 and 5.2 (gpt-5.2-chat) are what you're looking for. They might just be an alias for the same model with very low reasoning, though. I've seen other providers do the same thing, where they offer a reasoning and a non-reasoning endpoint. Seems to work well enough.
They’re not the same; there are (at least) two different tunes per 5.x release.
For each, you can use it as “instant”, supposedly without thinking (though these are all exclusively reasoning models), or specify a reasoning amount (low, medium, high, and now xhigh; if you don’t specify, it defaults to none). Or you can use the -chat version, which is also “no thinking” but in practice performs markedly differently from the regular version with thinking off (not more or less intelligent, but with a different style and answering method).
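Roughly, the two routes look like this in the Python SDK (a sketch: the reasoning_effort values and the -chat alias are my reading of the behavior described above, and exact model IDs may differ):

```python
from openai import OpenAI

client = OpenAI()
prompt = [{"role": "user", "content": "Summarize HTTP/3 in one sentence."}]

# Route 1: the base reasoning model with an explicit effort level
# ("low" | "medium" | "high"; omit it for the default).
fast = client.chat.completions.create(
    model="gpt-5.1",  # model names as discussed in this thread
    reasoning_effort="low",
    messages=prompt,
)

# Route 2: the separately tuned -chat variant, also "no thinking" but
# with a noticeably different style and answering method in practice.
chat = client.chat.completions.create(
    model="gpt-5.1-chat-latest",  # assumed alias; check the model list
    messages=prompt,
)
```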
It's weird they don't document this stuff. Understanding things like tool call latency and time to first token is extremely important in application development.
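Agreed. At minimum you can measure time to first token yourself with streaming; a small sketch assuming the OpenAI Python SDK (the same pattern works against any OpenAI-compatible endpoint):

```python
import time
from openai import OpenAI

client = OpenAI()

def time_to_first_token(model: str, prompt: str) -> float:
    """Seconds from request to the first streamed content token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")  # no content tokens arrived

# e.g. time_to_first_token("gpt-5-mini", "Say hi.")
```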
Hardware is a factor here. GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data. There are lots of other factors here, but latency specifically favours TPUs.
The only non-TPU fast models I'm aware of are the ones running on Cerebras, which can be much faster because of their wafer-scale chips, and Grok, which has a super fast mode, but it has a cheat code of ignoring guardrails and making up its own world knowledge.
Yeah, I'm surprised that they've been through GPT-5.1 and GPT-5.1-Codex and GPT-5.1-Codex-Max and now GPT-5.2, but their most recent mini model is still GPT-5-mini.
It's easy to comprehend, actually: they're putting everything on "having the best model". It doesn't look like they're going to win, but that's still their bet.
Alright, so we have more benchmarks, including hallucinations, and Flash doesn't do well on that one. But generally it beats Gemini 3 Pro, GPT 5.1 thinking, and GPT 5.2 thinking xhigh (but then Sonnet, Grok, Opus, Gemini, and 5.1 all beat 5.2 xhigh) - everything. Crazy.
I wonder at what point everyone who over-invested in OpenAI will regret their decision (except maybe Nvidia?). Maybe Microsoft doesn't need to care; they get to sell the models via Azure.
Very soon, because clearly OpenAI is in very serious trouble. They are at scale, have no business model, and face a competitor that is much better than them at almost everything (ads, hardware, cloud, consumer, scaling).
Oracle's stock skyrocketed, then took a nosedive. Financial experts warned that companies that bet big on OpenAI to pump their stock, like Oracle and CoreWeave, would go down the drain, and down the drain they went (so far: -65% for CoreWeave and nearly -50% for Oracle compared to their OpenAI-hype all-time highs).
Markets seem to be in a "show me the OpenAI money" mood at the moment.
And even financial commentators who don't necessarily know a thing about AI can realize that Gemini 3 Pro and now Gemini 3 Flash are giving ChatGPT a run for its money.
Oracle and Microsoft have other sources of revenue, but for those really drinking the OpenAI Kool-Aid, including OpenAI itself, I sure as heck don't know what the future holds.
My safe bet however is that Google ain't going anywhere and shall keep progressing on the AI front at an insane pace.
OpenAI's doom was written when Altman (and Nadella) got greedy, threw away the nonprofit mission, and caused the exodus of talent and funding that created Anthropic. If they had stayed nonprofit the rest of the industry could have consolidated their efforts against Google's juggernaut. I don't understand how they expected to sustain the advantage against Google's infinite money machine. With Waymo Google showed that they're willing to burn money for decades until they succeed.
This story also shows the market corruption of Google's monopolies, but a judge recently gave them his stamp of approval so we're stuck with it for the foreseeable future.
> I don't understand how they expected to sustain the advantage against Google's infinite money machine.
I ask this question about Nazi Germany. They adopted the Blitzkrieg strategy and expanded unsustainably, but it was only a matter of time until powers with infinite resources (US, USSR) put an end to it.
I know you're making an analogy but I have to point out that there are many points where Nazi Germany could have gone a different route and potentially could have ended up with a stable dominion over much of Western Europe.
The most obvious decision points were betraying the USSR and declaring war on the US (no one has really been able to pin down the reason, but presumably it was to get Japan to attack the Soviets from the other side, which then didn't happen). Another could have been consolidating after the capitulation of France, rather than continuing to attack further.
Thanks, having it walk a hardcore SDR signal chain right now --- oh damn, it just finished. The blog post makes it clear this isn't just some 'lite' model: you get low latency and cognitive performance. Really appreciate you amplifying that.
Interesting. Flash suggests more power to me than Mini. I never use gpt-5-mini in the UI whereas Flash appears to be just as good as Pro just a lot faster.
Fair point. Asked Gemini to suggest alternatives, and it suggested Gemini Velocity, Gemini Atom, Gemini Axiom (and more). I would have liked `Gemini Velocity`.
I agree with this observation. Gemini does feel like a code red for basically every AI company (ChatGPT, Claude, etc.), in my opinion, if the underlying model is fast, cheap, and good enough.
I hope open source AI models catch up to Gemini 3 / Gemini 3 Flash. Or Google open-sources it, but let's be honest, Google isn't open-sourcing Gemini 3 Flash. I guess the best bets in open source nowadays are probably GLM, DeepSeek Terminus, or maybe Qwen/Kimi.
I would expect open weights models to always lag behind; training is resource-intensive and it’s much easier to finance if you can make money directly from the result. So in a year we may have a ~700B open weights model that competes with Gemini 3, but by then we’ll have Gemini 4, and other things we can’t predict now.
There will be diminishing returns, though, as future models won't be that much better. We will reach a point where the open source model is good enough for most things, and being on the latest model will no longer be so important.
For me, the bigger concern, which I have mentioned on other AI-related topics, is that AI is eating all the production of computer hardware, so we should be worried about hardware prices getting out of hand and making it harder for the general public to run open source models. Hence I am rooting for China to reach parity on process node and crash PC hardware prices.
I had a similar opinion, that we were somewhere near the top of the sigmoid curve of model improvement that we could achieve in the near term. But given continued advancements, I’m less sure that prediction holds.
If Gemini 3 Flash is really confirmed to be close to Opus 4.5 at coding, and a similarly capable model is open weights, I want to buy a box with a USB cable that has that thing loaded, because today that's enough to do the engineering work of a small team.
What demographic are you in that is leaving Anthropic en masse, and that they care about retaining? From what I see, Anthropic is targeting enterprise and coding.
Claude Code just caught up to Cursor (No. 2) in revenue and, based on trajectories, is about to pass GitHub Copilot (No. 1) in a few more months. They just locked down Deloitte with 350k seats of Claude Enterprise.
At my Fortune 100 financial company, they just finished crushing OpenAI in a broad enterprise-wide evaluation. Google Gemini was never in the mix, never on the table, and still isn't. Every one of our engineers has $1k a month allocated in Claude tokens for Claude Enterprise and Claude Code.
There is one leader with enterprise. There is one leader with developers. And Google has nothing to make a dent: not Gemini 3, not Gemini CLI, not Antigravity, not Gemini. There is no Code Red for Anthropic. They have clear target markets, and nothing from Google threatens those.
> Google Gemini was never in the mix, never on the table, and still isn't. Every one of our engineers has $1k a month allocated in Claude tokens for Claude Enterprise and Claude Code.
Does that mean y'all never evaluated Gemini at all, or just that it couldn't compete? I'd be worried that prior performance of the models prejudiced the evaluations away from Gemini, but I am a Claude Code and heavy Anthropic user myself, so, shrug.
If this quantification of lag is anywhere near accurate (it may be larger and/or more complex to describe), open source models will soon be simply good enough. Perhaps companies like Apple could be second-round AI growth companies, marketing optimized private AI devices via already-capable MacBooks or the rumored appliances. While not obviating cloud AI, they could cheaply provide capable models without a subscription while driving their revenue through increased device sales. If the cost of cloud AI increases to support its expense, this use case will act as a check on subscription prices.
Pretty much every person in the first (and second) world is using AI now, and only a small fraction of those people are writing software. This is also reflected in OpenAI's report from a few months ago, which found programming to be only 4% of tokens.
That may be so, but I rather suspect the breakdown would be very different if you only count paid tokens. Coding is one of the few things where you can actually get enough benefit out of AI right now to justify high-end subscriptions (or high pay-per-token bills).
It depends on what you count as AI (just googling makes you use the LLM summary), but even my mother, who is really not tech-savvy, loved what Google Lens can do after I showed her.
Apart from my very old grandmothers, I don't know anyone not using AI.
How many people do you know? Do you talk to your local shopkeeper? Or the clerk at the gas station? How are they using AI? I'm a pretty techy person with a lot of tech friends, and I know more people not using AI (on purpose, or for lack of knowledge) than people who do.
Just to point this out: many of these frontier models' costs aren't that far away from two orders of magnitude more than what DeepSeek charges. It doesn't compare the same, no, but with coaxing I find it to be a pretty capable, competent coding model, able to answer a lot of general queries pretty satisfactorily (but if it's a short session, why economize?). $0.28/M in, $0.42/M out. Opus 4.5 is $5/$25 (17x/60x).
I've been playing around with other models recently (Kimi, GPT Codex, Qwen, others) to try to better appreciate the difference. I knew there was a big price difference, but watching myself feed dollars into the machine rather than nickels has also instilled in me quite the reverse appreciation.
I can only assume "if you're not getting charged, you are the product" has to be somewhat in play here. But when working on open source code, I don't mind.
To me as an engineer, 60x for output (which is most of the cost I see, AFAICT) is not that significantly different from 100x.
I tried to be quite clear in showing my work here. I agree that 17x is much closer to a single order of magnitude than to two. But 60x is, to me, enough of the way to 100x that I don't feel bad saying it's nearly two orders (it's 1.78 orders of magnitude). To me, your complaint feels rigid and ungenerous.
My post is showing as -1 to me, but I stand by it. Arguing over the technicalities here (is 1.78 close enough to 2 orders to count?) feels beside the point: DeepSeek is vastly more affordable than nearly everything else, putting even Gemini 3 Flash to shame. And I don't think people are aware of that.
I guess for my own reference, since I didn't do it the first time: at $0.50/$3.00 per M in/out, Gemini 3 Flash is 1.8x and 7.1x (about 10^0.85) more expensive than DeepSeek.
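A quick sanity check of those ratios (a sketch; the prices are the per-million-token figures quoted in this thread):

```python
import math

# $/M tokens, (input, output), as quoted above
deepseek = (0.28, 0.42)
opus = (5.00, 25.00)
flash = (0.50, 3.00)

for name, (pin, pout) in [("Opus 4.5", opus), ("Gemini 3 Flash", flash)]:
    rin, rout = pin / deepseek[0], pout / deepseek[1]
    print(f"{name}: {rin:.1f}x in, {rout:.1f}x out "
          f"({math.log10(rout):.2f} orders of magnitude on output)")

# Opus 4.5: 17.9x in, 59.5x out (1.77 orders of magnitude on output)
# Gemini 3 Flash: 1.8x in, 7.1x out (0.85 orders of magnitude on output)
```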
I struggle to see the incentive to do this; I have similar thoughts about locally run models. The only use cases I can imagine are small jobs at scale, perhaps something like autocomplete integrated into your deployed application, or extreme privacy, honoring NDAs, etc.
Otherwise, if it's a short prompt or answer, a SOTA (state-of-the-art) model will be cheap anyway, and if it's a long prompt/answer, it's way more likely to be wrong, and a lot more time/human cost is spent on checking/debugging any issue or hallucination, so again SOTA is better.
Really only if you are paranoid. It's incredibly unlikely that the labs are lying about not training on your data for the API plans that offer it. Breaking trust with outright lies would be catastrophic to any lab right now. Enterprise demands privacy, and the labs will be happy to accommodate (for the extra cost, of course).