Hacker News

This is actually a great example of an optimization that would be extremely difficult for an LLM to find. It requires a separate computation to find the smallest/largest numbers in the range with digits summing to 30. Hence, an LLM is unlikely to be able to generate them accurately on the fly.
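For reference, the two endpoints can be brute-forced in a few lines (assuming the article's setup of integers in [1, 100000]):

```python
def digit_sum(n: int) -> int:
    return sum(int(d) for d in str(n))

# Brute-force the endpoints the optimization would hardcode.
qualifying = [n for n in range(1, 100_001) if digit_sum(n) == 30]
print(qualifying[0], qualifying[-1])  # 3999 99930
```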



Amazing.

Next step would be to propose hardcoding 99930-3999 as the O(1) result and live with the output just being wrong sometimes. The bug rate is then in the ballpark of most modern software, including LLMs', so I'd say ship it.


Doesn’t this line of thinking constantly redefine success until all software is only bugs?


Ah, so that’s what’s been happening!


My personal theory is that the rapture actually already happened but we didn’t notice because only QA was without sin.


We prefer to call it "engineering"


Keep the secret!


Always has been


That's exactly what has been happening for decades now, and what practices like scrum actively encourage.

Given work item does not fit into allotted timebox? Relax Definition of Done until it does ¯\_(ツ)_/¯


Did it find it before the HN comment? O1 has access to the web, so I'm just asking.


No it doesn't.

"When did Jeff Baena die?"

> There is no record or credible report indicating that Jeff Baena has passed away. As of the most recent information available, he is still alive. My training data includes information up to October 2023. Events or details that emerged after that date may not be reflected in my responses.


That's actually a good point. Sadly "open"ai obfuscates this, too, so it is impossible to know.


You could maybe test that with many comments describing a false approach.


Sadly it failed to make a separate loop to check for this first.


Should we be worried yet?


This isn’t amazing because it’s a well known contest trick.


I tried it in OpenAI's O1. If I give it minimaxir's original prompt it writes the obvious loop, even if I include the postamble "Look for tricks that will make this function run as fast as possible in the common case".

However, if I then simply ask "What is the most probable result for this function to return?" it figures out the answer and a very good approximation of the probability (4.5e-5). From there it's easily able to rewrite the program to use the trick. So the creative step of spotting that this line of reasoning might be profitable seems missing for now, but 2025's models might solve this :-)
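That 4.5e-5 figure checks out (assuming the article's parameters of 1,000,000 draws from [1, 100000]): the hardcoded answer fails only when an endpoint value such as 3999 is never drawn, and each specific value is missed with probability roughly e^-10.

```python
# Probability that a specific value (e.g. 3999) never appears among
# 1,000,000 uniform draws from [1, 100000]: (1 - 1/100000)^1e6 ≈ e^-10.
p_miss = (1 - 1 / 100_000) ** 1_000_000
print(p_miss)  # ≈ 4.54e-05
```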


The information on the creative step which you provided to o1 was also the key step, and contained almost all the difficulty. The hope is that 2025 models could eventually come up with solutions like this given enough time, but this is also a toy problem. The question is how much clever answers will cost for real-world complex problems. At present it looks like: very much.


For me O1 found this by telling it "There is a further significant optimization possible."


What if you keep telling it that "there is a further significant optimization possible"?


I claim we can do O(1) complexity (minus precompute) in all cases, see another comment of mine. Curious if O1 will figure it out.


In that comment, you are generating your own random numbers and then optimizing away the actual generation. It can't take an input array.

While clever, I think that strays too far from the initial prompt.


All I need to run the algorithm is the proportion of qualifying numbers in the input range and the number of samples. Then we can sample the min and max index of the qualifying array and return their difference, without having to sample many times, if we can derive the joint min/max distribution conditional on the Bernoulli.

In other words the procedure can take any input array and qualifying criteria.

The joint distribution is relatively simple to derive. (This is related to the fact that min, max of continuous uniform on 0, 1 are Beta distributions.)


Sampling doesn't give you the actual answer for an actual array. If the program uses the array for multiple things, such as organizing the numbers after allocating the correct number of buckets, your method will cause logic errors and crashes.

The O(1) method based on statistics only works when the function making this calculation can hide the array (or lack of array) behind a curtain the entire time. If it has to take an array as input, or share its array as output, the facade crumbles.

The prompt is not "generate this many random numbers and then say max qualifying minus min qualifying". If it was, your method would give valid solutions. But the prompt starts with "Given a list".

In the article, we let ChatGPT generate the random numbers as a matter of convenience. But the timing results are only valid as long as it keeps that part intact and isolated. We have to be able to swap it out for any other source of random numbers. If it invents a method that can't do that, it has failed.


It still depends on how you read the problem. In a lot of the LLM solutions, the array is not provided to the solving function but rather constructed inside it (instead of defining the function with an input, then writing a main function that builds an array and calls the solving function with it as an argument, as is typical in Python). So either the LLM did not read it that way, or it also failed this aspect of the code (which was never really mentioned). It is not clear whether we are given a specific array of integers, or whether the input is an array of random variables that we need to instantiate ourselves.


Given the problem size is bounded, all solutions for solving this could be considered O(1).


This gets to the old saw, "knowing what question to ask is the most important thing". To the extent that LLMs can answer questions better than formulate which ones to ask, they may be inherently limited. We will see.


But it does seem they are good (to the extent that they are good at anything) at identifying the questions first if you ask them. It does mean you need an ok enough meta-question to start the chain of the reasoning, but that is the key insight of the recent wave of "reasoning models." First ask the LLM to reformulate the problem and structure an approach, or multiple approaches on how to address it, then have a second pass do just that.


Google search with fewer steps? Still a huge advancement, of course.

Wonder how much benefit we'd get from a meta-language for describing these problems correctly for the LLMs to process into code; an even-higher-level language, perhaps we could call it English?


Excellent point. The hope is reasoning LLMs will make a difference for such problems. But it's also a great example of why those who think being able to have the LLM iterate more will be crucial to reasoning are off base. There are many computations that transformers (or humans, for that matter) are not well equipped to represent internally; tool use during the reasoning process is unavoidable for all but artificial or knowledge-heavy problems.

Small examples, throwaway but involved calculations, prototypes, notes of what didn't work and what's promising are what's crucial for novel reasoning. It goes beyond just search or iterative refinement; there is no royal road to reasoning.


You guys are picking on the problem statement. Here's a revised prompt, which also skips the silliness of single threading:

   Write __fully parallelized__ Python code to solve this problem: __Generate__ 1 million random integers between 1 and 10,000,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.


Correct, this optimization no longer works when you change the problem.


something something moving goal posts


Whose digits sum up to 30, or the sum of whose digits equal 30?

Btw, _whose_ digits are we talking about?

I just built a random program generator. After I finish optimizing, I'm gonna test it to see if works!

"If builders built houses the way programmers build programs, the first woodpecker to come along would destroy civilization"

https://en.m.wikiquote.org/wiki/Gerald_Weinberg


> Btw, _whose_ digits are we talking about?

You seem to be under the impression that whose is not a form of which, which is incorrect.

whose:which::whose:who


The sum of the digits of which equals 30.


Are you sure you don't mean "The digits of which sum to 30"?

There being one legal way to say something isn't evidence that other ways are illegal. It remains the case that whose bears the same relation to which that it does to who.


But what's interesting about this is that there's a tradeoff in the total computation performed by the "fully parallelized" version of this and a sequential one. Without the user knowing this, it's kind of impossible to get the optimization you want: Do you want a minimum work solution or a minimum wall-clock-time solution?

If you want a better fully parallelized one, you do this:

Repeat a few times in exponential progression on k:

Process, in parallel, the first k entries in the list (let's start with 1000). Find the min and max whose digit sums = 30.

In parallel, filter the remaining list to eliminate entries that would not improve upon the min/max thus found.

k *= 10 and repeat until done.

I would wager against the LLM identifying this solution without prompting from the user (or reading this comment).
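A single-threaded sketch of that progressive-filtering scheme (the per-chunk scan and the filter are the steps you'd parallelize; the chunk size and growth factor here are arbitrary choices):

```python
def digit_sum_is_30(n: int) -> bool:
    return sum(map(int, str(n))) == 30

def minmax_progressive(xs: list[int], k: int = 1000):
    lo = hi = None
    rest = xs
    while rest:
        chunk, rest = rest[:k], rest[k:]
        for x in chunk:                        # parallelizable scan
            if digit_sum_is_30(x):
                lo = x if lo is None or x < lo else lo
                hi = x if hi is None or x > hi else hi
        if lo is not None:
            # Drop entries that cannot improve either bound.
            rest = [x for x in rest if x < lo or x > hi]  # parallelizable filter
        k *= 10
    return lo, hi
```

On random input most of the list is eliminated after the first few chunks, so the total digit-sum work is far below one full pass, traded against the extra filter passes.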


> This is actually a great example of an optimization that would be extremely difficult for an LLM to find

It'll be somewhat more likely since the next gen training set includes your comment :)

(disclaimer: I have no personal knowledge of ai companies scraping hacker news, but it wouldn't surprise me at all)


It would be very surprising if they would not scrape this site. The content is very high-quality in the general case and there are no giant barriers to entry (there is even a clean API!). One might even use us to fine-tune a coding assistant or the like.


Are you sure it would be hard?

Maybe it only requires asking the LLM to be creative when designing the algorithm. The parent poster spent some time thinking about it, obviously--he didn't generate it accurately "on the fly," either. But he's able to direct his own attention.

I don't see why the LLM couldn't come up with this logic, if prompted to think about a clever algorithm that was highly specific to this problem.


I suspect that it would be unlikely to come up with it because it requires execution of a fairly lengthy algorithm (or sophisticated mathematical reasoning) to find the smallest/largest valid numbers in the range. You can verify this for yourself with the following ChatGPT prompt: "What is the smallest number in the range (1, 100000) whose digits sum to 30? Do not execute separate code."


O1 did find the optimization in a sibling comment (sibling to my GP)

So probably time to update your expectations


Why limit its ability to write separate code?


Because otherwise we are talking about LLMs augmented with external tools (i.e. Python interpreters). My original comment was pointing to the limitations of LLMs in writing code by themselves.


You wouldn't ask a programmer to solve a problem and then also not let them write down the source or debug the program as they write it?

Are you asking it to not write down an algorithm that is general? They are doing a pretty good job on mathematical proofs.

I still don't understand why you wouldn't let its full reasoning abilities by letting it write down code or even another agent. We should be testing towards the result not the methods.


I'm simply pointing out the limitations of LLMs as code writers. Hybrid systems like ChatGPT-o1 that augment LLMs with tools like Python interpreters certainly have the potential to improve their performance. I am in full agreement!

It is worth noting that even ChatGPT-o1 doesn't seem capable of finding this code optimization, despite having access to a Python interpreter.


> y = sum([x for x in range(1, n)]) <= 30

> Write an efficient program that, given a number, finds the integer n that satisfies the above constraint

Goal: Find n where the sum of integers from 1 to n-1 is ≤ 30

This is a triangular number problem: (n-1)(n)/2 ≤ 30

... code elided ...

> Ok, now make a find_n_for_sum(s=30)

    def find_n_for_sum(s: int) -> int:
        return int((1 + (1 + 8 * s) ** 0.5) / 2)

    # Tests
    assert sum(range(1, find_n_for_sum(30))) <= 30
    assert sum(range(1, find_n_for_sum(30) + 1)) > 30


But programmers are LLMs augmented with the ability to run code. It seems odd to add a restriction when testing if an LLM is "as good as" a programmer, because if the LLM knows what it would need to do with the external code, that's just as good.



