
I'm amused that neither the LLM nor the author identified one of the simplest and most effective optimizations for this code: test whether the number is < min or > max _before_ computing the digit sum. It's a free 5.5x speedup that renders some of the other optimizations, like memoizing digit sums, unnecessary.

On an m1 macbook pro, using numpy to generate the random numbers, using mod/div to do digit sum:

Base: 55ms

Test before digit sum: 7-10ms, which is pretty close to the numba-optimized version from the post with no numba and only one line of numpy. Using numba slows things down unless you want to do a lot of extra work of calculating all of the digit sums in advance (which is mostly wasted).
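A pure-Python sketch of that idea (this is not the author's benchmark code, and the function names are mine): keep running min/max bounds and only compute a digit sum when a value falls outside them.

```python
def digit_sum(n):
    # Classic mod/div digit sum.
    s = 0
    while n:
        s, n = s + n % 10, n // 10
    return s

def min_max_diff(nums, target=30):
    lo = hi = None
    for n in nums:
        # Skip the digit-sum work unless n could widen the bounds.
        if lo is not None and lo <= n <= hi:
            continue
        if digit_sum(n) == target:
            if lo is None:
                lo = hi = n
            else:
                lo, hi = min(lo, n), max(hi, n)
    return None if lo is None else hi - lo
```

With a million draws from a 100,000-value range, the bounds converge almost immediately, so nearly all iterations hit the cheap range check.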

The LLM appears less good at identifying the big-O improvements than other things, which is pretty consistent with my experience using them to write code.



There's another, arguably even simpler, optimization that makes me smile. (Because it's silly and arises only from the oddity of the task, and because it's such a huge performance gain.)

You're picking 1,000,000 random numbers from 1 to 100,000. That means that any given number is much more likely to appear than not. In particular, it is very likely that the list contains both 3999 (which is the smallest number with digit-sum 30) and 99930 (which is the largest number in the range with digit-sum 30).

Timings on my machine:

Naive implementation (mod+div for digit-sums): 1.6s. Computing digit-sum only when out of range: 0.12s. Checking for the usual case first: 0.0004s.

The probability that the usual-case check doesn't succeed is about 10^-4, so it doesn't make that big a difference to the timings whether in that case we do the "naive" thing or the smarter thing or some super-optimized other thing.
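A sketch of that usual-case check (function name mine; the fallback is deliberately naive, since it runs with probability around 10^-4):

```python
def min_max_diff_fast(nums, target=30):
    s = set(nums)
    # 3999 and 99930 are the extreme values in [1, 100000] with digit sum 30;
    # with 10^6 uniform draws, both are present with overwhelming probability.
    if 3999 in s and 99930 in s:
        return 99930 - 3999  # 95931
    # Rare fallback: the straightforward scan.
    hits = [n for n in s if sum(map(int, str(n))) == target]
    return max(hits) - min(hits) if hits else None
```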

I'm confused about the absolute timings. OP reports 0.66s for naive code using str/int to compute the digit sums; I get about 0.86s, which seems reasonable. For me, using mod+div is about 2x slower, which isn't a huge surprise because it involves explicit looping in Python code. But you report 55ms for this case. Your machine can't possibly be 20x faster than mine. Is it possible that you're taking 10^5 numbers up to 10^6 rather than 10^6 numbers up to 10^5? (Obviously in that case my hack would be completely useless.)


This is actually a great example of an optimization that would be extremely difficult for an LLM to find. It requires a separate computation to find the smallest/largest numbers in the range with digits summing to 30, so an LLM is unlikely to be able to generate them accurately on the fly.



Amazing.

Next step would be to propose hardcoding 99930-3999 as the O(1) result and live with the output just being wrong sometimes. The bug rate is then in the ballpark of most modern software, including LLMs', so I'd say ship it.


Doesn’t this line of thinking constantly redefine success until all software is only bugs?


Ah, so that’s what’s been happening!


My personal theory is that the rapture actually already happened but we didn’t notice because only QA was without sin.


We prefer to call it "engineering"


Keep the secret!


Always has been


That's exactly what has been happening for decades now, and it's what things like scrum actively encourage.

Given work item does not fit into allotted timebox? Relax Definition of Done until it does ¯\_(ツ)_/¯


Did it find it before the HN comment? O1 has access to the web, so I'm just asking


No it doesn't.

"When did Jeff Baena die?"

> There is no record or credible report indicating that Jeff Baena has passed away. As of the most recent information available, he is still alive. My training data includes information up to October 2023. Events or details that emerged after that date may not be reflected in my responses.


That's actually a good point. Sadly "open"ai obfuscates this, too, so it is impossible to know.


You could maybe test that by seeding many comments describing a false approach.


Sadly it failed to make a separate loop to check for this first.


Should we be worried yet?


This isn’t amazing because it’s a well known contest trick.


I tried it in OpenAI's O1. If I give it minimaxir's original prompt it writes the obvious loop, even if I include the postamble "Look for tricks that will make this function run as fast as possible in the common case".

However, if I then simply ask "What is the most probable result for this function to return?" it figures out the answer and a very good approximation of the probability (4.5e-5). From there it's easily able to rewrite the program to use the trick. So the creative step of spotting that this line of reasoning might be profitable seems missing for now, but 2025's models might solve this :-)


The information on the creative step which you provided to O1 was also the key step, and it contained almost all of the difficulty. The hope is that 2025 models could eventually come up with solutions like this given enough time, but this is also a toy problem. The question is how much clever answers will cost for complex real-world problems. At present the answer looks like: very much.


For me O1 found this by telling it "There is a further significant optimization possible."


What if you keep telling it that "there is a further significant optimization possible"?


I claim we can do O(1) complexity (minus precompute) in all cases, see another comment of mine. Curious if O1 will figure it out.


In that comment, you are generating your own random numbers and then optimizing away the actual generation. It can't take an input array.

While clever, I think that strays too far from the initial prompt.


All I need to run the algorithm is the proportion of qualifying numbers in the input range and the number of samples. Then we can sample the min and max indices of the qualifying array and return their difference, without having to draw many samples, if we can derive the joint min/max distribution conditional on the Bernoulli count.

In other words the procedure can take any input array and qualifying criteria.

The joint distribution is relatively simple to derive. (This is related to the fact that the min and max of continuous uniforms on [0, 1] are Beta-distributed.)


Sampling doesn't give you the actual answer for an actual array. If the program uses the array for multiple things, such as organizing the numbers after allocating the correct number of buckets, your method will cause logic errors and crashes.

The O(1) method based on statistics only works when the function making this calculation can hide the array (or lack of array) behind a curtain the entire time. If it has to take an array as input, or share its array as output, the facade crumbles.

The prompt is not "generate this many random numbers and then say max qualifying minus min qualifying". If it was, your method would give valid solutions. But the prompt starts with "Given a list".

In the article, we let ChatGPT generate the random numbers as a matter of convenience. But the timing results are only valid as long as it keeps that part intact and isolated. We have to be able to swap it out for any other source of random numbers. If it invents a method that can't do that, it has failed.


It depends on how you read the problem, still. In a lot of the LLM's solutions the array is not passed into the solving function but constructed inside it (instead of defining the function with an input and then writing a main function that constructs an array and calls the solver with it as an argument, as is typical in Python). So either the LLM did not read it that way, or it also failed this aspect of the code, which was never really mentioned. It is not clear whether we are given a specific array of integers, or whether the input is an array of random variables that we need to instantiate ourselves.


Given the problem size is bounded, all solutions for solving this could be considered O(1).


This gets to the old saw: "knowing what question to ask is the most important thing". To the extent that LLMs can answer questions better than they can formulate which ones to ask, they may be inherently limited. We will see.


But it does seem they are good (to the extent that they are good at anything) at identifying the questions first if you ask them. It does mean you need an ok enough meta-question to start the chain of the reasoning, but that is the key insight of the recent wave of "reasoning models." First ask the LLM to reformulate the problem and structure an approach, or multiple approaches on how to address it, then have a second pass do just that.


Google search with fewer steps? Still a huge advancement, of course.

I wonder how much benefit we'd get from a meta-language for describing these problems correctly for the LLMs to process into code, an even-higher-level language. Perhaps we could call it English?


Excellent point. The hope is reasoning LLMs will make a difference for such problems. But it's also a great example of why those who think that letting the LLM iterate more will be crucial to reasoning are off base. There are many computations that transformers (or humans, for that matter) are not well equipped to represent internally; tool use during the reasoning process is unavoidable for all but artificial or knowledge-heavy problems.

Small examples, throwaway but involved calculations, prototypes, notes of what didn't work and what's promising are what's crucial for novel reasoning. It goes beyond just search or iterative refinement; there is no royal road to reasoning.


You guys are picking on the problem statement. Here's a revised prompt, which also skips the silliness of single threading:

   Write __fully parallelized__ Python code to solve this problem: __Generate__ 1 million random integers between 1 and 10,000,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.


Correct, this optimization no longer works when you change the problem.


something something moving goal posts


Whose digits sum up to 30, or the sum of whose digits equal 30?

Btw, _whose_ digits are we talking about?

I just built a random program generator. After I finish optimizing, I'm gonna test it to see if it works!

"If builders built houses the way programmers build programs, the first woodpecker to come along would destroy civilization"

https://en.m.wikiquote.org/wiki/Gerald_Weinberg


> Btw, _whose_ digits are we talking about?

You seem to be under the impression that whose is not a form of which, which is incorrect.

whose:which::whose:who


The sum of the digits of which equals 30.


Are you sure you don't mean "The digits of which sum to 30"?

There being one legal way to say something isn't evidence that other ways are illegal. It remains the case that whose bears the same relation to which that it does to who.


But what's interesting about this is that there's a tradeoff in the total computation performed by the "fully parallelized" version of this and a sequential one. Without the user knowing this, it's kind of impossible to get the optimization you want: Do you want a minimum work solution or a minimum wall-clock-time solution?

If you want a better fully parallelized one, you do this:

Repeat a few times in exponential progression on k:

Process, in parallel, the first k entries in the list (let's start with 1000). Find the min and max whose digit sums = 30.

In parallel, filter the remaining list to eliminate entries that would not improve upon the min/max thus found.

k *= 10 and repeat until done.
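A sequential sketch of those stages (names mine); in the real version, each marked step would be a parallel map/filter over chunks:

```python
def staged_min_max_diff(nums, target=30, k0=1000):
    def dsum(n):
        s = 0
        while n:
            s, n = s + n % 10, n // 10
        return s

    lo = hi = None
    rest, k = list(nums), k0
    while rest:
        head, rest = rest[:k], rest[k:]
        # Parallelizable: digit-sum the k-entry head.
        for n in head:
            if dsum(n) == target:
                lo = n if lo is None else min(lo, n)
                hi = n if hi is None else max(hi, n)
        # Parallelizable: drop entries that cannot improve the bounds so far.
        if lo is not None:
            rest = [n for n in rest if n < lo or n > hi]
        k *= 10
    return None if lo is None else hi - lo
```

After the first stage, the filter typically eliminates nearly everything, which is where the total-work savings come from.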

I would wager against the LLM identifying this solution without prompting from the user (or reading this comment).


> This is actually a great example of an optimization that would be extremely difficult for an LLM to find

It'll be somewhat more likely since the next gen training set includes your comment :)

(disclaimer: I have no personal knowledge of ai companies scraping hacker news, but it wouldn't surprise me at all)


It would be very surprising if they did not scrape this site. The content is very high quality in the general case, and there are no giant barriers to entry (there is even a clean API!). One might even use us to fine-tune a coding assistant or the like.


Are you sure it would be hard?

Maybe it only requires asking the LLM to be creative when designing the algorithm. The parent poster spent some time thinking about it, obviously--he didn't generate it accurately "on the fly," either. But he's able to direct his own attention.

I don't see why the LLM couldn't come up with this logic, if prompted to think about a clever algorithm that was highly specific to this problem.


I suspect that it would be unlikely to come up with it because it requires execution of a fairly lengthy algorithm (or sophisticated mathematical reasoning) to find the smallest/largest valid numbers in the range. You can verify this for yourself with the following ChatGPT prompt: "What is the smallest number in the range (1, 100000) whose digits sum to 30? Do not execute separate code."


O1 did find the optimization in a sibling comment (sibling to my GP)

So probably time to update your expectations


Why limit its ability to write separate code?


Because otherwise we are talking about LLMs augmented with external tools (i.e. Python interpreters). My original comment was pointing to the limitations of LLMs in writing code by themselves.


You wouldn't ask a programmer to solve a problem and then not let them write down the source or debug the program as they write it, would you?

Are you asking it to not write down an algorithm that is general? They are doing a pretty good job on mathematical proofs.

I still don't understand why you wouldn't let it use its full reasoning abilities by letting it write down code or even invoke another agent. We should be testing the result, not the methods.


I'm simply pointing out the limitations of LLMs as code writers. Hybrid systems like ChatGPT-o1 that augment LLMs with tools like Python interpreters certainly have the potential to improve their performance. I am in full agreement!

It is worth noting that even ChatGPT-o1 doesn't seem capable of finding this code optimization, despite having access to a Python interpreter.


> y = sum([x for x in range(1,n)]) <= 30

> Write an efficient program that, given a number, finds the integer n that satisfies the above constraints

Goal: Find n where sum of integers from 1 to n-1 is ≤ 30

This is a triangular number problem: (n-1)(n)/2 ≤ 30

... code elided ...

> Ok, now make a find_n_for_sum(s=30)

  def find_n_for_sum(s: int) -> int:
      return int((1 + (1 + 8*s)**0.5) / 2)

  # Tests
  assert sum(range(1, find_n_for_sum(30))) <= 30
  assert sum(range(1, find_n_for_sum(30) + 1)) > 30


But programmers are LLMs augmented with the ability to run code. It seems odd to add a restriction when testing if an LLM is "as good as" a programmer, because if the LLM knows what it would need to do with the external code, that's just as good.


This gave me an idea that we can skip the whole pass over the million draws by noting that the count of draws landing in my precomputed set M (digits-sum=30) is Binomial(n=1mln, p=|M|/100k). Then we sample that count X. If X=0, the difference is not defined. Otherwise, we can directly draw (min,max) from the correct joint distribution of indices (like you’d get if you actually did X draws in M). Finally we return M[max] - M[min]. It’s O(1) at runtime (ignoring the offline step of listing all numbers whose digits sum to 30).
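A hedged sketch of that procedure (all names mine; the binomial draw uses a normal approximation to stay in the stdlib, and the min/max pair is drawn via continuous order statistics, so the discrete boundaries are only approximate):

```python
import math
import random

def precompute_M(limit=100_000, target=30):
    # Offline step: all values in range with the required digit sum, sorted.
    return [n for n in range(1, limit + 1)
            if sum(map(int, str(n))) == target]

def sampled_diff(n_draws=1_000_000, limit=100_000, target=30):
    M = precompute_M(limit, target)
    K = len(M)
    p = K / limit
    # X ~ Binomial(n_draws, p): how many draws land in M.
    # (Normal approximation; numpy's binomial or Python 3.12's
    # random.binomialvariate would be exact.)
    mu, sigma = n_draws * p, math.sqrt(n_draws * p * (1 - p))
    X = max(0, round(random.gauss(mu, sigma)))
    if X == 0:
        return None        # no qualifying draw at all
    if X == 1:
        return 0           # min and max coincide
    # Order statistics: max of X uniforms, then min given that max.
    u_max = random.random() ** (1 / X)
    u_min = u_max * (1 - random.random() ** (1 / (X - 1)))
    i_max = min(K - 1, int(u_max * K))
    i_min = min(i_max, int(u_min * K))
    return M[i_max] - M[i_min]
```

Everything after `precompute_M` is O(1) in the number of draws.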


In fact, we could simply check for the 3 smallest and the 3 largest candidate numbers and ignore the rest.

Assuming the numbers are really random, that's a failure probability of about 10^-13, which is at the point where we start to worry about errors caused by cosmic rays. With a few more numbers, you can get to the point where the only way it can fail is a problem with the random number generation or some external factor.

If it was something like a programming contest, I would just do "return 95931" and hope for the best. But of course, programming contests usually don't just rely on random numbers and test edge cases.
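The arithmetic behind that estimate (treating the absences as independent, which is a close approximation here):

```python
# Probability that one specific value never appears
# among 10^6 draws from 10^5 values:
p_miss = (1 - 1e-5) ** 1_000_000   # ~ e^-10 ~ 4.5e-5
# Failure requires ALL 3 smallest or ALL 3 largest
# qualifying values to be absent:
p_fail = 2 * p_miss ** 3           # ~ 2e-13
assert p_fail < 1e-12
```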


For 10^5 draws, to get the same collision probability (~2 * exp(-10)), you would just need to compute the 10 maximum/minimum candidates and check against those.


With this trick you can test while generating the random numbers and if you see both values, you can short circuit the generation of random numbers.


The input generation is outside the scope of this. Otherwise you could directly choose the output values with the appropriate distribution and just skip all the rest.

(Arguably, this criticism applies to exchanging random.randint for a numpy equivalent as well, since that doesn't optimize the solution but only how quickly the question is being generated.)


Iterating over a precomputed list is a method of generating random numbers; it is what a one-time pad does. Whether we iterate over a precomputed list or use a pseudo-random number generator, we can short-circuit the generation using this trick. We cannot directly choose the output values, because then they would not be random.


They’re proposing choosing the output values randomly according to the distribution obtained by choosing input values uniformly at random for the original algorithm.


That removes the random element to this. The way that random numbers work is that it is possible (although unlikely) that the minimum and maximal values in the range will not be selected when generating the million random numbers. If you assume that they will always be selected and thus always return the same output, then your output will be wrong at least some of the time.


I don’t know how you got “always return the same output” from “choose the output randomly according to [a non-uniform] distribution”.


No, you're right, I should have said 550ms and 100ms, I'm having a doof morning about timing. Thank you! Too late to edit my post.


Ahhhh so we can just return the result without doing anything!


This exactly highlights my fear of widespread use of LLMs for code - missing the actual optimisations because we’re stuck in a review, rather than create, mode of thinking.

But maybe that’s a good thing for those of us not dependent on LLMs :)


Well, if you or anyone else has good optimization and performance chops: http://openlibrary.org/ has been struggling with performance a bit lately, and it's hard to track down the cause. CPU load is low and not much has changed recently, so it's unlikely to be a bad query or something.

The main thing I've suggested is upgrading the DB from Postgres 9. That isn't an easy task, but something like 15 years of DB improvements would probably give some extra performance.


It might not be as awful as feared? That big a jump probably requires a dump and restore, but maybe it could still be done in place. pg_upgrade is pretty solid. But I agree - it's likely a cheap and easy perf win.


Is there source code / a GitHub link with more information?



Is there a specific issue with more context? I looked at the repo already but it’s not obvious which operations are slowest / most important to optimize.


Unfortunately the discussion mostly is on their public Slack channel which you have to fill out the volunteer form to join.

But there are several more specific issues about performance: https://github.com/internetarchive/openlibrary/issues?q=is%3...


Another speed-up is to skip the digit-sum check when n % 9 != 30 % 9. A number's digit sum has the same remainder mod 9 as the number itself. This rules out 8/9 ≈ 89% of candidates.
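The pre-filter is one cheap comparison in front of the expensive digit loop (a sketch; the function name is mine):

```python
def qualifies(n, target=30):
    # A number is congruent to its digit sum mod 9,
    # so n % 9 must equal target % 9 (here 30 % 9 == 3).
    if n % 9 != target % 9:
        return False
    s = 0
    while n:
        s, n = s + n % 10, n // 10
    return s == target
```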


Would someone write a mathematical proof showing this is always true?


  a = [int(x) for x in str(n)][::-1]
  assert n == sum(d*(10**i) for i, d in enumerate(a))

Now, when you're operating mod 9, 10 % 9 == 1, thus

  10**i % 9 == 1

This comes from the fact that

  (a*b) % 9 == ((a % 9) * (b % 9)) % 9

Now using

  (a+b) % 9 == ((a % 9) + (b % 9)) % 9

we get that sum(a) and n are the same mod 9.


Thank you for that.


Did you measure it? I would expect using % would ruin your performance as it's slow, even if it allows you to avoid doing a bunch of sums (which are fast).


You can do this “without” using the modulus operation by storing the numbers in a boolean array. Start at 3999 and keep adding 9 to find the minimum. Then start at 99930 and keep subtracting 9 to find the maximum. You would need to check if the number is in the array and then if the number’s digits sum to 30.

Note that the conversion of numbers to base 10 to check the digits typically involves doing division and modulus operations, so you are already doing those even if you remove the modulus operation from this check. That is unless you find a clever way of extracting the digits using the modular multiplicative inverse to calculate x/10^k.


It turns out that there is no modular multiplicative inverse for this, so that trick cannot be used to avoid the modulus and division when getting the base 10 digits:

https://extendedeuclideanalgorithm.com/calculator.php?mode=2...


Indeed there isn't; 10 is not relatively prime to 2^32. However, 5 is (and therefore has a multiplicative inverse), so you can right shift and then multiply by the inverse.
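A quick illustration of that trick in 32-bit arithmetic (names mine; only valid when x is an exact multiple of 10):

```python
MASK32 = (1 << 32) - 1
INV5 = pow(5, -1, 1 << 32)  # multiplicative inverse of 5 mod 2^32

def div10_exact(x):
    # x // 10 without division: shift out the factor of 2,
    # then multiply by the inverse of 5 mod 2^32.
    # Precondition: x % 10 == 0 and x fits in 32 bits.
    return ((x >> 1) * INV5) & MASK32
```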


All of this is missing the point that doing basic arithmetic like this in Python drowns in the overhead of manipulating objects (at least with the reference C implementation).

For that matter, the naive "convert to string and convert each digit to int" approach becomes faster in pure Python than using explicit div/mod arithmetic for very large numbers. This is in part thanks to algorithmic improvements implemented at least partially in Python (https://github.com/python/cpython/blob/main/Lib/_pylong.py#L...). But I can also see improved performance even for only a couple hundred digits (i.e. less than DIGLIM for the recursion) which I think comes from being able to do the div/mod loop in C (although my initial idea about the details doesn't make much sense if I keep thinking about it).


Doing a single modulo 9 operation is much faster than summing a d-digit number, which requires d modulo 10s, d divide 10s, and d sums.


You can do the sum by looking directly at the digits in the string; no need for modulo at all.


Number to string is even more expensive.


Each sum involves determining the digits to sum, which involves using % multiple times.

Also, you don't have to use % in order to decide whether to perform the sum-of-digits check for a given value. You can just iterate over values to check in steps of 9.


> Test if the number is < min or > max _before_ doing the digit sum. It's a free 5.5x speedup that renders some of the other optimizations, like trying to memoize digit sums, unnecessary.

How exactly did you arrive at this conclusion? The input is a million numbers in the range from 1 to 100000, chosen with a uniform random distribution; the minimum and maximum values are therefore very likely to be close to 1 and 100000 respectively - on average there won't be that much range to include. (There should only be something like a 1 in 11000 chance of excluding any numbers!)

On the other hand, we only need to consider numbers congruent to 3 modulo 9.

And memoizing digit sums is going to be helpful regardless because on average each value in the input appears 10 times.

And as others point out, by the same reasoning, the minimum and maximum values with the required digit sum are overwhelmingly likely to be present.

And if they aren't, we could just step through the values 9 at a time until we find ones that are in the input (and have the required digit sum, since stepping by 9 only guarantees the sum differs from 30 by a multiple of 9), building a `set` from the input values.
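That fallback, sketched (helper names mine; assumes the stated 1 to 100000 range, so the scans terminate almost immediately on realistic inputs):

```python
def min_max_diff_step9(nums, target=30):
    present = set(nums)
    dsum = lambda n: sum(map(int, str(n)))
    # 3999 / 99930 are the extreme candidates; every qualifying value is
    # congruent to them mod 9, so scan in steps of 9. The digit-sum check
    # is still needed: stepping by 9 also visits sums 3, 12, 21, 39, ...
    lo = next((n for n in range(3999, 100_001, 9)
               if n in present and dsum(n) == target), None)
    if lo is None:
        return None
    hi = next(n for n in range(99930, lo - 1, -9)
              if n in present and dsum(n) == target)
    return hi - lo
```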


I actually think precomputing the numbers with digit sum 30 is the best approach. I'd give a very rough estimate of 500-3000 candidates because 30 is rather high, and we only need to loop for the first 4 digits because the fifth can be calculated. After that, it is O(1) set/dict lookups for each of the 1000000 numbers.

Everything can also be wrapped in list comprehensions for top performance.
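That approach in sketch form (the constant name is mine, and the precompute here is a brute-force scan rather than the four-digit loop the comment describes, to keep it short):

```python
# Precompute once: every value in [1, 100000] whose digits sum to 30.
CANDIDATES = {n for n in range(1, 100_001) if sum(map(int, str(n))) == 30}

def min_max_diff_lookup(nums):
    # O(1) set membership test per input value.
    hits = [n for n in nums if n in CANDIDATES]
    return max(hits) - min(hits) if hits else None
```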


It's decent when you prompt it to find easy-to-miss but substantial improvements around corner cases, which is something I've taken to doing.

Basically you just have to put it in the mode that's looking for such things


(Small correction, multiply my times by 10, sigh, I need an LLM to double check that I'm converting seconds to milliseconds right. Base 550ms, optimized 70ms)


Or the other obvious optimization to hard-code the lookup in code as a huge list, instead of creating it first.


That's a prompting issue though.


Do you have an example prompt that works?


I had a scan of the code examples, but one other idea that occurred to me is that you could immediately drop any numbers below 999 (probably slightly higher, but that would need calculation rather than being intuitive).


> probably slightly higher, but that would need calculation rather than being intuitive

I think it’s easy to figure out that 3999 is the smallest positive integer whose decimal digits add up to 30 (can’t get there with 3 digits, and for 4, you want the first digit to be as small as possible. You get that by making the other 3 as high as possible)
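A brute-force check of that reasoning:

```python
# First positive integer whose decimal digits sum to 30;
# confirms 3999 is the floor for the "drop small numbers" cutoff.
smallest = next(n for n in range(1, 100_000)
                if sum(map(int, str(n))) == 30)
assert smallest == 3999
```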



