I saw someone on Twitter complaining that this behavior can be confusing: without being aware of it, you can write drastically slower code. However, I must say that I find this approach quite pragmatic and neat. In most cases, string concatenation is not going to be a performance issue, and then you never need to care about these details at all. If you're in a performance-oriented setting you need to be aware of this, but as long as it's well understood and documented (which it unfortunately doesn't seem to be), it's quite easy to make it fast.
The big advantage is that there is just a single "string" type. Compare this with Java or Rust where you have separate types for mutable strings (StringBuilder in Java; String in Rust) and immutable strings (String in Java; str in Rust). This is nice both for newcomers (who don't have to learn all the details of how to optimize concatenation) and if you're just writing code which is never going to be in the hot path anyway.
I understand that in some settings (when you really care about performance) it's good to have an explicit difference between these, but I quite like the reduced complexity.
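For anyone who hasn't run into this, here's a minimal sketch of the two concatenation strategies being discussed (function names are mine, for illustration). CPython can often optimize `s += piece` in place when `s` has no other references, but that's an implementation detail; the portable advice is to collect pieces and join once:

```python
def concat_naive(pieces):
    # Worst case quadratic: each += may copy the entire string built so far.
    s = ""
    for p in pieces:
        s += p
    return s

def concat_join(pieces):
    # Linear: str.join computes the total length and allocates once.
    return "".join(pieces)

pieces = ["chunk"] * 1000
assert concat_naive(pieces) == concat_join(pieces)
```

On CPython the first version is usually fast anyway because of the in-place optimization, but on other implementations (or when the string has another reference) it can degrade to quadratic time.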
> Compare this with Java or Rust where you have separate types for mutable strings (StringBuilder in Java; String in Rust) and immutable strings (String in Java; str in Rust).
Not to detract from your point (which is broadly correct) but this isn't exactly right for Rust. Or at least isn't the right way to phrase it.
Rust has one string type, `str`, which can be referenced immutably (`&str`) or mutably (`&mut str`). What `str` can't do is manage memory, which means it can't grow a string. `String` is more correctly a `StringBuffer` (technically a `Vec<u8>`, albeit with additional checks). It can manage memory and therefore grow the size of the string, either by using spare capacity or by reallocating the memory. It is not really a string type in itself; it's just a memory buffer for holding a `str`.
This is all a long-winded way of saying Rust separates managing memory from operating on the memory. This isn't the same as a mutable/immutable distinction, although it is in some ways similar to a growable/un-growable one.
> In most cases, string concatenation is not going to be a performance issue, and then you never care about these details at all.
I would have thought if the performance of anything in a language like Python mattered, it'd be how quickly you can build up a text response, like a JSON document or rendered HTML?
In my implementation of Ruby I use immutable ropes for strings - so instead of mutable arrays of characters I have persistent trees of immutable arrays of characters. This has a massive impact (literally 10x) on real-world code like template rendering.
This is actually a good point. If you're never accessing inside a string directly, but only iterating over it, then it's often far better to build a tree/rope structure than to create a contiguous memory region.
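The rope idea can be sketched in a few lines of Python (a toy model, not how any real implementation does it; the class names are made up). Concatenation just builds a new tree node, so no character data is copied until you materialize the result once at the end:

```python
class Leaf:
    """An immutable chunk of character data."""
    def __init__(self, text):
        self.text = text

class Node:
    """An interior node: the concatenation of two ropes."""
    def __init__(self, left, right):
        self.left, self.right = left, right

def concat(a, b):
    # O(1): allocates one node, copies no characters.
    return Node(a, b)

def to_str(rope):
    # Materialize once by walking the tree left-to-right.
    out, stack = [], [rope]
    while stack:
        r = stack.pop()
        if isinstance(r, Leaf):
            out.append(r.text)
        else:
            stack.append(r.right)  # pushed first, so left pops first
            stack.append(r.left)
    return "".join(out)

rope = Leaf("")
for part in ["Hello", ", ", "world"]:
    rope = concat(rope, Leaf(part))
assert to_str(rope) == "Hello, world"
```

Building N pieces costs O(N) nodes instead of O(N²) copied bytes; real ropes additionally rebalance the tree and cache lengths to keep indexing fast.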
If you were really concerned about performance, you wouldn't use Python to begin with.
That said, there are many use cases where performance does not matter as long as it is not "too much", for different values of "too much" depending on context.
Assembly code making poor choices can be outperformed by Python code making smart algorithmic choices. The sentiment behind "don't use Python for performance critical code" isn't wrong, but there's nuance. Programmers should make informed choices about space and time complexity regardless of the language being used.
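To make that concrete, here's a rough sketch of the kind of algorithmic choice that dwarfs language-level speed (the functions are mine, just for illustration):

```python
def has_duplicate_quadratic(items):
    # O(n^2): the "poor choice" that no amount of hand-tuned
    # assembly will rescue for large n.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_linear(items):
    # O(n) expected time with a hash set: the "smart algorithmic choice".
    seen = set()
    for x in items:
        if x in seen:
            return True
        seen.add(x)
    return False

data = list(range(2000)) + [0]
assert has_duplicate_quadratic(data) == has_duplicate_linear(data) == True
```

Past a fairly small n, the set-based Python version beats the nested-loop version no matter what language the latter is written in.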
Your typical assembly programmer may be far more aware of their obligation to do so than your typical Python programmer, so in my mind it's more important that languages occupying Python's ecological niche behave predictably. It can be challenging to balance that need against other constraints like limiting the number of abstractions someone needs to master in order to be productive.
Sure, assembly is overkill for most tasks, but Python's performance is so poor that you can sometimes write a brute-force double loop in C++ and have it outperform anything in native Python.
Sometimes raw performance does save developer time, because you don't have to worry that much about the algorithm. :)
It may be confusing, but it applies to any dynamically-sized container. For mutable containers the impact is smaller and can often be lessened by increasing the capacity of the container before appending items, but often you don't know the target capacity in advance. You'll have to learn it, or use a language that doesn't have dynamically-sized containers, such as classical COBOL, FORTRAN or (to a lesser extent) Pascal.
So, this is a lesson you just have to learn at some time.
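The capacity trick translates to Python lists roughly like this (a sketch; Python has no explicit `reserve`, but preallocating a fixed-size list plays the same role when the final size is known):

```python
def build_by_append(n):
    # Lists over-allocate as they grow, so append is amortized O(1),
    # but the buffer is still reallocated several times along the way.
    out = []
    for i in range(n):
        out.append(i * i)
    return out

def build_preallocated(n):
    # One allocation up front, comparable to reserving capacity.
    out = [None] * n
    for i in range(n):
        out[i] = i * i
    return out

assert build_by_append(5) == build_preallocated(5) == [0, 1, 4, 9, 16]
```

The gap between the two is much smaller than the quadratic string case, precisely because appending to a mutable container is amortized constant time.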
It's crazy that even on the 'fast path' there are so many checks being run over and over. It provides an interesting contrast to the JavaScriptCore article from a couple of days ago: https://webkit.org/blog/10308/speculation-in-javascriptcore/