Context helps with guessing what the next word will be.


With no significant background in ML-based TTS, I’m assuming that a larger context window would help with tone as well. “We are gathered here today to mourn the loss of…” really provides context into how the whole thing might sound, even if most of it is singing the praises of the deceased.


If you've been given 1000 characters (a fairly long paragraph) of text to read (and supposing you get to study them before you start speaking), is "guessing what the next word will be" all that relevant to decisions about intonation?


They're heading in the right direction, but wrong on the specifics: you can't maintain consistent prosody without context.

The simplest examples are punctuation marks, which change your speech before you reach the mark, but the problem extends past sentence boundaries.

For example:

"He didn't steal the green car. He borrowed it."

vs

"He didn't steal the green car. He stole the red one."

A natural speaker would slightly emphasize "steal" and "borrowed" in the first example, but emphasize "green" and "red" in the second.

Or like when you're building a set:

"Peter called Mary."

vs

"John called Mary. Peter called Mary. Who didn't call Mary?"

-

These all sound like small nits, but for naively stitched-together TTS, at best they nudge the narration towards the uncanny valley (which may be acceptable for some use cases)... and at worst they make the model sound broken.


> The simplest examples are punctuation marks which change your speech before you reach the mark, but the problem extends past sentence boundaries

I agree, but it seems unusual for this to matter past paragraph boundaries, and it sounds like there should be enough room for a full paragraph of context.


It depends on the level of quality you're going for, but prosody even changes based on how related the preceding paragraph was to the upcoming one.

And the current SOTA for TTS includes breathing too, so you can't just put a fixed empty pause between your paragraphs.

People are chunking by paragraphs anyway (or even by sentences) and it works, but the top commercial models support maintaining a context or passing in the most recently generated text for that reason.
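
To make the chunk-plus-context idea concrete, here is a minimal sketch in Python. It only does the bookkeeping: it splits text on blank lines and pairs each paragraph with the tail of what has already been narrated. The function name `chunks_with_context`, the `max_context_chars` parameter, and the 1000-character default are all assumptions for illustration, not any vendor's API; how the context actually gets passed to a TTS model depends on the model.

```python
from typing import Iterator, Tuple


def chunks_with_context(
    text: str, max_context_chars: int = 1000
) -> Iterator[Tuple[str, str]]:
    """Yield (context, paragraph) pairs for paragraph-level TTS chunking.

    `context` is the tail of the text that precedes `paragraph`, capped at
    `max_context_chars` characters. Hypothetical helper, not a real TTS API.
    """
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    previous = ""
    for paragraph in paragraphs:
        # Guard against [-0:], which would return the whole string.
        context = previous[-max_context_chars:] if max_context_chars > 0 else ""
        yield context, paragraph
        previous = (previous + "\n\n" + paragraph).strip()


if __name__ == "__main__":
    sample = "John called Mary.\n\nPeter called Mary.\n\nWho didn't call Mary?"
    for context, paragraph in chunks_with_context(sample):
        # A real pipeline would hand both values to the TTS model here.
        print(repr(context), "->", repr(paragraph))
```

With the "building a set" example from upthread, the third chunk ("Who didn't call Mary?") arrives together with the two sentences before it, which is the information a model would need to place the emphasis sensibly.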



