I don't think that text to movement will be desired. There is a lot of room for interpretation there, and if a creator already has a vision for movement, they would probably choose to have the model mimic a video of them acting rather than trying to get lucky and correctly describe a set of movements.
To the contrary, I think text-to-movement is going to be huge for videogames especially.
I don't see any other way to smoothly link 1,000 possible movements to and from each other, including when there are various fixed distances between you and a ladder or ledge etc.
I think models will learn "movement personalities" the same way they learn a particular celebrity's voice -- everybody moves with different rhythms. So your big burly action-hero character will move with a totally different rhythm from your waif-thin ethereal elf.
But there will still be a textual vocabulary that generates the motion -- "stealthily creeps to the door 2.2 meters away, taking 6.3 seconds, and then suddenly and dramatically opens it with a flourish".
Inverse kinematics has been solving most of the animation blending you're describing here for years.
Not to completely dismiss the potential of this, but there are already pretty good, reliable ways to solve these issues.
I believe the person you replied to is correct in most cases: a lot of nuance is lost if it's done by text.
Text to movement is highly desirable. There is a long way to go before tools like this are used in AAA games and blockbuster films, but as much money as is tied up in those areas, even more is tied up in advertising and lower-tier content production pipelines.
Look at tools like Adobe Animate and Character Animator. These tools are absolutely capable of interpolating between frames while giving the person operating them a fine degree of control. When you can use text to movement to quickly create tens or hundreds of sample scenes, then manually iterate on and edit them instead of hand drawing and compositing, the value proposition is pretty crazy for time- and budget-constrained production where tradeoffs on quality and fidelity are acceptable.