SpeechGen
Unlocking the Generative Power of Speech Language Models with Prompts
Large language models (LLMs) have gained considerable attention for Artificial Intelligence Generated Content (AIGC), particularly with the emergence of ChatGPT. However, the direct adaptation of continuous speech to LLMs that process discrete tokens remains an unsolved challenge, hindering the application of LLMs to speech generation. Advanced speech LMs are around the corner, since speech signals carry a wealth of information beyond text alone, including speaker and emotion information. Prompt tuning has demonstrated notable gains in parameter efficiency and competitive performance on some speech classification tasks. However, the extent to which prompts can effectively elicit generation tasks from speech LMs remains an open question. In this paper, we present pioneering research that explores prompt tuning as a way to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen, with around 10M trainable parameters. The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs, which will significantly enhance the capabilities of the framework.
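
To make the idea of prompt tuning concrete, here is a minimal sketch of the generic input-level variant: a small set of trainable prompt vectors is prepended to the embeddings of the discrete speech units, while the pretrained speech LM stays frozen. Every name and size below (`EMBED_DIM`, `NUM_UNITS`, `PROMPT_LENGTH`, `build_prompted_input`) is a hypothetical placeholder chosen for illustration, not SpeechGen's actual configuration or code.

```python
import random

# All sizes are hypothetical and tiny; they are NOT SpeechGen's real settings.
EMBED_DIM = 4       # assumed embedding width
NUM_UNITS = 10      # assumed size of the discrete speech-unit vocabulary
PROMPT_LENGTH = 3   # assumed number of prompt vectors

rng = random.Random(0)

# Frozen: stand-in for the pretrained speech LM's embedding table (never updated).
frozen_embeddings = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(NUM_UNITS)
]

# Trainable: the task-specific prompt vectors, the only parameters an optimizer
# would update in this sketch.
prompt_vectors = [[0.0] * EMBED_DIM for _ in range(PROMPT_LENGTH)]


def build_prompted_input(prompt, unit_ids, embeddings):
    """Prepend the prompt vectors to the embeddings of the discrete units."""
    return [list(vec) for vec in prompt] + [list(embeddings[u]) for u in unit_ids]


def count_parameters(matrix):
    """Count the scalar entries in a list-of-lists matrix."""
    return sum(len(row) for row in matrix)


units = [7, 2, 2, 5]  # hypothetical discrete units for one utterance
sequence = build_prompted_input(prompt_vectors, units, frozen_embeddings)

print(len(sequence))                         # prompt length + number of units
print(count_parameters(prompt_vectors))      # trainable parameters in this sketch
print(count_parameters(frozen_embeddings))   # frozen parameters in this sketch
```

Because only `prompt_vectors` would receive gradient updates, the trainable parameter count depends on the prompt size rather than on the size of the frozen model, which is where the parameter efficiency comes from.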
Demo
Speech Translation
| Spanish | English (Ground Truth) | SpeechGen |
|---|---|---|
| | The origin of the name of the county is uncertain. | Origin of the name of the county is uncertain. |
| | Lastly, the play will devote a reflection to the relationship between art and rebellion. | And lastly the work will devote a reflection to the relationship between art and rebellion. |
| | It is around thirty kilometers away from the regional capital city. | Just one hundred forty kilometers from the regional capital. |
| | They were easily recognized by the use of the armor and the "Farina" helmet. | They were frequently recognized for the use of armor and the cascade. |
| | They played in cover bands but decided to create their own music. | They played in mandates but they decided to create their own music. |
Speech Inpainting
| Clean | Corrupted | SpeechGen |
|---|---|---|
| Would not the loftiest eminences of the city at least be visible? | Would not the loftiest eminences of the city at lea__ ___ __le? | Would not the loftiest eminences of the city at least be visible? |
| And the whole night the tree stood still and in deep thought. | And the whole ___ the tree stood still and in deep thought. | And the whole night the tree stood still and in deep thought. |
| If it only were not so dark here and so terribly lonely. | If it only were not so dark here and so terribly lone__. | If it only were not so dark here and so terribly lonely. |
Speech Continuation
| Conditional Ratio | Seed Segment | SpeechGen |
|---|---|---|
| 0.25 | He assassinated the president | He assassinated the president and gave mister johnson the last charge of improvement in his writing possible three point eight nine. |
| 0.25 | If anybody really wanted | If anybody really wanted to interfere once for some time even if he got anything back to the other folk |
| 0.25 | Under standard procedures | Under standard procedures the receivers have inspected trade here by this time have recently developed they fully recognize did their business. |
| 0.5 | Who had been greatly upset by her experience was able to view a | Who had been greatly upset by her experience was able to view a longer time the great farce and receive good care of the mutter. |
| 0.5 | Childless parents widows and helpless orphans broken and controlled | Childless parents widows and helpless orphans broken and controlled by the master and sentence pursuit life apt to paradise. |
| 0.5 | But these king's witnesses were also put at times into the press yard | But these king's witnesses were also put at times into the press yard and charged with the service available on a second charge to them. |
| 0.75 | And the obvious bulk of the package which he intended to bring to work | And the obvious bulk of the package which he intended to bring to work was confirmed |
| 0.75 | Then they set to building and began by bricking the borders of the moat after which they proceeded | Then they set to building and began by bricking the borders of the moat after which they proceeded to our own places |
| 0.75 | Still watching and waiting for the first chance they ceased when the clerks | Still watching and waiting for the first chance they ceased when the clerks left the office |
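
The Conditional Ratio column can be read as the fraction of an utterance that is handed to the model as the seed segment, with the remainder left for the model to continue. That reading is an assumption based on the seed lengths in the table, not a definition given on this page. Under that assumption, a minimal sketch of how a seed might be cut from a sequence of discrete speech units looks like this (`split_seed` is a hypothetical helper, not part of SpeechGen):

```python
def split_seed(units, conditional_ratio):
    """Split a unit sequence into (seed, remainder).

    Assumes conditional_ratio is the fraction of the sequence kept as the seed.
    """
    if not 0.0 < conditional_ratio < 1.0:
        raise ValueError("conditional_ratio must be strictly between 0 and 1")
    if len(units) < 2:
        raise ValueError("need at least two units to split")
    # Keep at least one unit in the seed and at least one in the remainder.
    cut = int(len(units) * conditional_ratio)
    cut = min(max(cut, 1), len(units) - 1)
    return units[:cut], units[cut:]


utterance = list(range(8))  # hypothetical discrete units for one utterance
for ratio in (0.25, 0.5, 0.75):
    seed, remainder = split_seed(utterance, ratio)
    print(ratio, len(seed), len(remainder))
```

A higher ratio gives the model more context and leaves less to generate, which matches the pattern in the table: the 0.75 rows show long seeds followed by only a few continued words.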
Citation
@misc{wu2023speechgen,
  title={SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts},
  author={Haibin Wu and Kai-Wei Chang and Yuan-Kuei Wu and Hung-yi Lee},
  year={2023},
  eprint={2306.02207},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}