SpeechPrompt

  • Overview
  • SpeechPrompt v1
  • SpeechPrompt v2
  • SpeechGen
  • SpeechPrompt Journal

Make Spoken Language Models Versatile!

Prompt Tuning for Speech Processing Tasks


NEWS

  • [April 2025] The SpeechPrompt journal paper will be presented at ICASSP 2025 in Hyderabad.
  • [Aug. 2024] SpeechPrompt is published as a journal paper in IEEE/ACM TASLP [link]
  • [June 2023] SpeechGen paper is available [link]
  • [May 2023] SpeechPrompt v2 code is released [link]
  • [March 2023] SpeechPrompt v2 paper is available. [link]
  • [Oct. 2022] The SpeechPrompt project website is launched
  • [Oct. 2022] "SpeechPrompt" is a research topic in the JSALT workshop [website]
  • [June 2022] SpeechPrompt v1 is accepted at INTERSPEECH 2022 [paper] [code]
  • [March 2022] SpeechPrompt v1 is available on arXiv [link]

Citation (Journal Paper)

You can cite our work using the following BibTeX entry:

@ARTICLE{10620644,
author    = {Chang, Kai-Wei and Wu, Haibin and Wang, Yu-Kai and Wu, Yuan-Kuei and Shen, Hua and Tseng, Wei-Cheng and Kang, Iu-Thing and Li, Shang-Wen and Lee, Hung-Yi},
journal   = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title     = {SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks},
year      = {2024},
volume    = {32},
pages     = {3730--3744},
doi       = {10.1109/TASLP.2024.3436618}
}

DOI: 10.1109/TASLP.2024.3436618

Introduction

Self-supervised learning (SSL) has revolutionized the fields of computer vision (CV), natural language processing (NLP), and speech processing. By pre-training a model on large amounts of unlabeled data in a self-supervised manner, the model learns universal representations that benefit downstream tasks.

However, to utilize these SSL models for downstream tasks, we usually follow the "pre-train, fine-tune" paradigm. That is, we need to (1) design a downstream model, (2) fine-tune the model, and (3) store the fine-tuned parameters for each task. This incurs considerable computation and storage costs.

On the other hand, the "prompting paradigm" has been widely used in the NLP field. By leveraging the knowledge of a pre-trained language model (LM), prompt tuning optimizes only a small number of parameters for each downstream task. It can therefore serve a large number of downstream tasks in a unified manner with high computation and storage efficiency.
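
As a rough illustration, the PyTorch sketch below shows the general idea of prompt tuning (a generic sketch, not the SpeechPrompt implementation; all names and sizes are illustrative assumptions): a small set of trainable prompt vectors is prepended to the input embeddings of a frozen pre-trained LM, and only those vectors are updated for each task.

import torch
import torch.nn as nn

class PromptTunedLM(nn.Module):
    """Wraps a frozen pre-trained LM and learns only a short prompt (illustrative sketch)."""

    def __init__(self, pretrained_lm: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.lm = pretrained_lm
        for p in self.lm.parameters():
            p.requires_grad = False  # the pre-trained LM stays frozen
        # The only trainable parameters: task-specific prompt embeddings.
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) embeddings of the input sequence
        prompt = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        # Prepend the prompt vectors, then run the frozen LM as usual.
        return self.lm(torch.cat([prompt, input_embeds], dim=1))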

However, the prompting paradigm had not been explored in speech processing before. Recently, various spoken language models have been developed, which opens the door to applying prompt tuning to speech processing tasks ...
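
To give a concrete, heavily hedged feel for how this could look for speech, the toy example below reuses the PromptTunedLM sketch above: speech is assumed to be discretized into unit tokens, a small Transformer encoder stands in for a frozen spoken language model, and only the prompt vectors are optimized. All module names and sizes are illustrative assumptions, not the actual SpeechPrompt code.

# Toy stand-ins: a Transformer encoder plays the role of a frozen unit LM
# over discrete speech units (e.g., cluster indices of SSL features).
embed_dim, n_units = 256, 100
unit_embed = nn.Embedding(n_units, embed_dim)   # embeds discrete speech units
toy_unit_lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
    num_layers=2,
)

model = PromptTunedLM(toy_unit_lm, embed_dim=embed_dim, prompt_len=20)
optimizer = torch.optim.Adam([model.prompt], lr=1e-3)  # only the prompt is updated

units = torch.randint(0, n_units, (8, 50))      # a batch of discrete-unit sequences
outputs = model(unit_embed(units))              # shape: (8, 20 + 50, embed_dim)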