
Commit d7d6b65

feat(audio): add cookbook for audio transformers integration
1 parent 5d3142d commit d7d6b65

1 file changed: docs/cookbook/audio_understanding.md (+200 −0)

# Generate structured output for audio understanding

Even though audio LMs for audio-text-to-text tasks are still fairly niche, they are useful (and fun) for analysing speech, extracting information from it, translating it, or transcribing it.

This cookbook highlights the new audio-LM integration and has been tested with `Qwen/Qwen2-Audio-7B-Instruct` ([HF link](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct)).

## Setup

As usual, let's start by installing the right packages:

```bash
pip install outlines torch==2.4.0 transformers accelerate librosa
```

So that you can import them as follows:

```python
# LLM stuff
import outlines
import torch
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Audio stuff
import librosa
from io import BytesIO
from urllib.request import urlopen

# Data model stuff
from enum import Enum
from pydantic import BaseModel
from typing import Optional
```

## Load the model and processor

For audio analysis we will need a model and its processor to pre-process the prompts and the audio files. Let's set them up as follows:

```python
qwen2_audio = outlines.models.transformers_vision(
    "Qwen/Qwen2-Audio-7B-Instruct",
    model_class=Qwen2AudioForConditionalGeneration,
    model_kwargs={
        "device_map": "auto",
        "torch_dtype": torch.bfloat16,
    },
    processor_kwargs={
        "device": "cuda",  # set to "cpu" if you don't have a GPU
    },
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
```
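
Before moving on, you can optionally check which sampling rate the processor's feature extractor expects, since the helper below resamples the audio to it:

```python
# Optional sanity check: the helper below resamples audio to this rate
# (for Qwen2-Audio this is typically 16 kHz)
print(processor.feature_extractor.sampling_rate)
```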

Let's also define a helper that extracts the audio data from a conversational prompt:

```python
def audio_extractor(conversation):
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for elt in message["content"]:
                if elt["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(elt["audio_url"]).read()),
                            sr=processor.feature_extractor.sampling_rate,
                        )[0]
                    )
    return audios
```

## Question answering

Let's say we want to analyse and answer the question asked by the woman in this [audio](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav).

### Data structure

To get structured output, we can define the following data model:

```python
class Age(int, Enum):
    twenties = 20
    fifties = 50

class Gender(str, Enum):
    male = "male"
    female = "female"

class Person(BaseModel):
    gender: Gender
    age: Age
    language: Optional[str]
```
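
Since we will paste this schema into the prompt below, it can be useful to see what `Person.model_json_schema()` actually returns; a quick way to check:

```python
# Print the JSON schema embedded in the prompt below:
# it spells out the allowed Gender/Age enum values and the optional language field
print(Person.model_json_schema())
```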

### Prompting

Let's use the following prompt to query our model:

```python
audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": audio_url},
        {
            "type": "text",
            "text": f"""As asked in the audio, what is the gender and the age of the speaker?

Return the information in the following JSON schema:
{Person.model_json_schema()}
"""
        },
    ]},
]
```

But we cannot pass it raw! We need to pre-process it and handle the audio file.

```python
audios = audio_extractor(conversation)

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
```
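
As a quick, optional sanity check (assuming the audio file was downloaded correctly), you can inspect what the pre-processing produced:

```python
# One waveform per audio message in the conversation,
# resampled to the processor's sampling rate
print(len(audios), audios[0].shape)
```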

Now we're ready to ask our model!

### Run the model

As usual with the outlines framework, we instantiate a generator that structures the output according to our data model:

```python
person_generator = outlines.generate.json(
    qwen2_audio,
    Person,
    sampler=outlines.samplers.greedy()
)
```
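
If you prefer sampling over greedy decoding, outlines also provides other samplers; for instance (assuming the version you installed exposes it), the multinomial sampler can be swapped in:

```python
# Sketch: same generator, but with multinomial sampling instead of greedy decoding
person_generator = outlines.generate.json(
    qwen2_audio,
    Person,
    sampler=outlines.samplers.multinomial(),
)
```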

We then simply run it:

```python
result = person_generator(prompt, audios)
```

And you should expect a result like this:
```
Person(
    gender=<Gender.female: 'female'>,
    age=<Age.twenties: 20>,
    language='English'
)
```
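
Since `result` is a regular pydantic model instance, you can work with the typed fields directly, for example:

```python
# Access the structured fields (the values shown above are illustrative)
print(result.gender.value)  # e.g. "female"
print(result.age.value)     # e.g. 20
print(result.model_dump())  # plain dict representation
```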

## Classification

Now we can focus on this [audio](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3) of glass breaking.

The audio transformers integration lets you use all the functionalities of the outlines API, such as the `choice` method. We can proceed as follows:

### Prompting

Let's consider the following prompt and pre-process our audio:

```python
audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": audio_url},
        {
            "type": "text",
            "text": "Do you hear a dog barking or a glass breaking?"
        },
    ]},
]

audios = audio_extractor(conversation)

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
```

### Run the model

As mentioned, we will use the `choice` method to generate our structured output:

```python
choice_generator = outlines.generate.choice(
    qwen2_audio,
    ["dog barking", "glass breaking"],
)

result = choice_generator(prompt, audios)
```

And you should get:
```python
print(result)
# "glass breaking"
```
