Comparing Audacity's OpenVINO Whisper Transcription Models
Introduction
Several months ago, I posed a question in the r/audacity subreddit regarding the differences between various OpenVINO Whisper Transcription models and their impact on transcription quality. Having received no response, I conducted this comparative study independently.
Evaluating the performance of these models across diverse audio content is essential for a comprehensive assessment. This report compares four Whisper models: base, small, medium, and large-v3. The analysis covers the scores achieved by each model on ten audio tracks (labelled Track 1 through Track 10), along with their respective processing durations. The objective is to provide a data-driven foundation for assessing each model's effectiveness, examining its processing efficiency, and identifying the strengths and weaknesses of each model.
Analysis Data
All tracks, outputs from the different Audacity models, source code used to generate the scores, and intermediate processing stages along with summary data are available for download from GitHub: https://github.com/chribonn/Audacity-OpenVINO-Comparison.
This repository enables others to validate these findings and generate additional observations, with the ultimate goal of improving Audacity’s transcription process in terms of both performance and accuracy.
Key Insights
- Base model is fastest but also least accurate.
- Small model offers the best balance — slightly better than large-v3 in accuracy while being significantly faster.
- Medium and large-v3 demonstrate diminishing returns: higher processing time but only modest accuracy gains.
- Certain audio tracks proved challenging for all models.
- Combining models (e.g., small for most tracks, large-v3 for specific cases) could optimise accuracy. This approach could be implemented as an enhancement in Audacity.
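By way of illustration only, such a hybrid policy might look like the sketch below. `transcribe` and `agreement` are hypothetical placeholder callables, not Audacity or OpenVINO APIs; this is a sketch of the selection logic, not an implementation.

```python
# Hypothetical two-pass policy (not an existing Audacity feature):
# run the fast model first and escalate to large-v3 only when a cheap
# agreement check between two fast transcripts suggests a difficult track.
def pick_transcript(track, transcribe, agreement, threshold=0.9):
    fast = transcribe(track, model="small")
    check = transcribe(track, model="base")
    if agreement(fast, check) >= threshold:  # easy track: keep the fast result
        return fast
    return transcribe(track, model="large-v3")  # hard track: pay for accuracy
```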
Methodology
Ten audio tracks were selected for this study. All tracks contained English-language content but varied in length, file type, audio quality, accents, and the presence of background sounds or distortions. Each track was analysed using four different OpenVINO Whisper models: base, small, medium, and large-v3.
The study utilised Audacity version 3.7.3 with OpenVINO AI Plugins Revision R4.2. For additional information on installing and enabling these OpenVINO plugins, refer to the tutorial video available at https://youtu.be/Szde2_casiE.
Track Name | Type | Bit rate (kbps) | Channels | Sample rate (kHz) | Duration | Notes |
---|---|---|---|---|---|---|
Track 1 | MP3 | 79 | 2 (stereo) | 48.000 | 00:01:05 | https://youtu.be/ol0gPsxOOZo |
Track 2 | OGG | 1411 | 2 (stereo) | 44.100 | 00:00:22 | |
Track 3 | FLAC | 401 | | | 00:03:57 | TTSSource.txt is the text fed into the TTS Engine |
Track 4 | MP3 | 128 | 1 (mono) | 44.100 | 00:04:43 | Source: https://www.archive.org/download/soup_alphabets_002_0809_librivox/alphabets002_03_nonsense2_arb.mp3 |
Track 5 | MP3 | 16 | 1 (mono) | 11.025 | 00:04:16 | TWIT Security Now Episode #790 | 27 Oct 2020 https://www.grc.com/sn/past/2020.htm |
Track 6 | MP3 | 192 | 2 (stereo) | 44.100 | 00:04:48 | https://learnenglish.britishcouncil.org/general-english/audio-zone/beating-stress |
Track 7 | WAV | 768 | 1 (mono) | 48.000 | 00:02:19 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language https://datashare.ed.ac.uk/handle/10283/392 (R0002.wav) |
Track 8 | WAV | 768 | 1 (mono) | 48.000 | 00:02:16 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language https://datashare.ed.ac.uk/handle/10283/392 (R0006.wav) |
Track 9 | WAV | 768 | 1 (mono) | 48.000 | 00:01:54 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language https://datashare.ed.ac.uk/handle/10283/392 (R0011.wav) |
Track 10 | WAV | 768 | 1 (mono) | 48.000 | 00:02:13 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language https://datashare.ed.ac.uk/handle/10283/392 (R0299.wav) |
The file information was sourced from the Details tab of Microsoft Windows Properties dialogue box.
To eliminate any potential influence from prior processing, each audio file/model combination was tested in a fresh session: the file was opened in Audacity, processed, and saved, and Audacity was closed before the next test (40 test runs in total).
Furthermore, the processing order of the models was varied, as shown below, to mitigate potential order effects (a sketch of how such orders can be generated follows the table).
Track Name | Process Order |
---|---|
Track 1 | base, small, medium, large-v3 |
Track 2 | large-v3, medium, small, base |
Track 3 | medium, small, large-v3, base |
Track 4 | large-v3, base, medium, small |
Track 5 | small, large-v3, base, medium |
Track 6 | medium, small, large-v3, base |
Track 7 | base, medium, small, large-v3 |
Track 8 | base, large-v3, medium, small |
Track 9 | large-v3, small, medium, base |
Track 10 | small, medium, base, large-v3 |
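For reference, the same counterbalancing idea can be expressed programmatically. The snippet below is purely illustrative; the orders in the table above were fixed manually, not generated by this code.

```python
import random

MODELS = ["base", "small", "medium", "large-v3"]

# One shuffled processing order per track, so that no model is systematically
# run first and any order effects average out across the ten tracks.
orders = {f"Track {n}": random.sample(MODELS, len(MODELS)) for n in range(1, 11)}
for track, order in orders.items():
    print(f"{track}: {', '.join(order)}")
```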
Hardware Specifications
AI processing is computationally intensive, and the timing results presented are influenced by the available computing resources. While hardware specification is known to affect processing time, it is not known whether there is a correlation between hardware specification and transcription quality.
From the Windows 11 System Information report:
Item | Value |
---|---|
OS Name | Microsoft Windows 11 Pro |
Version | 10.0.26100 Build 26100 |
Other OS Description | Not Available |
OS Manufacturer | Microsoft Corporation |
System Manufacturer | ASUS |
System Model | System Product Name |
System Type | x64-based PC |
System SKU | SKU |
Processor | AMD Ryzen 9 7900X3D 12-Core Processor, 4401 Mhz, 12 Core(s), 24 Logical Processor(s) |
BIOS Version/Date | American Megatrends Inc. 1813, 13/10/2023 |
SMBIOS Version | 3.5 |
Embedded Controller Version | 255.255 |
BIOS Mode | UEFI |
BaseBoard Manufacturer | ASUSTeK COMPUTER INC. |
BaseBoard Product | PRIME X670E-PRO WIFI |
BaseBoard Version | Rev 1.xx |
Platform Role | Desktop |
Installed Physical Memory (RAM) | 64.0 GB |
Display | NVIDIA GeForce RTX 4070 |
The GPU was not utilised for any of the tests for the following reasons:
- OpenVINO is not compatible with CUDA out of the box.
- The GPU option was not available in the OpenVINO Inference Device drop-down list.
In the OpenVINO Whisper Transcription dialogue box, the only setting adjusted was the Whisper Model.
Processing Speed Analysis
Although the primary focus was not on processing time, these metrics were recorded and are presented here. It should be noted that timing was conducted visually using a stopwatch, introducing some measurement error. This error would be proportionally larger for processes completing within seconds.
During transcription, Audacity displays a dialogue box with a timer and progress bar, but this interface closes automatically upon completion, and the progress bar advances in discrete increments rather than smoothly. These factors may have contributed to timing inaccuracies.
Track Name | base | small | medium | large-v3 |
---|---|---|---|---|
Track 1 | 1.0 | 3.8 | 10.9 | 19.7 |
Track 2 | 1.6 | 3.6 | 9.7 | 18.5 |
Track 3 | 8.2 | 18.9 | 51.2 | 140.0 |
Track 4 | 11.2 | 37.2 | 64.0 | 120.0 |
Track 5 | 10.6 | 24.9 | 65.0 | 117.9 |
Track 6 | 9.6 | 21.9 | 61.9 | 113.9 |
Track 7 | 4.5 | 11.8 | 34.0 | 46.6 |
Track 8 | 4.2 | 12.0 | 29.6 | 98.3 |
Track 9 | 3.2 | 8.4 | 32.9 | 75.8 |
Track 10 | 4.5 | 12.0 | 31.1 | 62.6 |
AVERAGE | 5.9 | 15.5 | 39.0 | 81.3 |
Across the four per-model average durations, the range is 75.47 seconds, the sample variance 1,128.68, and the standard deviation 33.60 seconds.
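These dispersion figures can be checked with Python's statistics module (which computes the sample variance), using the rounded averages from the table; small deviations from the reported values are rounding artefacts.

```python
import statistics

# Average processing duration per model, in seconds (from the table above)
averages = {"base": 5.9, "small": 15.5, "medium": 39.0, "large-v3": 81.3}

print(max(averages.values()) - min(averages.values()))  # range           ~ 75.4
print(statistics.variance(averages.values()))           # sample variance ~ 1128.7
print(statistics.stdev(averages.values()))              # std deviation   ~ 33.6
```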
Based on these observations, there is a considerable performance spread between the different models, with larger models requiring significantly more time to process the audio files.
Transcription Processing Methodology
Each audio file was placed in a folder with a corresponding name. The file was processed by each Whisper model, and the resulting transcription text file was saved in the same folder. The four transcript files were named: Transcription(base).txt, Transcription(small).txt, Transcription(medium).txt, and Transcription(large-v3).txt.
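To give a concrete picture of this layout, the sketch below reads the four transcripts for one track back into memory. The helper is illustrative and is not code from the repository; only the folder and file naming follows the description above.

```python
from pathlib import Path

MODELS = ["base", "small", "medium", "large-v3"]

def load_transcripts(track_dir: Path) -> dict[str, str]:
    """Read the four per-model transcripts stored alongside a track."""
    return {
        model: (track_dir / f"Transcription({model}).txt").read_text(encoding="utf-8")
        for model in MODELS
    }
```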
The Audacity-generated transcription files follow this format:
Track 1 / large-v3
0.000000 | 9.560000 | I’m going to walk over the process to download VMware Workstation Pro. |
9.560000 | 15.340000 | A few months ago, VMware was acquired by Broadcom, |
15.340000 | 25.600000 | and Broadcom soon after announced that VMware Workstation and Fusion were free for private use. |
Track 1 / base
0.000000 | 10.720000 | I’m going to walk over the process to download VMware Workstation Pro. |
10.720000 | 21.120000 | A few months ago VMware was acquired by Broadcom and Broadcom soon after announced that VMware |
21.120000 | 33.200000 | Workstation and Fusion were free for private use. To download these products, you need to register |
The first two (tab-separated) columns represent the start and end times of the transcribed text, in seconds with six decimal places. Different models segment the text differently and return different timings for the same sentence. This segmentation variability is an interesting phenomenon that warrants further investigation.
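Assuming the tab-separated layout described above, a single transcript line can be split into its timing and text fields as follows (an illustrative helper, not code from the repository):

```python
def parse_line(line: str) -> tuple[float, float, str]:
    """Split an Audacity transcript line into (start_s, end_s, text)."""
    start, end, text = line.rstrip("\n").split("\t", 2)
    return float(start), float(end), text.strip()

# Example: parse_line("0.000000\t9.560000\tI'm going to walk over...")
# returns (0.0, 9.56, "I'm going to walk over...")
```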
The structural differences in the generated files necessitated additional processing to enable meaningful comparison, as raw comparisons would include extraneous differences that fall outside the scope of this study.
For consistency, the reference file used for comparison was based on the timings of the large-v3 output and named Source.txt.
A Python script (audacity_transcription_score.py) was created to process the files and generate the scores, with output directed to a file called output.md.
Certain tracks have additional files. These were named to reflect their purpose. They were not used in the analysis.
Score Generation Methodology
Scoring Algorithm
Scores were computed using Python’s difflib.SequenceMatcher class (https://docs.python.org/3/library/difflib.html).
The Python code performed three types of comparisons between the Source and Transcribed text files:
- Unprocessed - Comparison based on unaltered files including timings. These results are inherently biased since the source-compare-to file is based on large-v3 timings. This metric was included primarily to illustrate the structural differences in output between models.
- No Timings/LCase/no NL - Comparison after removing Audacity-generated timings, converting text to lowercase, and consolidating the output into a single string. This normalisation was necessary because different models transcribed the same text with varying capitalisation.
- No Timings/LCase/no NL Punct - Further processing that also removes period, comma, and double-quote characters from the input.
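A minimal sketch of this scoring pipeline, assuming the tab-separated transcript layout described earlier; the authoritative implementation is audacity_transcription_score.py in the repository.

```python
import difflib

def normalize(text: str, strip_punct: bool = False) -> str:
    """Drop timing columns, lowercase, and join all lines into one string."""
    parts = [line.split("\t")[-1].strip() for line in text.splitlines()]
    joined = " ".join(parts).lower()
    if strip_punct:  # the "no NL Punct" variant
        for ch in ('.', ',', '"'):
            joined = joined.replace(ch, "")
    return joined

def score(source: str, transcript: str) -> float:
    """Similarity as a percentage, via difflib.SequenceMatcher.ratio()."""
    return difflib.SequenceMatcher(None, source, transcript).ratio() * 100
```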
All processing outputs were displayed on screen and written to the file output.md.
Results
Track 1 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 7.00 | 21.10 | 100.00 |
small | 41.23 | 15.58 | 100.00 |
medium | 60.13 | 82.56 | 100.00 |
large-v3 | 99.26 | 97.11 | 100.00 |
Track 2 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 14.00 | 11.09 | 12.08 |
small | 32.76 | 23.45 | 96.65 |
medium | 40.66 | 88.21 | 100.00 |
large-v3 | 53.82 | 65.22 | 100.00 |
Track 3 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 31.36 | 95.16 | 94.98 |
small | 29.33 | 96.16 | 97.26 |
medium | 32.57 | 96.25 | 97.16 |
large-v3 | 89.76 | 83.56 | 84.13 |
Track 4 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 9.28 | 16.75 | 17.77 |
small | 5.56 | 13.39 | 16.63 |
medium | 1.63 | 4.62 | 9.90 |
large-v3 | 50.69 | 16.41 | 23.08 |
Track 5 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 30.11 | 42.71 | 69.38 |
small | 15.91 | 63.89 | 74.65 |
medium | 12.44 | 38.58 | 49.73 |
large-v3 | 29.37 | 73.48 | 83.77 |
Track 6 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 19.75 | 4.52 | 4.82 |
small | 42.97 | 33.55 | 25.58 |
medium | 40.99 | 33.32 | 23.95 |
large-v3 | 75.88 | 4.78 | 23.09 |
Track 7 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 9.53 | 10.15 | 10.58 |
small | 31.68 | 31.47 | 95.70 |
medium | 37.88 | 31.40 | 95.53 |
large-v3 | 85.20 | 28.11 | 34.72 |
Track 8 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 36.73 | 96.19 | 99.70 |
small | 48.91 | 96.68 | 99.89 |
medium | 22.04 | 98.26 | 99.84 |
large-v3 | 97.52 | 80.75 | 95.19 |
Track 9 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 19.87 | 3.33 | 28.80 |
small | 11.95 | 6.14 | 47.21 |
medium | 24.74 | 11.50 | 37.50 |
large-v3 | 97.39 | 32.57 | 99.32 |
Track 10 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 13.95 | 5.26 | 8.94 |
small | 32.33 | 31.43 | 41.79 |
medium | 20.57 | 37.92 | 47.58 |
large-v3 | 91.18 | 42.20 | 47.66 |
Key Observations based on the No Timings/LCase/no NL Punct scores
Using scores derived from text with periods, commas, and double quotes removed may be seen as not accurately reflecting the transcription process. While this is an important consideration, the original source text is unavailable for all tracks except Track 1. Anyone wishing to account for these characters can perform a similar analysis based on the No Timings/LCase/no NL scoring.
The removal of end-of-paragraph markers was justified by the fact that Audacity does not account for them during the transcription process.
Average Score per Model
Model | Average Score |
---|---|
base | 44.71 |
small | 69.54 |
medium | 66.12 |
large-v3 | 69.10 |
Top scorer: 🔸 small
Lowest scorer: 🔹 base
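These per-model averages can be reproduced directly from the No Timings/LCase/no NL Punct columns of the Results tables above; the following check matches the table to two decimal places.

```python
# No Timings/LCase/no NL Punct scores for Tracks 1-10, per model (from above)
punct_scores = {
    "base":     [100.00,  12.08, 94.98, 17.77, 69.38,  4.82, 10.58, 99.70, 28.80,  8.94],
    "small":    [100.00,  96.65, 97.26, 16.63, 74.65, 25.58, 95.70, 99.89, 47.21, 41.79],
    "medium":   [100.00, 100.00, 97.16,  9.90, 49.73, 23.95, 95.53, 99.84, 37.50, 47.58],
    "large-v3": [100.00, 100.00, 84.13, 23.08, 83.77, 23.09, 34.72, 95.19, 99.32, 47.66],
}
for model, scores in punct_scores.items():
    print(f"{model}: {sum(scores) / len(scores):.2f}")
# base: 44.71, small: 69.54, medium: 66.12, large-v3: 69.10
```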
Average Duration per Model
Model | Average Duration (seconds) |
---|---|
base | 5.86 |
small | 15.45 |
medium | 39.03 |
large-v3 | 81.33 |
Without Track 1 (to reduce skewing)
Track 1 was the only track in which all the models scored 100. Since this was not observed in any of the other tracks, it was considered that this track might not accurately reflect real-world transcription processing and could be skewing the data.
The same analysis was then performed on the remaining nine tracks.
Average Score per Model (Tracks 2–10)
Model | Average Score |
---|---|
base | 38.24 |
small | 66.93 |
medium | 61.11 |
large-v3 | 65.22 |
Top scorer: 🔸 small
Lowest scorer: 🔹 base
Average Duration per Model
Model | Average Duration (seconds) |
---|---|
base | 6.96 |
small | 16.54 |
medium | 45.27 |
large-v3 | 88.62 |
Conclusions
- Large-v3 is not always the best performer, despite having the longest processing duration.
- The base model's performance is consistently poor.
- Some tracks (e.g., 6, 9, 10) are challenging for all models. (It would be interesting to understand what factors cause this.)