Comparing Audacity's OpenVINO Whisper Transcription LLMs

Introduction

Several months ago, I posed a question in the r/audacity subreddit regarding the differences between various OpenVINO Whisper Transcription models and their impact on transcription quality. Having received no response, I conducted this comparative study independently.

Evaluating the performance of these models across diverse audio content is essential for a comprehensive assessment. This report compares four audio processing models: base, small, medium, and large-v3. The analysis encompasses scores achieved by each model on ten different audio tracks (labelled Track 1 through Track 10), along with their respective processing durations. The objective of this analysis is to provide a data-driven foundation for assessing each model’s effectiveness, examining their processing efficiency, and identifying the strengths and weaknesses of each model.

 

Analysis Data

All tracks, outputs from the different Audacity models, source code used to generate the scores, and intermediate processing stages along with summary data are available for download from GitHub: https://github.com/chribonn/Audacity-OpenVINO-Comparison.

This repository enables others to validate these findings and generate additional observations, with the ultimate goal of improving Audacity’s transcription process in terms of both performance and accuracy.

 

Key Insights

  • Base model is fastest but also least accurate.
  • Small model offers the best balance: slightly better than large-v3 in average accuracy while being significantly faster.
  • Medium and large-v3 demonstrate diminishing returns: much higher processing time with no gain in average accuracy over small.
  • Certain audio tracks proved challenging for all models.
  • Combining models (e.g., small for most tracks, large-v3 for specific cases) could optimise accuracy. This approach could be implemented as an enhancement in Audacity.

 

Methodology

Ten audio tracks were selected for this study. All tracks contained English language content but varied in length, file type, audio quality, accents, and presence of background sounds or distortions. Each track was analysed using four different OpenVINO Whisper models: base, small, medium, and large-v3.

The study utilised Audacity version 3.7.3 with OpenVINO AI Plugins Revision R4.2. For additional information on installing and enabling these OpenVINO plugins, refer to the tutorial video available at https://youtu.be/Szde2_casiE.

Track Name | Type | Bit rate (kbps) | Channels | Sample rate (kHz) | Duration | Notes
Track 1 | MP3 | 79 | 2 (stereo) | 48.000 | 00:01:05 | https://youtu.be/ol0gPsxOOZo
Track 2 | OGG | 1411 | 2 (stereo) | 44.100 | 00:00:22 |
Track 3 | FLAC | 401 | | | 00:03:57 | TTSSource.txt is the text fed into the TTS Engine
Track 4 | MP3 | 128 | 1 (mono) | 44.100 | 00:04:43 | Source: https://www.archive.org/download/soup_alphabets_002_0809_librivox/alphabets002_03_nonsense2_arb.mp3
Track 5 | MP3 | 16 | 1 (mono) | 11.025 | 00:04:16 | TWIT Security Now Episode #790, 27 Oct 2020, https://www.grc.com/sn/past/2020.htm
Track 6 | MP3 | 192 | 2 (stereo) | 44.100 | 00:04:48 | https://learnenglish.britishcouncil.org/general-english/audio-zone/beating-stress
Track 7 | WAV | 768 | 1 (mono) | 48.000 | 00:02:19 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language. https://datashare.ed.ac.uk/handle/10283/392 (R0002.wav)
Track 8 | WAV | 768 | 1 (mono) | 48.000 | 00:02:16 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language. https://datashare.ed.ac.uk/handle/10283/392 (R0006.wav)
Track 9 | WAV | 768 | 1 (mono) | 48.000 | 00:01:54 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language. https://datashare.ed.ac.uk/handle/10283/392 (R0011.wav)
Track 10 | WAV | 768 | 1 (mono) | 48.000 | 00:02:13 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language. https://datashare.ed.ac.uk/handle/10283/392 (R0299.wav)

The file information was sourced from the Details tab of the Microsoft Windows file Properties dialogue box.

 

To eliminate any potential influence from prior processing, each test (one audio file processed by one model) started from a freshly opened Audacity session: the file was opened, processed, and saved, and Audacity was then closed before the next test (40 tests in total).

 

 

Furthermore, the processing order of models was varied as shown below to mitigate potential order effects.

 

Track Name Process Order
Track 1 base, small, medium, large-v3
Track 2 large-v3, medium, small, base
Track 3 medium, small, large-v3, base
Track 4 large-v3, base, medium, small
Track 5 small, large-v3, base, medium
Track 6 medium, small, large-v3, base
Track 7 base, medium, small, large-v3
Track 8 base, large-v3, medium, small
Track 9 large-v3, small, medium, base
Track 10 small, medium, base, large-v3

 

Hardware Specifications

AI processing is computationally intensive, and the timing results presented are influenced by the available computing resources. While hardware specification is known to affect processing time, it is not known whether it also affects transcription quality.

From the Windows 11 System Information report:

Item Value
OS Name Microsoft Windows 11 Pro
Version 10.0.26100 Build 26100
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Manufacturer ASUS
System Model System Product Name
System Type x64-based PC
System SKU SKU
Processor AMD Ryzen 9 7900X3D 12-Core Processor, 4401 Mhz, 12 Core(s), 24 Logical Processor(s)
BIOS Version/Date American Megatrends Inc. 1813, 13/10/2023
SMBIOS Version 3.5
Embedded Controller Version 255.255
BIOS Mode UEFI
BaseBoard Manufacturer ASUSTeK COMPUTER INC.
BaseBoard Product PRIME X670E-PRO WIFI
BaseBoard Version Rev 1.xx
Platform Role Desktop
Installed Physical Memory (RAM) 64.0 GB
Display NVidia GeForce RTX 4070

 

The GPU was not utilised for any of the tests for the following reasons:

  • OpenVINO is not compatible with CUDA out-of-the-box

  • The GPU option was not available in the OpenVINO Inference Device drop-down list

In the OpenVINO Whisper Transcription dialogue box the only setting that was adjusted was the Whisper Model.

 

 

Processing Speed Analysis

Although the primary focus was not on processing time, these metrics were recorded and are presented here. It should be noted that timing was conducted visually using a stopwatch, introducing some measurement error. This error would be proportionally larger for processes completing within seconds.

During transcription, Audacity displays a dialogue box with a timer and progress bar, but this interface closes automatically upon completion, and the progress bar advances in discrete increments rather than smoothly. These factors may have contributed to timing inaccuracies.

 

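Track Name | base (s) | small (s) | medium (s) | large-v3 (s)
Track 1 | 1.0 | 3.8 | 10.9 | 19.7
Track 2 | 1.6 | 3.6 | 9.7 | 18.5
Track 3 | 8.2 | 18.9 | 51.2 | 140.0
Track 4 | 11.2 | 37.2 | 64.0 | 120.0
Track 5 | 10.6 | 24.9 | 65.0 | 117.9
Track 6 | 9.6 | 21.9 | 61.9 | 113.9
Track 7 | 4.5 | 11.8 | 34.0 | 46.6
Track 8 | 4.2 | 12.0 | 29.6 | 98.3
Track 9 | 3.2 | 8.4 | 32.9 | 75.8
Track 10 | 4.5 | 12.0 | 31.1 | 62.6
AVERAGE | 5.9 | 15.5 | 39.0 | 81.3
Range | 75.47
Variance | 1,128.68
Standard Deviation | 33.60

All times are in seconds. Range, Variance, and Standard Deviation are computed across the four model averages.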

Based on these observations, there is a considerable performance spread between the different models, with larger models requiring significantly more time to process the audio files.
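The per-model averages in the table can be reproduced with a few lines of Python. This is a minimal sketch and not part of the published script; the names `timings` and `average_seconds` are illustrative, and the values are copied from the table above.

```python
from statistics import mean

# Processing times in seconds, copied from the table above (Track 1 to Track 10).
timings = {
    "base":     [1.0, 1.6, 8.2, 11.2, 10.6, 9.6, 4.5, 4.2, 3.2, 4.5],
    "small":    [3.8, 3.6, 18.9, 37.2, 24.9, 21.9, 11.8, 12.0, 8.4, 12.0],
    "medium":   [10.9, 9.7, 51.2, 64.0, 65.0, 61.9, 34.0, 29.6, 32.9, 31.1],
    "large-v3": [19.7, 18.5, 140.0, 120.0, 117.9, 113.9, 46.6, 98.3, 75.8, 62.6],
}


def average_seconds(data):
    """Return the mean processing time per model."""
    return {model: mean(values) for model, values in data.items()}


averages = average_seconds(timings)
for model, avg in averages.items():
    # The slowdown factor is expressed relative to the base model's average.
    print(f"{model:9s} {avg:6.2f} s  ({avg / averages['base']:.1f}x base)")
```

Expressing each average as a multiple of the base model's time makes the spread between models easier to compare than the raw seconds alone.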

 

Transcription Processing Methodology

Each audio file was placed in a folder with a corresponding name. The file was processed by each model, and the resulting transcription text file was saved in the same folder. The four transcript files were named: Transcription(base).txt, Transcription(small).txt, Transcription(medium).txt, and Transcription(large-v3).txt.

The Audacity-generated transcription files follow this format:

Track 1 / large-v3

0.000000 9.560000 I’m going to walk over the process to download VMware Workstation Pro.
9.560000 15.340000 A few months ago, VMware was acquired by Broadcom,
15.340000 25.600000  and Broadcom soon after announced that VMware Workstation and Fusion were free for private use.

Track 1 / base

0.000000 10.720000 I’m going to walk over the process to download VMware Workstation Pro.
10.720000 21.120000  A few months ago VMware was acquired by Broadcom and Broadcom soon after announced that VMware
21.120000 33.200000 Workstation and Fusion were free for private use. To download these products, you need to register

 

The first two (tab-separated) columns give the start and end time of the transcribed text in seconds, to six decimal places. Different models segment the text differently and return different timings for the same sentence. This segmentation variability is an interesting phenomenon that warrants further investigation.
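To make the file layout concrete, here is a minimal sketch of how such a file could be read into (start, end, text) segments. It assumes each line holds exactly two tab-separated numeric columns followed by the text; `Segment` and `parse_transcription` are illustrative names and are not taken from audacity_transcription_score.py.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


def parse_transcription(content: str) -> List[Segment]:
    """Parse 'start<TAB>end<TAB>text' lines into segments."""
    segments = []
    for line in content.splitlines():
        if not line.strip():
            continue  # skip blank lines
        start, end, text = line.split("\t", 2)
        segments.append(Segment(float(start), float(end), text.strip()))
    return segments


# Two lines from the Track 1 / large-v3 sample shown above.
sample = (
    "9.560000\t15.340000\tA few months ago, VMware was acquired by Broadcom,\n"
    "15.340000\t25.600000\t and Broadcom soon after announced that VMware "
    "Workstation and Fusion were free for private use.\n"
)

for seg in parse_transcription(sample):
    print(f"{seg.end - seg.start:5.2f} s  {seg.text}")
```

Printing the duration of each segment is one simple way to see how differently the models split the same speech.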

The structural differences in the generated files necessitated additional processing to enable meaningful comparison, as raw comparisons would include differences in segmentation and timing that are outside the scope of this study.

For consistency, the reference file against which each transcription was compared was established based on the timings of the large-v3 output and named Source.txt.

A Python script (audacity_transcription_score.py) was created to process the files and generate the scores, with output directed to a file called output.md.

Certain tracks have additional files. These were named to reflect their purpose. They were not used in the analysis.

 

Score Generation Methodology

Scoring Algorithm

Scores were computed using Python’s difflib.SequenceMatcher class (https://docs.python.org/3/library/difflib.html).
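As a rough illustration of how a score on a 0 to 100 scale can be obtained from SequenceMatcher, the sketch below multiplies `ratio()` by 100. That scaling is an assumption made here for illustration; the exact call used in audacity_transcription_score.py should be checked in the repository. The function name `similarity_score` and the two sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_score(source: str, transcribed: str) -> float:
    """Return a 0-100 similarity score (assumed here to be ratio() * 100)."""
    return SequenceMatcher(None, source, transcribed).ratio() * 100


# Hypothetical strings: the candidate differs from the reference by one space.
reference = "a few months ago vmware was acquired by broadcom"
candidate = "a few months ago vm ware was acquired by broadcom"

print(f"{similarity_score(reference, reference):.2f}")
print(f"{similarity_score(reference, candidate):.2f}")
```

Because `ratio()` compares character sequences, small differences such as an extra space or a changed capital letter lower the score even when the words are the same.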

The Python code performed three types of comparisons between the Source and Transcribed text files:

  1. Unprocessed - Comparison based on unaltered files including timings. These results are inherently biased since the reference file (Source.txt) is based on large-v3 timings. This metric was included primarily to illustrate the structural differences in output between models.
  2. No Timings/LCase/no NL - Comparison after removing Audacity-generated timings, converting text to lowercase, and consolidating the output into a single string. This normalisation was necessary because different models transcribed the same text with varying capitalisation.
  3. No Timings/LCase/no NL Punct - Further processing that also removes period, comma, and double quote characters from the input.
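The normalisation steps in comparisons 2 and 3 can be sketched as follows. This is an illustrative reconstruction, not the author's code: the names `TIMING_PREFIX`, `strip_timings`, and `normalise` are hypothetical, and joining the lines with a single space is an assumption about how the output was consolidated into one string.

```python
import re

# Two leading numeric columns (start and end time) followed by the text.
TIMING_PREFIX = re.compile(r"^\s*\d+\.\d+\s+\d+\.\d+\s+")


def strip_timings(content: str) -> list:
    """Remove the start/end columns from every non-empty line."""
    return [
        TIMING_PREFIX.sub("", line).strip()
        for line in content.splitlines()
        if line.strip()
    ]


def normalise(content: str, remove_punct: bool = False) -> str:
    """Drop timings, lowercase, and join the lines into a single string."""
    # Assumption: lines are joined with one space.
    text = " ".join(strip_timings(content)).lower()
    if remove_punct:
        # Only period, comma, and double quote are removed, as in comparison 3.
        text = text.translate(str.maketrans("", "", '.,"'))
    return text


# Hypothetical input in the same layout as the transcription files.
raw = (
    "0.000000\t9.560000\tA few months ago, VMware was acquired.\n"
    '9.560000\t15.340000\t and Broadcom said "free".\n'
)

print(normalise(raw))                     # comparison 2
print(normalise(raw, remove_punct=True))  # comparison 3
```

Each normalised string would then be compared with the equally normalised Source.txt.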

All processing outputs were displayed on screen and written to the file output.md.

 

Results

Track 1 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct
base | 7.00 | 21.10 | 100.00
small | 41.23 | 15.58 | 100.00
medium | 60.13 | 82.56 | 100.00
large-v3 | 99.26 | 97.11 | 100.00

Track 2 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct
base | 14.00 | 11.09 | 12.08
small | 32.76 | 23.45 | 96.65
medium | 40.66 | 88.21 | 100.00
large-v3 | 53.82 | 65.22 | 100.00

Track 3 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct
base | 31.36 | 95.16 | 94.98
small | 29.33 | 96.16 | 97.26
medium | 32.57 | 96.25 | 97.16
large-v3 | 89.76 | 83.56 | 84.13

Track 4 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct
base | 9.28 | 16.75 | 17.77
small | 5.56 | 13.39 | 16.63
medium | 1.63 | 4.62 | 9.90
large-v3 | 50.69 | 16.41 | 23.08

Track 5 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct
base | 30.11 | 42.71 | 69.38
small | 15.91 | 63.89 | 74.65
medium | 12.44 | 38.58 | 49.73
large-v3 | 29.37 | 73.48 | 83.77

Track 6 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct
base | 19.75 | 4.52 | 4.82
small | 42.97 | 33.55 | 25.58
medium | 40.99 | 33.32 | 23.95
large-v3 | 75.88 | 4.78 | 23.09

Track 7 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct
base | 9.53 | 10.15 | 10.58
small | 31.68 | 31.47 | 95.70
medium | 37.88 | 31.40 | 95.53
large-v3 | 85.20 | 28.11 | 34.72

Track 8 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct
base | 36.73 | 96.19 | 99.70
small | 48.91 | 96.68 | 99.89
medium | 22.04 | 98.26 | 99.84
large-v3 | 97.52 | 80.75 | 95.19

Track 9 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct
base | 19.87 | 3.33 | 28.80
small | 11.95 | 6.14 | 47.21
medium | 24.74 | 11.50 | 37.50
large-v3 | 97.39 | 32.57 | 99.32

Track 10 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct
base | 13.95 | 5.26 | 8.94
small | 32.33 | 31.43 | 41.79
medium | 20.57 | 37.92 | 47.58
large-v3 | 91.18 | 42.20 | 47.66

 

Key Observations based on the No Timings/LCase/no NL Punct scores

Using scores derived from text with periods, commas, and double quotes removed may be interpreted as not accurately reflecting the transcription process. While this is an important consideration, all tracks—except for Track 1—lack the source text. Anyone wishing to account for these characters could perform a similar analysis based on the No Timings/LCase/no NL scoring.

The removal of end-of-paragraph markers was justified by the fact that Audacity does not account for them during the transcription process.

 

Average Score per Model

Model Average Score
base 44.71
small 69.54
medium 66.12
large-v3 69.10

Top scorer: 🔸 small

Lowest scorer: 🔹 base

 

Average Duration per Model

Model Average Duration (seconds)
base 5.86
small 15.45
medium 39.03
large-v3 81.33
 
Key finding: Larger models consistently require more processing time.

 

Without Track 1 (to reduce skewing)

Track 1 was the only track in which all the models scored 100. Since this was not observed in any of the other tracks, it was considered that this track might not accurately reflect real-world transcription processing and could be skewing the data.

The same analysis was then performed on the remaining nine tracks.

 

Average Score per Model (Tracks 2–10)

Model Average Score
base 38.56
small 66.15
medium 62.35
large-v3 65.66

Top scorer: 🔸 small

Lowest scorer: 🔹 base
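The per-model averages, with or without Track 1, can be recomputed directly from the No Timings/LCase/no NL Punct column of the Results table. The sketch below is illustrative only; `scores` and `average_scores` are hypothetical names, and the values are copied from the Results table in track order.

```python
from statistics import mean

# "No Timings/LCase/no NL Punct" scores from the Results table, Track 1 to Track 10.
scores = {
    "base":     [100.00, 12.08, 94.98, 17.77, 69.38, 4.82, 10.58, 99.70, 28.80, 8.94],
    "small":    [100.00, 96.65, 97.26, 16.63, 74.65, 25.58, 95.70, 99.89, 47.21, 41.79],
    "medium":   [100.00, 100.00, 97.16, 9.90, 49.73, 23.95, 95.53, 99.84, 37.50, 47.58],
    "large-v3": [100.00, 100.00, 84.13, 23.08, 83.77, 23.09, 34.72, 95.19, 99.32, 47.66],
}


def average_scores(data, exclude_tracks=()):
    """Average each model's scores, skipping the given 1-based track numbers."""
    skip = set(exclude_tracks)
    return {
        model: mean(v for track, v in enumerate(values, start=1) if track not in skip)
        for model, values in data.items()
    }


all_tracks = average_scores(scores)
without_track_1 = average_scores(scores, exclude_tracks=[1])

for model in scores:
    print(f"{model:9s} {all_tracks[model]:6.2f} {without_track_1[model]:6.2f}")
print("Top scorer:", max(without_track_1, key=without_track_1.get))
print("Lowest scorer:", min(without_track_1, key=without_track_1.get))
```

The `exclude_tracks` argument makes it easy to test how sensitive the ranking is to any single track, not only Track 1.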

 

Average Duration per Model (Tracks 2–10)

Model Average Duration (seconds)
base 6.40
small 16.74
medium 42.16
large-v3 88.18
 
Key finding: Larger models consistently require more processing time.

 

Conclusions

Small model offers the best trade-off: good average score and relatively short processing time.

Large-v3 isn’t always the best, despite having the longest processing duration.

The base model has the lowest average score, although it scored well on Tracks 1, 3, and 8.

Some tracks (e.g., 4, 6, and 10) are challenging across all models. (It would be interesting to understand which factors cause this.)

 

