Comparing Audacity's OpenVINO Whisper Transcription Models
Introduction
Several months ago, I posed a question in the r/audacity subreddit regarding the differences between various OpenVINO Whisper Transcription models and their impact on transcription quality. Having received no response, I conducted this comparative study independently.
Evaluating the performance of these models across diverse audio content is essential for a comprehensive assessment. This report compares four Whisper models: base, small, medium, and large-v3. The analysis covers the scores achieved by each model on ten audio tracks (labelled Track 1 through Track 10), along with their respective processing durations. The objective is to provide a data-driven foundation for assessing each model's effectiveness, examining its processing efficiency, and identifying the strengths and weaknesses of each model.
Analysis Data
All tracks, outputs from the different Audacity models, source code used to generate the scores, and intermediate processing stages along with summary data are available for download from GitHub: https://github.com/chribonn/Audacity-OpenVINO-Comparison.
This repository enables others to validate these findings and generate additional observations, with the ultimate goal of improving Audacity’s transcription process in terms of both performance and accuracy.
Key Insights
- Base model is fastest but also least accurate.
- Small model offers the best balance — slightly better than large-v3 in accuracy while being significantly faster.
- Medium and large-v3 demonstrate diminishing returns: higher processing time but only modest accuracy gains.
- Certain audio tracks proved challenging for all models.
- Combining models (e.g., small for most tracks, large-v3 for specific cases) could optimise accuracy. This approach could be implemented as an enhancement in Audacity.
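By way of illustration only, such a hybrid policy might look like the sketch below. `transcribe` and `agreement` are hypothetical placeholder callables, not Audacity or OpenVINO APIs; this is a sketch of the selection logic, not an implementation.

```python
# Hypothetical two-pass policy (not an existing Audacity feature):
# run the fast model first and escalate to large-v3 only when a cheap
# agreement check between two fast transcripts suggests a difficult track.
def pick_transcript(track, transcribe, agreement, threshold=0.9):
    fast = transcribe(track, model="small")
    check = transcribe(track, model="base")
    if agreement(fast, check) >= threshold:  # easy track: keep the fast result
        return fast
    return transcribe(track, model="large-v3")  # hard track: pay for accuracy
```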
Methodology
Ten audio tracks were selected for this study. All tracks contained English-language content but varied in length, file type, audio quality, accents, and the presence of background sounds or distortions. Each track was analysed using four different OpenVINO Whisper models: base, small, medium, and large-v3.
The study utilised Audacity version 3.7.3 with OpenVINO AI Plugins Revision R4.2. For additional information on installing and enabling these OpenVINO plugins, refer to the tutorial video available at https://youtu.be/Szde2_casiE.
Track Name | Type | Bit rate (kbps) | Channels | Sample rate (kHz) | Duration | Notes |
---|---|---|---|---|---|---|
Track 1 | MP3 | 79 | 2 (stereo) | 48.000 | 00:01:05 | https://youtu.be/ol0gPsxOOZo |
Track 2 | OGG | 1411 | 2 (stereo) | 44.100 | 00:00:22 | |
Track 3 | FLAC | 401 | | | 00:03:57 | TTSSource.txt is the text fed into the TTS Engine |
Track 4 | MP3 | 128 | 1 (mono) | 44.100 | 00:04:43 | Source: https://www.archive.org/download/soup_alphabets_002_0809_librivox/alphabets002_03_nonsense2_arb.mp3 |
Track 5 | MP3 | 16 | 1 (mono) | 11.025 | 00:04:16 | TWIT Security Now Episode #790 | 27 Oct 2020 https://www.grc.com/sn/past/2020.htm |
Track 6 | MP3 | 192 | 2 (stereo) | 44.100 | 00:04:48 | https://learnenglish.britishcouncil.org/general-english/audio-zone/beating-stress |
Track 7 | WAV | 768 | 1 (mono) | 48.000 | 00:02:19 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language https://datashare.ed.ac.uk/handle/10283/392 (R0002.wav) |
Track 8 | WAV | 768 | 1 (mono) | 48.000 | 00:02:16 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language https://datashare.ed.ac.uk/handle/10283/392 (R0006.wav) |
Track 9 | WAV | 768 | 1 (mono) | 48.000 | 00:01:54 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language https://datashare.ed.ac.uk/handle/10283/392 (R0011.wav) |
Track 10 | WAV | 768 | 1 (mono) | 48.000 | 00:02:13 | University of Edinburgh. School of Philosophy, Psychology, and Language Sciences. Department of Linguistics and English Language https://datashare.ed.ac.uk/handle/10283/392 (R0299.wav) |
The file information was sourced from the Details tab of Microsoft Windows Properties dialogue box.
To eliminate any potential influence from prior processing, each audio file/model combination was tested in a fresh session: the file was opened in Audacity, processed, and saved, and Audacity was closed before the next test (40 test runs in total).
Furthermore, the processing order of the models was varied, as shown below, to mitigate potential order effects (a sketch of how such orders can be generated follows the table).
Track Name | Process Order |
---|---|
Track 1 | base, small, medium, large-v3 |
Track 2 | large-v3, medium, small, base |
Track 3 | medium, small, large-v3, base |
Track 4 | large-v3, base, medium, small |
Track 5 | small, large-v3, base, medium |
Track 6 | medium, small, large-v3, base |
Track 7 | base, medium, small, large-v3 |
Track 8 | base, large-v3, medium, small |
Track 9 | large-v3, small, medium, base |
Track 10 | small, medium, base, large-v3 |
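For reference, the same counterbalancing idea can be expressed programmatically. The snippet below is purely illustrative; the orders in the table above were fixed manually, not generated by this code.

```python
import random

MODELS = ["base", "small", "medium", "large-v3"]

# One shuffled processing order per track, so that no model is systematically
# run first and any order effects average out across the ten tracks.
orders = {f"Track {n}": random.sample(MODELS, len(MODELS)) for n in range(1, 11)}
for track, order in orders.items():
    print(f"{track}: {', '.join(order)}")
```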
Hardware Specifications
AI processing is computationally intensive, and the timing results presented are influenced by the available computing resources. While hardware specification is known to affect processing time, it is not known whether there is a correlation between hardware specification and transcription quality.
From the Windows 11 System Information report:
Item | Value |
---|---|
OS Name | Microsoft Windows 11 Pro |
Version | 10.0.26100 Build 26100 |
Other OS Description | Not Available |
OS Manufacturer | Microsoft Corporation |
System Manufacturer | ASUS |
System Model | System Product Name |
System Type | x64-based PC |
System SKU | SKU |
Processor | AMD Ryzen 9 7900X3D 12-Core Processor, 4401 Mhz, 12 Core(s), 24 Logical Processor(s) |
BIOS Version/Date | American Megatrends Inc. 1813, 13/10/2023 |
SMBIOS Version | 3.5 |
Embedded Controller Version | 255.255 |
BIOS Mode | UEFI |
BaseBoard Manufacturer | ASUSTeK COMPUTER INC. |
BaseBoard Product | PRIME X670E-PRO WIFI |
BaseBoard Version | Rev 1.xx |
Platform Role | Desktop |
Installed Physical Memory (RAM) | 64.0 GB |
Display | NVIDIA GeForce RTX 4070 |
The GPU was not utilised for any of the tests for the following reasons:
- OpenVINO is not compatible with CUDA out of the box.
- The GPU option was not available in the OpenVINO Inference Device drop-down list.
In the OpenVINO Whisper Transcription dialogue box, the only setting adjusted was the Whisper Model.
Processing Speed Analysis
Although the primary focus was not on processing time, these metrics were recorded and are presented here. It should be noted that timing was conducted visually using a stopwatch, introducing some measurement error. This error would be proportionally larger for processes completing within seconds.
During transcription, Audacity displays a dialogue box with a timer and progress bar, but this interface closes automatically upon completion, and the progress bar advances in discrete increments rather than smoothly. These factors may have contributed to timing inaccuracies.
Track Name | base | small | medium | large-v3 |
---|---|---|---|---|
Track 1 | 1.0 | 3.8 | 10.9 | 19.7 |
Track 2 | 1.6 | 3.6 | 9.7 | 18.5 |
Track 3 | 8.2 | 18.9 | 51.2 | 140.0 |
Track 4 | 11.2 | 37.2 | 64.0 | 120.0 |
Track 5 | 10.6 | 24.9 | 65.0 | 117.9 |
Track 6 | 9.6 | 21.9 | 61.9 | 113.9 |
Track 7 | 4.5 | 11.8 | 34.0 | 46.6 |
Track 8 | 4.2 | 12.0 | 29.6 | 98.3 |
Track 9 | 3.2 | 8.4 | 32.9 | 75.8 |
Track 10 | 4.5 | 12.0 | 31.1 | 62.6 |
AVERAGE | 5.9 | 15.5 | 39.0 | 81.3 |
Across the four per-model average durations, the range is 75.47 seconds, the sample variance 1,128.68, and the standard deviation 33.60 seconds.
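These dispersion figures can be checked with Python's statistics module (which computes the sample variance), using the rounded averages from the table; small deviations from the reported values are rounding artefacts.

```python
import statistics

# Average processing duration per model, in seconds (from the table above)
averages = {"base": 5.9, "small": 15.5, "medium": 39.0, "large-v3": 81.3}

print(max(averages.values()) - min(averages.values()))  # range           ~ 75.4
print(statistics.variance(averages.values()))           # sample variance ~ 1128.7
print(statistics.stdev(averages.values()))              # std deviation   ~ 33.6
```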
Based on these observations, there is a considerable performance spread between the different models, with larger models requiring significantly more time to process the audio files.
Transcription Processing Methodology
Each audio file was placed in a folder with a corresponding name. The file was processed by each Whisper model, and the resulting transcription text file was saved in the same folder. The four transcript files were named: Transcription(base).txt, Transcription(small).txt, Transcription(medium).txt, and Transcription(large-v3).txt.
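To give a concrete picture of this layout, the sketch below reads the four transcripts for one track back into memory. The helper is illustrative and is not code from the repository; only the folder and file naming follows the description above.

```python
from pathlib import Path

MODELS = ["base", "small", "medium", "large-v3"]

def load_transcripts(track_dir: Path) -> dict[str, str]:
    """Read the four per-model transcripts stored alongside a track."""
    return {
        model: (track_dir / f"Transcription({model}).txt").read_text(encoding="utf-8")
        for model in MODELS
    }
```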
The Audacity-generated transcription files follow this format:
Track 1 / large-v3
0.000000 | 9.560000 | I’m going to walk over the process to download VMware Workstation Pro. |
9.560000 | 15.340000 | A few months ago, VMware was acquired by Broadcom, |
15.340000 | 25.600000 | and Broadcom soon after announced that VMware Workstation and Fusion were free for private use. |
Track 1 / base
0.000000 | 10.720000 | I’m going to walk over the process to download VMware Workstation Pro. |
10.720000 | 21.120000 | A few months ago VMware was acquired by Broadcom and Broadcom soon after announced that VMware |
21.120000 | 33.200000 | Workstation and Fusion were free for private use. To download these products, you need to register |
The first two (tab-separated) columns represent the start and end times of the transcribed text, in seconds with six decimal places. Different models segment the text differently and return different timings for the same sentence. This segmentation variability is an interesting phenomenon that warrants further investigation.
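Assuming the tab-separated layout described above, a single transcript line can be split into its timing and text fields as follows (an illustrative helper, not code from the repository):

```python
def parse_line(line: str) -> tuple[float, float, str]:
    """Split an Audacity transcript line into (start_s, end_s, text)."""
    start, end, text = line.rstrip("\n").split("\t", 2)
    return float(start), float(end), text.strip()

# Example: parse_line("0.000000\t9.560000\tI'm going to walk over...")
# returns (0.0, 9.56, "I'm going to walk over...")
```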
The structural differences in the generated files necessitated additional processing to enable meaningful comparison, as raw comparisons would include extraneous differences that fall outside the scope of this study.
For consistency, the reference file used for comparison was based on the timings of the large-v3 output and named Source.txt.
A Python script (audacity_transcription_score.py) was created to process the files and generate the scores, with output directed to a file called output.md.
Certain tracks have additional files. These were named to reflect their purpose. They were not used in the analysis.
Score Generation Methodology
Scoring Algorithm
Scores were computed using Python’s difflib.SequenceMatcher class (https://docs.python.org/3/library/difflib.html).
The Python code performed three types of comparisons between the Source and Transcribed text files:
- Unprocessed - Comparison based on unaltered files including timings. These results are inherently biased since the source-compare-to file is based on large-v3 timings. This metric was included primarily to illustrate the structural differences in output between models.
- No Timings/LCase/no NL - Comparison after removing Audacity-generated timings, converting text to lowercase, and consolidating the output into a single string. This normalisation was necessary because different models transcribed the same text with varying capitalisation.
- No Timings/LCase/no NL Punct - Further processing that also removes period, comma, and double-quote characters from the input.
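A minimal sketch of this scoring pipeline, assuming the tab-separated transcript layout described earlier; the authoritative implementation is audacity_transcription_score.py in the repository.

```python
import difflib

def normalize(text: str, strip_punct: bool = False) -> str:
    """Drop timing columns, lowercase, and join all lines into one string."""
    parts = [line.split("\t")[-1].strip() for line in text.splitlines()]
    joined = " ".join(parts).lower()
    if strip_punct:  # the "no NL Punct" variant
        for ch in ('.', ',', '"'):
            joined = joined.replace(ch, "")
    return joined

def score(source: str, transcript: str) -> float:
    """Similarity as a percentage, via difflib.SequenceMatcher.ratio()."""
    return difflib.SequenceMatcher(None, source, transcript).ratio() * 100
```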
All processing outputs were displayed on screen and written to the file output.md.
Results
Track 1 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 7.00 | 21.10 | 100.00 |
small | 41.23 | 15.58 | 100.00 |
medium | 60.13 | 82.56 | 100.00 |
large-v3 | 99.26 | 97.11 | 100.00 |
Track 2 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 14.00 | 11.09 | 12.08 |
small | 32.76 | 23.45 | 96.65 |
medium | 40.66 | 88.21 | 100.00 |
large-v3 | 53.82 | 65.22 | 100.00 |
Track 3 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 31.36 | 95.16 | 94.98 |
small | 29.33 | 96.16 | 97.26 |
medium | 32.57 | 96.25 | 97.16 |
large-v3 | 89.76 | 83.56 | 84.13 |
Track 4 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 9.28 | 16.75 | 17.77 |
small | 5.56 | 13.39 | 16.63 |
medium | 1.63 | 4.62 | 9.90 |
large-v3 | 50.69 | 16.41 | 23.08 |
Track 5 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 30.11 | 42.71 | 69.38 |
small | 15.91 | 63.89 | 74.65 |
medium | 12.44 | 38.58 | 49.73 |
large-v3 | 29.37 | 73.48 | 83.77 |
Track 6 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 19.75 | 4.52 | 4.82 |
small | 42.97 | 33.55 | 25.58 |
medium | 40.99 | 33.32 | 23.95 |
large-v3 | 75.88 | 4.78 | 23.09 |
Track 7 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 9.53 | 10.15 | 10.58 |
small | 31.68 | 31.47 | 95.70 |
medium | 37.88 | 31.40 | 95.53 |
large-v3 | 85.20 | 28.11 | 34.72 |
Track 8 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 36.73 | 96.19 | 99.70 |
small | 48.91 | 96.68 | 99.89 |
medium | 22.04 | 98.26 | 99.84 |
large-v3 | 97.52 | 80.75 | 95.19 |
Track 9 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 19.87 | 3.33 | 28.80 |
small | 11.95 | 6.14 | 47.21 |
medium | 24.74 | 11.50 | 37.50 |
large-v3 | 97.39 | 32.57 | 99.32 |
Track 10 | Unprocessed | No Timings/LCase/no NL | No Timings/LCase/no NL Punct |
---|---|---|---|
base | 13.95 | 5.26 | 8.94 |
small | 32.33 | 31.43 | 41.79 |
medium | 20.57 | 37.92 | 47.58 |
large-v3 | 91.18 | 42.20 | 47.66 |
Key Observations based on the No Timings/LCase/no NL Punct scores
Using scores derived from text with periods, commas, and double quotes removed may be seen as not accurately reflecting the transcription process. While this is an important consideration, the original source text is unavailable for all tracks except Track 1. Anyone wishing to account for these characters can perform a similar analysis based on the No Timings/LCase/no NL scoring.
The removal of end-of-paragraph markers was justified by the fact that Audacity does not account for them during the transcription process.
Average Score per Model
Model | Average Score |
---|---|
base | 44.71 |
small | 69.54 |
medium | 66.12 |
large-v3 | 69.10 |
Top scorer: 🔸 small
Lowest scorer: 🔹 base
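These per-model averages can be reproduced directly from the No Timings/LCase/no NL Punct columns of the Results tables above; the following check matches the table to two decimal places.

```python
# No Timings/LCase/no NL Punct scores for Tracks 1-10, per model (from above)
punct_scores = {
    "base":     [100.00,  12.08, 94.98, 17.77, 69.38,  4.82, 10.58, 99.70, 28.80,  8.94],
    "small":    [100.00,  96.65, 97.26, 16.63, 74.65, 25.58, 95.70, 99.89, 47.21, 41.79],
    "medium":   [100.00, 100.00, 97.16,  9.90, 49.73, 23.95, 95.53, 99.84, 37.50, 47.58],
    "large-v3": [100.00, 100.00, 84.13, 23.08, 83.77, 23.09, 34.72, 95.19, 99.32, 47.66],
}
for model, scores in punct_scores.items():
    print(f"{model}: {sum(scores) / len(scores):.2f}")
# base: 44.71, small: 69.54, medium: 66.12, large-v3: 69.10
```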
Average Duration per Model
Model | Average Duration (seconds) |
---|---|
base | 5.86 |
small | 15.45 |
medium | 39.03 |
large-v3 | 81.33 |
Without Track 1 (to reduce skewing)
Track 1 was the only track in which all the models scored 100. Since this was not observed in any of the other tracks, it was considered that this track might not accurately reflect real-world transcription processing and could be skewing the data.
The same analysis was then performed on the remaining nine tracks.
Average Score per Model (Tracks 2–10)
Model | Average Score |
---|---|
base | 38.24 |
small | 66.93 |
medium | 61.11 |
large-v3 | 65.22 |
Top scorer: 🔸 small
Lowest scorer: 🔹 base
Average Duration per Model
Model | Average Duration (seconds) |
---|---|
base | 6.96 |
small | 16.54 |
medium | 45.27 |
large-v3 | 88.62 |
Conclusions
- Large-v3 is not always the best performer, despite having the longest processing duration.
- The base model's performance is consistently poor.
- Some tracks (e.g., 6, 9, 10) are challenging for all models. (It would be interesting to understand what factors cause this.)