
David & Henning's Master Thesis on quantization of LLMs for integer-only hardware

Softmax Implementation for Integer-Only Hardware for LLM Inferencing

Test quantized softmax implementations in a real transformer model

Test Configuration

Select the transformer model to use for testing
Single prompt for quick tests, batch for comprehensive evaluation
Tip: The default country changes randomly on each page load
Higher values provide more comprehensive error statistics but take longer

Advanced Configuration

Configure exp(x) range and parameters for different methods

Configure the exp(x) range for different method types. Single-table methods are limited by int16 precision.

Tradeoff (single-table methods): A higher xmax covers more extreme values but spreads the table's quantization precision thinner, causing larger errors on typical small values. A lower xmax gives better precision but clips extreme outliers.
Recommendation: Use an xmax of 6-10 for balanced coverage. Single tables struggle with wide ranges; use DIGmax for xmax > 10.
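
The single-table tradeoff can be illustrated with a minimal sketch. It assumes a 256-entry int16 table for exp(-x) on [0, xmax] with nearest-entry lookup; the table size, the scale of 32767, and all function names are illustrative assumptions, not the implementation used in the thesis.

```python
import numpy as np

def build_exp_table(xmax, entries=256, scale=32767):
    """Quantize exp(-x) on [0, xmax] into `entries` int16 values (assumed layout)."""
    xs = np.linspace(0.0, xmax, entries)
    return np.round(np.exp(-xs) * scale).astype(np.int16)

def lookup_exp(table, x, xmax, scale=32767):
    """Approximate exp(-x) by nearest-entry lookup; inputs beyond xmax are clipped."""
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, xmax)
    idx = np.round(x / xmax * (len(table) - 1)).astype(np.int64)
    return table[idx].astype(np.float64) / scale

def max_abs_error(xmax, lo=0.0, hi=2.0, samples=2001):
    """Largest absolute error on the 'typical' input range [lo, hi]."""
    table = build_exp_table(xmax)
    xs = np.linspace(lo, hi, samples)
    return float(np.max(np.abs(lookup_exp(table, xs, xmax) - np.exp(-xs))))

if __name__ == "__main__":
    for xmax in (6.0, 10.0, 20.0):
        print(f"xmax={xmax:5.1f}  max |error| on [0, 2]: {max_abs_error(xmax):.5f}")
```

With a fixed number of entries, a larger xmax widens the spacing between entries, so the error on small inputs grows, while a smaller xmax clips every input beyond the table's range to the last entry.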
Tradeoff (multi-table methods): A higher xmax accommodates extreme outliers, with multiple tables adapting precision to each part of the range. A lower xmax reduces memory and improves precision for typical ranges.
Recommendation: Use an xmax of 20-40 for robust handling of diverse attention patterns. DIGmax excels at wide ranges, so a higher xmax adds safety without major accuracy loss.
Tradeoff (polynomial order): Higher orders (6-7) improve accuracy for large x but require more multiply-accumulate operations and risk numerical overflow. Lower orders (2-3) are faster but only accurate for small x values.
Recommendation: Order 5 balances accuracy and computational cost. Use order 3-4 for ultra-low-power targets and order 6-7 for research comparisons.
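
The order tradeoff can be sketched with a Taylor polynomial of exp(-x) evaluated by Horner's rule, which costs one multiply-accumulate per order. The choice of a Taylor expansion, the floating-point arithmetic, and the function name are assumptions made for illustration; the page does not state which polynomial the tested methods use.

```python
import math

def exp_neg_taylor(x, order):
    """Evaluate the order-`order` Taylor polynomial of exp(-x) with Horner's rule."""
    # Coefficients c_k = (-1)^k / k!, consumed from the highest degree down.
    acc = (-1.0) ** order / math.factorial(order)
    for k in range(order - 1, -1, -1):
        # One multiply-accumulate per remaining coefficient.
        acc = acc * x + (-1.0) ** k / math.factorial(k)
    return acc

if __name__ == "__main__":
    for order in (3, 5, 7):
        errs = [abs(exp_neg_taylor(x, order) - math.exp(-x)) for x in (0.5, 2.0)]
        print(f"order={order}  |error| at x=0.5: {errs[0]:.2e}   at x=2.0: {errs[1]:.2e}")
```

In this sketch a low order is already close for small x but degrades quickly as x grows. An integer-only version would additionally have to keep every intermediate product within the accumulator width, which is where the overflow risk at higher orders comes from.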
Tradeoff: More tables provide finer-grained range adaptation, improving accuracy across diverse attention patterns but increasing memory footprint. Fewer tables reduce memory but force precision compromises.
Recommendation: 6 tables (~3KB) is optimal for embedded systems. Use 8-12 for best accuracy, 3-4 for extreme memory constraints.
Tradeoff: A linear distribution of table ranges lacks exponential adaptation, so it needs many more tables (256+) to match the accuracy of 6 logarithmic tables. Fewer tables severely degrade precision. More tables approach logarithmic accuracy but waste memory.
Note: Exponential/log distribution is generally superior. Use linear only for controlled benchmarking or specific hardware constraints.
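
The difference between linear and logarithmic table distribution can be sketched as follows. This is not the DIGmax implementation, which the page does not describe. It is a hypothetical multi-table scheme in which each table holds 256 int16 entries normalized to the value at its own left edge, and the logarithmic layout doubles the width of each successive range. Under that assumed entry count, 6 tables occupy 6 × 256 × 2 = 3072 bytes, which is consistent with the ~3KB figure above.

```python
import numpy as np

SCALE = 32767  # largest int16 value

def boundaries(xmax, tables, spacing):
    """Return tables+1 range edges on [0, xmax], linear or power-of-two spaced."""
    if spacing == "linear":
        return np.linspace(0.0, xmax, tables + 1)
    # Logarithmic: each range is twice as wide as the one before it.
    edges = xmax / 2.0 ** np.arange(tables - 1, -1, -1)
    return np.concatenate(([0.0], edges))

def build_tables(edges, entries=256):
    """One int16 table per range, each normalized to its own left-edge value."""
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        xs = np.linspace(lo, hi, entries)
        out.append(np.round(np.exp(-(xs - lo)) * SCALE).astype(np.int16))
    return out

def lookup(tables, edges, x):
    """Approximate exp(-x) for a scalar x using nearest-entry lookup."""
    x = min(max(float(x), 0.0), float(edges[-1]))
    i = min(int(np.searchsorted(edges, x, side="right")) - 1, len(tables) - 1)
    lo, hi = edges[i], edges[i + 1]
    idx = int(round((x - lo) / (hi - lo) * (len(tables[i]) - 1)))
    # Float rescale for readability; integer hardware would use a per-table shift or multiplier.
    return float(tables[i][idx]) / SCALE * float(np.exp(-lo))

def max_abs_error(xmax, tables, spacing, samples=4001):
    """Largest absolute error over evenly spaced inputs in [0, xmax]."""
    edges = boundaries(xmax, tables, spacing)
    tabs = build_tables(edges)
    xs = np.linspace(0.0, xmax, samples)
    return max(abs(lookup(tabs, edges, x) - np.exp(-x)) for x in xs)

if __name__ == "__main__":
    for spacing in ("linear", "logarithmic"):
        print(f"{spacing:12s} 6 tables, xmax=32: max |error| = {max_abs_error(32.0, 6, spacing):.5f}")
```

In this sketch the logarithmic layout gives the narrowest ranges, and therefore the finest entry spacing, to small x, where exp(-x) is largest and changes fastest. The linear layout spends the same number of entries on every range, including those where exp(-x) is already close to zero.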

Select Implementations to Test

Click cards to select/deselect implementations. Each will run separately and results will be compared.