
Attention Is All You Need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, et al.
Venue: NIPS 2017
Status: Read āœ“
Last Updated: July 2025
Tags: AI, ML, NLP

šŸ“‹ Summary

This paper introduces the Transformer architecture, which relies entirely on attention mechanisms, dispensing with recurrence and convolutions. The model achieves state-of-the-art results on machine translation tasks while being more parallelizable and requiring significantly less time to train.

šŸ”‘ Key Contributions

šŸ—ļø Architecture Details

Encoder-Decoder Structure

The Transformer follows an encoder-decoder architecture. The encoder maps an input sequence to a sequence of continuous representations, which the decoder uses to generate an output sequence one element at a time.
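To keep the data flow straight, here is a rough numpy sketch of that loop. `encode` and `decode_step` are stand-ins for the real self-attention / feed-forward stacks, and every tensor is random toy data, so this only illustrates the encoder-decoder interface and the auto-regressive generation loop, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 1000
BOS, EOS = 1, 2

embed = rng.normal(size=(vocab_size, d_model))   # toy embedding table
W_out = rng.normal(size=(d_model, vocab_size))   # toy output projection

def encode(src_emb):
    # Stand-in for the 6-layer encoder stack: maps the embedded input
    # sequence to a same-length sequence of continuous representations.
    return src_emb

def decode_step(memory, out_emb):
    # Stand-in for the decoder: conditions on the encoder output ("memory")
    # and on everything generated so far, then scores the next token.
    context = memory.mean(axis=0) + out_emb.mean(axis=0)
    return context @ W_out                       # next-token logits

src = embed[[5, 42, 7]]          # embedded source sequence (3 tokens)
memory = encode(src)

# Auto-regressive generation: the decoder emits one element at a time,
# feeding its own previous outputs back in.
out_tokens = [BOS]
for _ in range(10):
    logits = decode_step(memory, embed[out_tokens])
    next_token = int(logits.argmax())
    out_tokens.append(next_token)
    if next_token == EOS:
        break
print(out_tokens)
```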

Multi-Head Self-Attention

Instead of performing a single attention function with d_model-dimensional keys, values and queries, the authors linearly project the queries, keys and values h times with different learned projections, apply attention to each projected version in parallel, and concatenate the resulting outputs.

"Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions."

Positional Encoding

Since the model contains no recurrence or convolution, positional encodings are added to the input embeddings to give the model information about token order:
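The paper uses fixed sinusoidal encodings, where pos is the position and i indexes the dimension:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
$$

Each dimension corresponds to a sinusoid of a different wavelength; the authors argue this lets the model attend by relative position, since the encoding at pos + k is a linear function of the encoding at pos.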

šŸ’” Key Insights

Attention mechanisms can completely replace recurrence and convolutions for sequence modeling tasks, which makes the model far more parallelizable and significantly faster to train.

šŸŽÆ Results

šŸ¤” My Thoughts & Analysis

Strengths

Potential Limitations

Impact on the Field

This paper essentially launched the modern era of NLP; nearly all subsequent large language models build directly on the architecture introduced here.

šŸ”— Related Work to Explore

šŸ“ Implementation Notes

Key implementation details to remember:

šŸŽ“ Questions for Further Study

⭐ Personal Rating: 5/5

Groundbreaking paper that fundamentally changed the field. The Transformer architecture introduced here became the foundation for virtually all modern large language models. The clarity of presentation and the empirical results make this a must-read for anyone in AI/ML.


These notes were compiled while reading the paper. They represent my understanding and interpretation of the key concepts and contributions.