BERT [CLS] Token - Study Notes

BERT Output Structure

  • BERT outputs hidden representations for each input token position
  • Example: [CLS] hello world [SEP] → 4 hidden vectors (one per token)
  • Each position gets a contextual representation
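
A minimal sketch of this, assuming the Hugging Face transformers library and PyTorch (a common stack, not one these notes specify):

```python
# Sketch: BERT returns one contextual hidden vector per token position.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The tokenizer adds [CLS] and [SEP] automatically.
inputs = tokenizer("hello world", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
# ['[CLS]', 'hello', 'world', '[SEP]']

with torch.no_grad():
    outputs = model(**inputs)

# One vector per token: (batch=1, seq_len=4, hidden=768)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 4, 768])
```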

What is the [CLS] Token?

  • Purpose: Designated position for sequence-level information aggregation
  • Mechanism: Uses self-attention to “see” and combine info from all other tokens
  • Design: Has no inherent meaning, so it’s free to learn task-specific representations
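
To make the aggregation mechanism concrete, a hedged sketch on the same assumed transformers/PyTorch stack: ask BERT to return its attention weights and read off how much [CLS] (position 0) attends to every token.

```python
# Sketch: inspect how [CLS] attends to all other tokens via self-attention.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
# Row 0 of the last layer = attention paid BY [CLS] to each position;
# averaging over heads here is a simplification for readability.
cls_attention = outputs.attentions[-1][0].mean(dim=0)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, weight in zip(tokens, cls_attention):
    print(f"{token:>6}: {weight.item():.3f}")

# The [CLS] hidden vector itself is just position 0 of the output:
cls_vector = outputs.last_hidden_state[:, 0, :]  # shape (1, 768)
```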

Key Point: Cannot Use [CLS] Directly

The pre-trained [CLS] won’t work out of the box for your task

  • Pre-trained [CLS] is optimized for BERT’s “Next Sentence Prediction” (NSP) objective
  • NOT optimized for sentiment analysis, topic classification, or other downstream tasks
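
For context, a sketch of what the pre-trained [CLS] was actually trained to serve, using transformers’ BertForNextSentencePrediction (same assumed stack):

```python
# Sketch: the pre-trained NSP head reads only the [CLS] representation
# (via BERT's pooler), which is why [CLS] is tuned for NSP and nothing else.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# Sentence pair: does B plausibly follow A?
inputs = tokenizer("The weather was sunny.", "We went for a walk.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2)

# Index 0 = "B follows A", index 1 = "B is random".
print(torch.softmax(logits, dim=-1))
```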

How to Use [CLS] Properly

Fine-tuning is required

  1. Add a task-specific head:

    • Take the [CLS] hidden state → feed it to a classification layer (linear + softmax)
  2. Fine-tune the entire model, updating both:

    • The new classification head (randomly initialized)
    • The pre-trained BERT parameters (including how [CLS] aggregates)
  3. Result: [CLS] learns to aggregate info relevant to YOUR specific task (see the sketch below)
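
A minimal fine-tuning sketch under the same assumed stack. BertForSequenceClassification bundles steps 1 and 2: a randomly initialized linear head on top of [CLS], with all BERT weights trainable. The texts/labels below are toy placeholders, not real data:

```python
# Sketch: fine-tune BERT + a new [CLS] classification head end to end.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Adds a randomly initialized classification head on top of [CLS].
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

texts = ["great movie", "terrible movie"]  # placeholder data
labels = torch.tensor([1, 0])              # 1 = positive, 0 = negative

inputs = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few toy steps; real fine-tuning needs a full dataset
    optimizer.zero_grad()
    loss = model(**inputs, labels=labels).loss  # cross-entropy on the head
    loss.backward()  # gradients flow through the head AND all of BERT
    optimizer.step()
```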

Summary

  • [CLS] = learnable sequence-level aggregation point
  • Pre-trained [CLS] ≠ ready for your task
  • Fine-tuning required to make [CLS] useful for classification