Teaching
Course Materials
COSI-231A: Statistical Approaches to Natural Language Processing
Brandeis University, Fall 2024
I designed programming assignments for this graduate-level NLP course. Below is a sample assignment where students implement core Transformer components from scratch.
Programming Assignment: Transformer Encoder for Discourse Relation Classification
Course: COSI-231A Statistical Approaches to Natural Language Processing
Term: Fall 2024
Due Date: November 26, 2024
Points: 100
Overview
In this assignment, you will implement a simplified Transformer encoder from scratch and apply it to discourse relation classification using the Penn Discourse Treebank (PDTB) dataset. By completing this assignment, you will gain hands-on experience with the core components of the Transformer architecture, including:
- Sinusoidal positional embeddings for encoding sequential information
- Multi-head self-attention mechanism
- Feedforward neural networks within the Transformer block
Understanding these fundamental building blocks is essential for working with modern NLP models such as BERT, GPT, and their variants.
Learning Objectives
Upon successful completion of this assignment, students will be able to:
- Explain the role of positional embeddings in Transformer architectures and implement sinusoidal positional encoding
- Implement the scaled dot-product attention mechanism and extend it to multi-head attention
- Build a complete Transformer encoder block and apply it to a text classification task
- Analyze the impact of architectural choices (e.g., number of attention heads, positional embeddings) on model performance
- Conduct systematic experiments and report findings in a clear, scientific manner
Background: Discourse Relation Classification
Discourse relations describe how two text segments (arguments) are logically connected. For example:
| Arg1 | Arg2 | Relation |
|---|---|---|
| “It was raining heavily” | “We decided to stay home” | Cause-Result |
| “The company profits increased” | “However, employee satisfaction dropped” | Contrast |
The PDTB corpus annotates such relations between adjacent text spans in Wall Street Journal articles. In this assignment, we classify relations at Level 2 of the PDTB sense hierarchy, which includes categories such as:
- Comparison.Contrast
- Contingency.Cause.Reason
- Contingency.Cause.Result
- Expansion.Conjunction
- Expansion.Instantiation
- Temporal.Asynchronous
- And others…
Dataset
The PDTB dataset is provided in JSON Lines format with train/dev/test splits:
data/pdtb/
├── train.json # Training set
├── dev.json # Development set (for hyperparameter tuning)
└── test.json # Test set (for final evaluation)
Each line in the JSON files contains a discourse relation with the following fields:
| Field | Description |
|---|---|
| `Arg1` | First argument text span |
| `Arg2` | Second argument text span |
| `Connective` | Discourse connective (may be empty for implicit relations) |
| `Sense` | Discourse relation label(s) |
| `Type` | Relation type (Explicit, Implicit, EntRel, etc.) |
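Reading the splits is straightforward since each line is one JSON object. The sketch below shows one way to load a split, assuming the field names match the table above; the exact value types (plain strings vs. nested objects) depend on the release you are given, so adjust as needed.

```python
import json

def load_pdtb_split(path):
    """Load one JSON Lines split (e.g., data/pdtb/train.json) into a list of dicts."""
    relations = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                # Each record should carry Arg1, Arg2, Connective, Sense, and Type.
                relations.append(json.loads(line))
    return relations
```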
Project Structure
PA4_starter_code/
├── README.md # This file
├── data/
│ └── pdtb/ # PDTB dataset
│ ├── train.json
│ ├── dev.json
│ └── test.json
└── starter_src/
├── run.py # Main training script
├── model.py # Model architecture (TO IMPLEMENT)
└── corpus.py # Data loading utilities (TO IMPLEMENT)
Implementation Tasks (50 points total)
Task 1: Data Loading (corpus.py) — 15 points
Complete the PDTBDataset class to create a PyTorch Dataset for the PDTB data.
Requirements:
- Implement `__init__()`: Store relations, vocabulary, max length, and label mapping
- Implement `__len__()`: Return the number of samples
- Implement `__getitem__()` (see the sketch below):
  - Concatenate Arg1, Arg2, and Connective with special tokens: `[SOS] + Arg1 + [SEP] + Arg2 + [SEP] + Connective + [EOS]`
  - Encode tokens to indices using the provided vocabulary
  - Pad or truncate sequences to `max_len`
  - Return a `(label_tensor, text_tensor)` tuple
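As a reference point, here is a minimal sketch of the class. It assumes whitespace tokenization, a `vocab` dict that maps tokens (including `[SOS]`, `[SEP]`, `[EOS]`, `[PAD]`, `[UNK]`) to indices, and that `Sense` may hold either a single label or a list of labels; the starter code's actual interfaces take precedence.

```python
import torch
from torch.utils.data import Dataset

class PDTBDataset(Dataset):
    """Sketch only: follow the starter code's docstrings where they differ."""

    def __init__(self, relations, vocab, max_len, label2idx):
        self.relations = relations    # list of dicts with Arg1/Arg2/Connective/Sense
        self.vocab = vocab            # token -> index, including special tokens
        self.max_len = max_len
        self.label2idx = label2idx    # sense label -> class index

    def __len__(self):
        return len(self.relations)

    def __getitem__(self, idx):
        rel = self.relations[idx]
        # [SOS] + Arg1 + [SEP] + Arg2 + [SEP] + Connective + [EOS]
        tokens = (["[SOS]"] + rel["Arg1"].split()
                  + ["[SEP]"] + rel["Arg2"].split()
                  + ["[SEP]"] + rel["Connective"].split()
                  + ["[EOS]"])
        ids = [self.vocab.get(tok, self.vocab["[UNK]"]) for tok in tokens]
        # Pad or truncate to max_len.
        ids = ids[: self.max_len]
        ids += [self.vocab["[PAD]"]] * (self.max_len - len(ids))
        # Sense may be annotated with multiple labels; take the first if so.
        sense = rel["Sense"][0] if isinstance(rel["Sense"], list) else rel["Sense"]
        label = self.label2idx[sense]
        return torch.tensor(label), torch.tensor(ids)
```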
Task 2: Sinusoidal Positional Embedding (model.py) — 10 points
Implement the forward() method of the PositionalEmbedding class.
Mathematical Formulation:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Implementation Steps:
- Compute `half_dim = hidden_size // 2`
- Create frequency decay factors using exponential decay
- Multiply positions by frequencies to get phase values
- Apply sine to even dimensions and cosine to odd dimensions
- Concatenate to form the full positional embedding (a sketch follows this list)
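The following sketch mirrors the steps above. The constructor arguments and the sin/cos concatenation layout (first half sine, second half cosine, rather than strict interleaving) are assumptions; the starter code's docstrings take precedence.

```python
import math
import torch
import torch.nn as nn

class PositionalEmbedding(nn.Module):
    """Sketch of sinusoidal positional encoding; interface details are assumptions."""

    def __init__(self, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

    def forward(self, seq_len):
        half_dim = self.hidden_size // 2
        # Frequencies decay exponentially: 1 / 10000^(2i / d_model) for i = 0..half_dim-1.
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half_dim) / half_dim)
        positions = torch.arange(seq_len).float()
        # Outer product gives a (seq_len, half_dim) matrix of phase values pos * freq.
        phases = positions.unsqueeze(1) * freqs.unsqueeze(0)
        # Sine for one half, cosine for the other, concatenated to d_model.
        pe = torch.cat([torch.sin(phases), torch.cos(phases)], dim=-1)
        return pe  # shape: (seq_len, hidden_size)
```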
Task 3: Multi-Head Self-Attention (model.py) — 15 points
Implement the forward() method of the MultiHeadSelfAttention class.
Implementation Steps:
- Apply linear transformations to queries, keys, and values
- Reshape tensors to split the embedding dimension across multiple heads
- Compute scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / √d_k) V
- Apply the attention mask to prevent attending to padding tokens
- Compute the weighted sum of values using the attention weights
- Concatenate heads and apply a final linear projection (a sketch follows the note below)
Important: You must implement this from scratch. Using torch.nn.MultiheadAttention is not allowed.
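As a rough guide, the sketch below builds the mechanism from plain `nn.Linear` layers only, consistent with the from-scratch requirement. The constructor arguments and the mask convention (`True` for real tokens) are assumptions; follow the starter code's interfaces where they differ.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch only; the starter class's constructor and mask convention may differ."""

    def __init__(self, hidden_size, num_heads):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, hidden_size); mask: (batch, seq_len), True for real tokens.
        b, t, _ = x.shape

        def split_heads(tensor):
            # (batch, seq_len, hidden) -> (batch, heads, seq_len, head_dim)
            return tensor.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))

        # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        if mask is not None:
            # Block attention to padding positions.
            scores = scores.masked_fill(~mask[:, None, None, :], float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        context = weights @ v  # (batch, heads, seq_len, head_dim)

        # Merge heads back and apply the final projection.
        context = context.transpose(1, 2).contiguous().view(b, t, -1)
        return self.out_proj(context)
```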
Task 4: Transformer Encoder (model.py) — 10 points
Implement the forward() method of the MiniTransformerEncoder class.
Implementation Steps:
- Embed input tokens using the embedding layer
- Generate and add positional embeddings
- Create attention mask for padding tokens (True for real tokens, False for padding)
- Apply multi-head self-attention
- Pass through feedforward block
- Apply final classification layer
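A hedged end-to-end sketch, reusing the `PositionalEmbedding` and `MultiHeadSelfAttention` sketches above, is shown below. The feedforward width (4x hidden size) and mean-pooling over non-padding positions are assumptions; the starter code may, for example, classify from the `[SOS]` position instead.

```python
import torch
import torch.nn as nn

class MiniTransformerEncoder(nn.Module):
    """Sketch; assumes the PositionalEmbedding and MultiHeadSelfAttention sketches above."""

    def __init__(self, vocab_size, hidden_size, num_heads, num_classes, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.embed = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_idx)
        self.pos_emb = PositionalEmbedding(hidden_size)
        self.attn = MultiHeadSelfAttention(hidden_size, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),  # 4x width is an assumption
            nn.ReLU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, tokens):
        # tokens: (batch, seq_len) of vocabulary indices.
        x = self.embed(tokens)
        x = x + self.pos_emb(tokens.size(1)).to(x.device)   # add positional information
        mask = tokens != self.pad_idx                        # True for real tokens
        x = self.attn(x, mask)                               # multi-head self-attention
        x = self.ffn(x)                                      # position-wise feedforward
        # Mean-pool over non-padding positions before classification (one of several options).
        denom = mask.sum(dim=1, keepdim=True).clamp(min=1)
        pooled = (x * mask.unsqueeze(-1)).sum(dim=1) / denom
        return self.classifier(pooled)
```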
Experiments (50 points total)
Experimental analysis is a critical component of this assignment. You must conduct the following experiments, report quantitative results, and provide thoughtful analysis.
Experiment 1: Baseline Without Positional Embeddings — 15 points
Train the model without positional embeddings to establish a baseline.
Deliverables:
- Report training curves (loss over epochs)
- Report final accuracy on dev and test sets
- Provide qualitative analysis:
- Why might position information be important for discourse relation classification?
- What patterns might the model miss without positional information?
- Examine specific examples where the model fails
| Metric | Dev Set | Test Set |
|---|---|---|
| Accuracy | ___ % | ___ % |
| Loss | ___ | ___ |
Experiment 2: With Positional Embeddings — 15 points
Train the model with sinusoidal positional embeddings and compare against the baseline.
Deliverables:
- Report training curves (overlay with Experiment 1 for comparison)
- Report final accuracy on dev and test sets
- Provide comparative analysis:
- Quantify the improvement from adding positional embeddings
- Explain why positional embeddings help for this specific task
- Discuss whether certain discourse relation types benefit more than others
- Visualize attention patterns if possible
| Metric | Without Pos. Emb. | With Pos. Emb. | Improvement |
|---|---|---|---|
| Dev Accuracy | ___ % | ___ % | +___ % |
| Test Accuracy | ___ % | ___ % | +___ % |
Experiment 3: Varying Number of Attention Heads — 20 points
Investigate how the number of attention heads affects model performance.
Requirements:
- Test at least 4 different configurations (e.g., heads = 1, 2, 4, 8)
- Keep all other hyperparameters constant for fair comparison
- Run each configuration for the same number of epochs
Deliverables:
- Table comparing all configurations
- Training curves for each configuration
- Analysis addressing:
- What is the optimal number of heads for this task?
- Is there a point of diminishing returns?
- How does head count affect training speed/stability?
- Theoretical explanation: What do multiple heads capture that a single head cannot?
| Heads | Parameters | Dev Accuracy | Test Accuracy | Training Time |
|---|---|---|---|---|
| 1 | ___ | ___ % | ___ % | ___ min |
| 2 | ___ | ___ % | ___ % | ___ min |
| 4 | ___ | ___ % | ___ % | ___ min |
| 8 | ___ | ___ % | ___ % | ___ min |
Bonus Experiment (Optional) — Up to 10 extra points
Conduct additional experiments of your choosing. Examples:
- Different learning rate schedules
- Layer normalization placement (pre-norm vs. post-norm)
- Different activation functions
- Attention visualization and interpretation
- Error analysis by discourse relation type
- Comparison with different sequence lengths
Hyperparameters
Default hyperparameters are provided as a starting point:
| Parameter | Default Value |
|---|---|
| Learning Rate | 1e-4 |
| Epochs | 25-50 |
| Batch Size | 128 |
| Max Sequence Length | 60 |
| Hidden Size | 256 |
| Attention Heads | 2 |
| Sense Level | 2 (second level of PDTB hierarchy) |
You are encouraged to tune these parameters and report your findings.
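For reference, the defaults could be collected into a single config object like the sketch below. This layout is purely illustrative; `run.py` may expose these values differently (e.g., as argparse flags).

```python
# Values mirror the defaults in the table above; the dict itself is an assumed layout.
DEFAULT_CONFIG = {
    "learning_rate": 1e-4,
    "epochs": 30,          # anywhere in the suggested 25-50 range
    "batch_size": 128,
    "max_len": 60,
    "hidden_size": 256,
    "num_heads": 2,
    "sense_level": 2,      # second level of the PDTB sense hierarchy
}
```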
Submission Requirements
Submit the following files:
- Code (50 points)
  - `corpus.py` — Completed data loading implementation
  - `model.py` — Completed model implementation
  - `run.py` — Training script (modify as needed)
  - All code should be well-commented and runnable
- Experimental Report (50 points)
- PDF document (4-6 pages recommended) containing:
- Implementation Notes (5 pts): Brief description of your approach and any design decisions
- Experiment 1 Results & Analysis (15 pts): Baseline without positional embeddings
- Experiment 2 Results & Analysis (15 pts): With positional embeddings + comparison
- Experiment 3 Results & Analysis (20 pts): Attention head ablation study
- Conclusion (5 pts): Summary of findings, challenges, and lessons learned
- Bonus experiments (up to 10 extra pts): Additional investigations
Evaluation Criteria
Code (50 points)
| Component | Points | Criteria |
|---|---|---|
| `PDTBDataset` class | 15 | Correct implementation of `__init__`, `__len__`, `__getitem__`; proper tokenization and padding |
| `PositionalEmbedding` | 10 | Correct sinusoidal encoding formula; proper tensor shapes |
| `MultiHeadSelfAttention` | 15 | Correct attention computation, scaling, masking, and multi-head split/merge |
| `MiniTransformerEncoder` | 10 | Proper integration of all components; correct forward pass |
Experiments & Report (50 points)
| Component | Points | Criteria |
|---|---|---|
| Experiment 1 | 15 | Complete baseline results; thoughtful analysis of position importance |
| Experiment 2 | 15 | Clear comparison with baseline; insightful discussion of positional embedding benefits |
| Experiment 3 | 20 | Thorough ablation with 4+ configurations; quantitative comparison; theoretical justification |
| Report Quality | — | Clear writing, professional formatting, reproducible results (embedded in above scores) |
Bonus (up to 10 extra points)
| Component | Points | Criteria |
|---|---|---|
| Additional experiments | 5-10 | Novel experiments with meaningful analysis and insights |
Getting Started
Environment Setup
# Required packages
pip install torch numpy tqdm
Running the Code
cd starter_src
# Update data path in run.py
python run.py
Tips for Success
- Start early — Debugging neural network code takes time
- Test incrementally — Verify each component before integrating
- Use small batches first — Debug with batch_size=2 to catch shape errors (see the shape-check sketch after this list)
- Monitor training — Plot loss curves to diagnose training issues
- Read the docstrings — Detailed guidance is provided in the starter code
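For the small-batch tip above, a quick shape check like the following can catch most dimension bugs before a full training run. It uses the hypothetical constructor signature from the sketches earlier; adjust to your own `model.py`.

```python
import torch

# Tiny fake batch (batch_size=2); the class count 12 is a placeholder.
vocab_size, hidden_size, num_heads, num_classes, max_len = 100, 256, 2, 12, 60
model = MiniTransformerEncoder(vocab_size, hidden_size, num_heads, num_classes)
dummy_tokens = torch.randint(1, vocab_size, (2, max_len))  # avoid index 0 (padding)
logits = model(dummy_tokens)
print(logits.shape)  # expect: torch.Size([2, 12]) with these placeholder sizes
```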
Academic Integrity
This is an individual assignment. You may:
- Discuss concepts and approaches with classmates
- Reference PyTorch documentation and tutorials
- Use the provided starter code
You may not:
- Share or copy code with/from other students
- Use pre-built Transformer implementations (e.g., HuggingFace)
- Use code from online sources without attribution
Resources
- Attention Is All You Need (Vaswani et al., 2017) — Original Transformer paper
- The Illustrated Transformer — Visual explanation
- PyTorch Documentation