Worksheet: Code Generation/Review Evaluation

Assessing LLM-generated code for correctness, quality, and safety
Course: Natural Language Annotation for Machine Learning
Task Type: Multi-dimensional evaluation
Author: Jin Zhao

Background

LLMs can generate code from natural language prompts. Evaluating this code requires assessing multiple dimensions beyond just "does it work?"

Functional Correctness: Does the code produce correct output for all inputs?

Code Quality: Is the code readable, maintainable, and well-structured?

Safety: Does the code avoid security vulnerabilities and harmful patterns?

Evaluation Dimensions

  • Correctness: logic errors, edge cases, bugs
  • Efficiency: time/space complexity, performance
  • Security: vulnerabilities, unsafe patterns
  • Style: naming, formatting, idioms
  • Documentation: comments, docstrings, clarity

Part 1: Basic Correctness

Prompt:

"Write a Python function to check if a string is a palindrome."

def is_palindrome(s):
    return s == s[::-1]
Question 1

Evaluate this code:
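One way to probe correctness is to run a few test cases, including inputs the prompt leaves ambiguous. The calls below are illustrative, assuming the function exactly as written:

```python
def is_palindrome(s):
    return s == s[::-1]

# Simple cases behave as expected.
print(is_palindrome("racecar"))       # True
print(is_palindrome("hello"))         # False

# Edge cases the prompt does not pin down:
print(is_palindrome(""))              # True -- is an empty string a palindrome?
print(is_palindrome("Racecar"))       # False -- comparison is case-sensitive
print(is_palindrome("a man a plan"))  # False -- spaces are not ignored
```

Whether the last three results count as bugs depends on how you interpret the prompt, which is itself an evaluation decision worth recording.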

Part 2: Comparing Solutions

Prompt:

"Write a function to find the nth Fibonacci number."

Solution A:

def fib(n):
    if n <= 1:
        return n
    return fib(n-1) + fib(n-2)

Solution B:

def fib(n):
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
Question 2

Compare the two solutions:

Dimension     | Solution A | Solution B
Correctness   |            |
Efficiency    |            |
Readability   |            |
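Before filling in the efficiency row, it can help to see the difference concretely. The sketch below counts recursive calls rather than measuring wall-clock time (an illustrative choice; `fib_a` is Solution A instrumented with a counter, `fib_b` is Solution B):

```python
def fib_a(n, counter):
    # Solution A with a call counter added for instrumentation.
    counter[0] += 1
    if n <= 1:
        return n
    return fib_a(n - 1, counter) + fib_a(n - 2, counter)

def fib_b(n):
    # Solution B unchanged: a single linear pass.
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

calls = [0]
fib_a(20, calls)
print(calls[0])    # 21891 recursive calls for n = 20: exponential growth
print(fib_b(20))   # 6765, computed with about 20 loop iterations
```

Both solutions return the same values, so the correctness row should come out equal; the efficiency gap only appears once you look at the work done.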

Part 3: Security Issues

Prompt:

"Write a Python function to run a user-provided shell command."

import subprocess

def run_command(cmd):
    result = subprocess.run(cmd, shell=True, capture_output=True)
    return result.stdout.decode()
Question 3

Identify security concerns:
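For reference while listing concerns: `shell=True` hands the string to the system shell, so input like `"ls; rm -rf /"` runs both commands (shell injection). One commonly suggested mitigation is to avoid the shell entirely; the sketch below uses `shlex.split`, which is one possible approach rather than the only fix, and it does not address every concern (e.g., which commands should be allowed at all):

```python
import shlex
import subprocess

def run_command(cmd):
    # With shell=False (the default), the program is executed directly,
    # so shell metacharacters like ';' and '|' become literal arguments
    # instead of being interpreted by a shell.
    args = shlex.split(cmd)
    result = subprocess.run(args, capture_output=True, check=False)
    return result.stdout.decode()

print(run_command("echo hello; echo injected"))
# Prints "hello; echo injected" as literal text: the second command never runs.
```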

Part 4: Following Instructions

Prompt:

"Write a function to sort a list of numbers in ascending order. Do not use the built-in sort function."

def sort_numbers(nums):
    return sorted(nums)
Question 4

How should this be evaluated?
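For comparison when judging instruction-following: a response that actually satisfies the constraint might look like the insertion-sort sketch below, one of many possible non-built-in approaches (the choice of insertion sort is illustrative):

```python
def sort_numbers(nums):
    # Insertion sort: neither list.sort() nor sorted() is used.
    result = list(nums)  # copy so the caller's list is not mutated
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift larger elements right to make room for key.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result

print(sort_numbers([3, 1, 2]))  # [1, 2, 3]
```

Note that the original answer produces correct output, so the evaluation question is how heavily to weigh the violated constraint against functional correctness.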

Part 5: Code Quality

Prompt:

"Write a function to calculate the average of a list of numbers."

def f(x):
    t=0
    for i in x:t+=i
    return t/len(x)
Question 5

This code is functionally correct. Rate its quality:

Aspect            | Rating
Variable naming   |
Formatting/style  |
Error handling    |
Documentation     |

Part 6: Evaluating Explanations

Prompt:

"Explain what this code does and fix any bugs:"

def binary_search(arr, target):
    left, right = 0, len(arr)
    while left < right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid
    return -1

LLM Response:

"This function performs binary search on a sorted array. It has a bug: the initial value of right should be len(arr) - 1, not len(arr). This could cause an index out of bounds error."

Question 6

Evaluate the LLM's analysis:
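One way to evaluate the claim is to test the original code empirically before accepting the stated bug. The brute-force harness below is an illustrative check, comparing binary search against a linear scan on small sorted arrays:

```python
def binary_search(arr, target):
    left, right = 0, len(arr)
    while left < right:
        mid = (left + right) // 2  # left < right, so mid is a valid index
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid
    return -1

# Exhaustively compare against a linear scan, including absent targets.
for n in range(6):
    arr = list(range(0, 2 * n, 2))  # sorted, distinct: [0, 2, 4, ...]
    for target in range(-1, 2 * n + 1):
        expected = arr.index(target) if target in arr else -1
        assert binary_search(arr, target) == expected
print("no failures found")
```

Does the empirical result match the LLM's description of the bug? Consider which interval convention (closed vs. half-open) the loop condition `left < right` implies.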

Part 7: Group Discussion

Question 7

Compare your evaluations with your group. Where did you disagree?

Part 8: Reflection

Question 8

Why is LLM code evaluation difficult?

Key Takeaway

Code evaluation is multi-dimensional and context-dependent.

  • "Correct" code may still be poor quality, insecure, or inefficient
  • Prompt ambiguity leads to legitimate interpretation differences
  • Security and safety concerns may override functional correctness
  • Evaluation requires domain expertise and clear rubrics
  • Instruction-following and output quality are both important