Worksheet: Code Generation/Review Evaluation

Assessing LLM-generated code for correctness, quality, and safety
Course: Natural Language Annotation for Machine Learning
Task Type: Multi-dimensional evaluation
Author: Jin Zhao

Background

LLMs can generate code from natural language prompts. Evaluating this code requires assessing multiple dimensions beyond just "does it work?"

Functional Correctness: Does the code produce correct output for all inputs?

Code Quality: Is the code readable, maintainable, and well-structured?

Safety: Does the code avoid security vulnerabilities and harmful patterns?

Evaluation Dimensions

  • Correctness: logic errors, edge cases, bugs
  • Efficiency: time/space complexity, performance
  • Security: vulnerabilities, unsafe patterns
  • Style: naming, formatting, idioms
  • Documentation: comments, docstrings, clarity

Part 1: Basic Correctness

Prompt:

"Write a Python function to check if a string is a palindrome."

def is_palindrome(s):
    return s == s[::-1]
Question 1

Evaluate this code:
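One way to probe correctness is to run a few test cases, including inputs the prompt leaves ambiguous. The calls below are illustrative, assuming the function exactly as written:

```python
def is_palindrome(s):
    return s == s[::-1]

# Simple cases behave as expected.
print(is_palindrome("racecar"))       # True
print(is_palindrome("hello"))         # False

# Edge cases the prompt does not pin down:
print(is_palindrome(""))              # True -- is an empty string a palindrome?
print(is_palindrome("Racecar"))       # False -- comparison is case-sensitive
print(is_palindrome("a man a plan"))  # False -- spaces are not ignored
```

Whether the last three results count as bugs depends on how you interpret the prompt, which is itself an evaluation decision worth recording.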

Part 2: Comparing Solutions

Prompt:

"Write a function to find the nth Fibonacci number."

Solution A:

def fib(n):
    if n <= 1:
        return n
    return fib(n-1) + fib(n-2)

Solution B:

def fib(n):
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
Question 2

Compare the two solutions:

Dimension     | Solution A | Solution B
Correctness   |            |
Efficiency    |            |
Readability   |            |
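Before filling in the efficiency row, it can help to see the difference concretely. The sketch below counts recursive calls rather than measuring wall-clock time (an illustrative choice; `fib_a` is Solution A instrumented with a counter, `fib_b` is Solution B):

```python
def fib_a(n, counter):
    # Solution A with a call counter added for instrumentation.
    counter[0] += 1
    if n <= 1:
        return n
    return fib_a(n - 1, counter) + fib_a(n - 2, counter)

def fib_b(n):
    # Solution B unchanged: a single linear pass.
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

calls = [0]
fib_a(20, calls)
print(calls[0])    # 21891 recursive calls for n = 20: exponential growth
print(fib_b(20))   # 6765, computed with about 20 loop iterations
```

Both solutions return the same values, so the correctness row should come out equal; the efficiency gap only appears once you look at the work done.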

Part 3: Security Issues

Prompt:

"Write a Python function to run a user-provided shell command."

import subprocess

def run_command(cmd):
    result = subprocess.run(cmd, shell=True, capture_output=True)
    return result.stdout.decode()
Question 3

Identify security concerns:
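For reference while listing concerns: `shell=True` hands the string to the system shell, so input like `"ls; rm -rf /"` runs both commands (shell injection). One commonly suggested mitigation is to avoid the shell entirely; the sketch below uses `shlex.split`, which is one possible approach rather than the only fix, and it does not address every concern (e.g., which commands should be allowed at all):

```python
import shlex
import subprocess

def run_command(cmd):
    # With shell=False (the default), the program is executed directly,
    # so shell metacharacters like ';' and '|' become literal arguments
    # instead of being interpreted by a shell.
    args = shlex.split(cmd)
    result = subprocess.run(args, capture_output=True, check=False)
    return result.stdout.decode()

print(run_command("echo hello; echo injected"))
# Prints "hello; echo injected" as literal text: the second command never runs.
```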

Part 4: Following Instructions

Prompt:

"Write a function to sort a list of numbers in ascending order. Do not use the built-in sort function."

def sort_numbers(nums):
    return sorted(nums)
Question 4

How should this be evaluated?
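For comparison when judging instruction-following: a response that actually satisfies the constraint might look like the insertion-sort sketch below, one of many possible non-built-in approaches (the choice of insertion sort is illustrative):

```python
def sort_numbers(nums):
    # Insertion sort: neither list.sort() nor sorted() is used.
    result = list(nums)  # copy so the caller's list is not mutated
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift larger elements right to make room for key.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result

print(sort_numbers([3, 1, 2]))  # [1, 2, 3]
```

Note that the original answer produces correct output, so the evaluation question is how heavily to weigh the violated constraint against functional correctness.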

Part 5: Code Quality

Prompt:

"Write a function to calculate the average of a list of numbers."

def f(x):
    t=0
    for i in x:t+=i
    return t/len(x)
Question 5

This code is functionally correct. Rate its quality:

Aspect            | Rating
Variable naming   |
Formatting/style  |
Error handling    |
Documentation     |

Part 6: Evaluating Explanations

Prompt:

"Explain what this code does and fix any bugs:"

def binary_search(arr, target):
    left, right = 0, len(arr)
    while left < right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid
    return -1

LLM Response:

"This function performs binary search on a sorted array. It has a bug: the initial value of right should be len(arr) - 1, not len(arr). This could cause an index out of bounds error."

Question 6

Evaluate the LLM's analysis:
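One way to evaluate the claim is to test the original code empirically before accepting the stated bug. The brute-force harness below is an illustrative check, comparing binary search against a linear scan on small sorted arrays:

```python
def binary_search(arr, target):
    left, right = 0, len(arr)
    while left < right:
        mid = (left + right) // 2  # left < right, so mid is a valid index
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid
    return -1

# Exhaustively compare against a linear scan, including absent targets.
for n in range(6):
    arr = list(range(0, 2 * n, 2))  # sorted, distinct: [0, 2, 4, ...]
    for target in range(-1, 2 * n + 1):
        expected = arr.index(target) if target in arr else -1
        assert binary_search(arr, target) == expected
print("no failures found")
```

Does the empirical result match the LLM's description of the bug? Consider which interval convention (closed vs. half-open) the loop condition `left < right` implies.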

Part 7: Group Discussion

Question 7

Compare your evaluations with your group. Where did you disagree?

Part 8: Reflection

Question 8

Why is LLM code evaluation difficult?

Key Takeaway

Code evaluation is multi-dimensional and context-dependent.

  • "Correct" code may still be poor quality, insecure, or inefficient
  • Prompt ambiguity leads to legitimate interpretation differences
  • Security and safety concerns may override functional correctness
  • Evaluation requires domain expertise and clear rubrics
  • Instruction-following and output quality are both important