Build an AI Code Review Tool with API Integration

Create a Python-based code review assistant that uses LLM API calls to analyze code for bugs, security issues, style violations, and improvement opportunities.

Why AI Code Review Matters

Manual code reviews are essential but slow. Reviewers get fatigued, miss edge cases, and often focus on style over substance. AI code review automation does not replace human reviewers — it augments them by catching the obvious issues before a human ever looks at the code. This means human reviewers can focus on architecture decisions, business logic, and design patterns instead of spotting missing null checks.

In this project, you will build a Python tool that sends code to an LLM API and receives structured review feedback. You will design specialized prompts for different review concerns (bugs, security, readability) and learn how to parse and format the AI's feedback. This is a real tool you can integrate into your development workflow.

Project

Intermediate, 60 min

Project Overview

Build a Python-based AI code review tool that uses LLM APIs to analyze code. You will create prompts for bug detection, security analysis, and code quality review, then wire them together with API calls.
ChatGPT, Claude, Python

Setting Up the Foundation

The tool uses the OpenAI Python SDK, which also works with OpenAI-compatible APIs from other providers if you pass a different base_url to the client. Here is the basic structure that sends code to an LLM and gets a review back. This forms the backbone of the tool; everything else builds on this pattern.

import openai
import json
from pathlib import Path

client = openai.OpenAI()  # Uses OPENAI_API_KEY env var

def review_code(code: str, filename: str, review_type: str = "general") -> dict:
    """Send code to LLM for review and return structured feedback."""
    
    system_prompt = get_review_prompt(review_type)
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Review this file ({filename}):\n\n```\n{code}\n```"}
        ],
        response_format={"type": "json_object"},
        temperature=0.1  # Low temperature for consistent, precise analysis
    )
    
    return json.loads(response.choices[0].message.content)


def get_review_prompt(review_type: str) -> str:
    """Return the appropriate system prompt for the review type."""
    prompts = {
        "general": GENERAL_REVIEW_PROMPT,
        "security": SECURITY_REVIEW_PROMPT,
        "bugs": BUG_DETECTION_PROMPT,
    }
    return prompts.get(review_type, GENERAL_REVIEW_PROMPT)

We set temperature to 0.1 for code review. You want consistent, precise analysis, not creative interpretation. A low temperature concentrates sampling on the most likely tokens, which reduces (but does not fully eliminate) run-to-run variation in the model's assessments.
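The json.loads call in review_code assumes the reply is always a complete JSON object. In practice the message content can be empty, or the JSON can be cut off when the reply hits the token limit. Below is a minimal sketch of a more defensive parser; the name parse_review_response and the shape of the error dictionary are assumptions made for this example, not part of the SDK.

```python
import json
from typing import Optional


def parse_review_response(raw: Optional[str]) -> dict:
    """Parse the model's reply, returning an error-shaped dict instead of raising."""
    if not raw:
        # The SDK types message.content as optional, so it may be None or empty.
        return {"error": "empty response", "issues": []}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # A reply truncated by the token limit is one cause of invalid JSON.
        return {"error": f"invalid JSON: {exc.msg}", "issues": []}
    if not isinstance(data, dict):
        # The prompts ask for an object, so a bare list or string is unexpected.
        return {"error": "expected a JSON object", "issues": []}
    return data
```

With this helper, the last line of review_code could become return parse_review_response(response.choices[0].message.content), and the caller can check for an "error" key before formatting results.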

Prompt 1: General Code Quality Review

The general review prompt is the workhorse of the tool. It evaluates readability, naming, structure, and common anti-patterns. The key to making this work well is asking for structured JSON output with severity levels, line numbers, and specific suggestions — not just "this code could be better."

General Code Review Prompt

The system prompt for general code quality reviews. It enforces structured JSON output with severity levels and actionable suggestions.

You are an expert code reviewer. Analyze the provided code and return a JSON object with the following structure:

{
  "summary": "One-paragraph overall assessment",
  "score": 1-10,
  "issues": [
    {
      "severity": "critical" | "warning" | "suggestion",
      "line": <line number or null>,
      "category": "bug" | "readability" | "performance" | "naming" | "structure" | "duplication",
      "description": "What the issue is",
      "suggestion": "Specific fix or improvement",
      "code_before": "problematic code snippet",
      "code_after": "suggested replacement"
    }
  ],
  "positives": ["List of things done well"]
}

Rules:
- Be specific: include line numbers and code snippets in every issue
- Prioritize: critical issues first, suggestions last
- Be constructive: every issue must include a concrete suggestion
- Recognize good patterns: the "positives" list matters for morale
- Do not flag style preferences (tabs vs spaces, bracket placement); focus on substance

Best with: GPT-4o / Claude
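The prompt asks for a line number on every issue, but a model that receives unnumbered source has to count lines itself and can get them wrong. One common mitigation, sketched here under the assumption that you control the user message, is to prefix each line with its number before sending the code. The helper name number_lines is hypothetical.

```python
def number_lines(code: str) -> str:
    """Prefix every line with its 1-based line number, right-aligned."""
    lines = code.splitlines()
    # Pad to the width of the largest line number so the separators line up.
    width = len(str(len(lines)))
    return "\n".join(f"{n:>{width}} | {text}" for n, text in enumerate(lines, 1))


if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b\n"
    print(number_lines(sample))
```

If you adopt this, it is worth telling the model in the system prompt that the numeric prefixes are line labels and not part of the code, so that code_before and code_after snippets come back without them.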

Prompt 2: Security-Focused Review

Security review requires a different lens. The prompt needs to check for OWASP Top 10 vulnerabilities, injection risks, authentication flaws, and data exposure. Security prompts should be explicit about what to look for, because LLMs are better at working through a checklist than at an open-ended "find security issues" search.

Security Review Prompt

A security-focused review prompt that checks for OWASP vulnerabilities and returns structured findings with remediation steps.

You are a security-focused code reviewer. Analyze the provided code for security vulnerabilities.

Check for ALL of the following:
1. **Injection** — SQL injection, command injection, XSS, template injection
2. **Authentication/Authorization** — missing auth checks, privilege escalation, insecure session handling
3. **Data exposure** — logging sensitive data, hardcoded secrets, PII in error messages
4. **Input validation** — missing or insufficient validation, type confusion
5. **Cryptography** — weak algorithms, hardcoded keys, improper random number generation
6. **Dependencies** — known vulnerable patterns, unsafe deserialization

Return a JSON object:
{
  "risk_level": "high" | "medium" | "low" | "none",
  "vulnerabilities": [
    {
      "severity": "critical" | "high" | "medium" | "low",
      "type": "OWASP category or CWE ID",
      "line": <line number>,
      "description": "What the vulnerability is and how it could be exploited",
      "remediation": "Exact code change to fix this",
      "references": ["Link to relevant documentation"]
    }
  ],
  "secure_practices_found": ["List of security best practices already in the code"]
}

IMPORTANT: Do not flag theoretical risks with no practical exploit path. Every vulnerability must include a realistic attack scenario.

Best with: GPT-4o / Claude
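Requesting a JSON object guarantees syntactically valid JSON at best; it does not guarantee that the reply follows the schema described in the prompt. A small validator can catch missing fields or unexpected labels before the findings reach a human. The sketch below checks the security schema above; the function name and the choice of which fields count as required are assumptions for illustration.

```python
ALLOWED_RISK_LEVELS = ("high", "medium", "low", "none")
ALLOWED_SEVERITIES = ("critical", "high", "medium", "low")
REQUIRED_FIELDS = ("severity", "type", "description", "remediation")


def validate_security_review(review: dict) -> list:
    """Return a list of schema problems; an empty list means the review looks well-formed."""
    problems = []
    if review.get("risk_level") not in ALLOWED_RISK_LEVELS:
        problems.append(f"unexpected risk_level: {review.get('risk_level')!r}")
    vulnerabilities = review.get("vulnerabilities")
    if not isinstance(vulnerabilities, list):
        problems.append("vulnerabilities is missing or not a list")
        return problems
    for index, vuln in enumerate(vulnerabilities):
        if not isinstance(vuln, dict):
            problems.append(f"vulnerabilities[{index}] is not an object")
            continue
        # Every finding needs these fields to be actionable.
        for field in REQUIRED_FIELDS:
            if not vuln.get(field):
                problems.append(f"vulnerabilities[{index}] is missing {field}")
        severity = vuln.get("severity")
        if severity and severity not in ALLOWED_SEVERITIES:
            problems.append(f"vulnerabilities[{index}] has unknown severity {severity!r}")
    return problems
```

A non-empty result is a reasonable trigger for retrying the request or for flagging the review as unreliable rather than displaying it.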

Prompt 3: Bug Detection

Bug Detection Prompt

Focuses exclusively on finding bugs, logic errors, and unhandled edge cases. Returns findings with specific trigger conditions.

You are a bug-hunting code reviewer. Your job is to find logic errors, edge cases, and runtime failures.

Analyze the code for:
1. **Off-by-one errors** in loops, array access, and string manipulation
2. **Null/undefined handling** — variables that could be null but are not checked
3. **Race conditions** — concurrent access to shared state
4. **Resource leaks** — unclosed files, connections, or streams
5. **Type mismatches** — implicit conversions that could fail
6. **Edge cases** — empty inputs, very large inputs, negative numbers, unicode
7. **Logic errors** — conditions that are always true/false, unreachable code, wrong operators

Return a JSON object:
{
  "bugs_found": [
    {
      "severity": "critical" | "probable" | "possible",
      "line": <line number>,
      "description": "What the bug is",
      "trigger": "Specific input or condition that would trigger this bug",
      "fix": "Code to fix the issue"
    }
  ],
  "edge_cases_to_test": ["List of edge cases the code should be tested against"]
}

Best with: GPT-4o / Claude
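The three prompts use different severity vocabularies (critical/warning/suggestion, critical/high/medium/low, and critical/probable/possible), and nothing forces the model to return findings in priority order. A rough sketch of mapping all of these labels onto one rank so findings can be sorted is shown below. The grouping mirrors the icon groups used by the CLI formatter in this tutorial; the names SEVERITY_RANK and sort_by_severity are hypothetical.

```python
SEVERITY_RANK = {
    "critical": 0,
    "high": 1,
    "warning": 2, "probable": 2, "medium": 2,
    "suggestion": 3, "possible": 3, "low": 3,
}


def sort_by_severity(items: list) -> list:
    """Return findings ordered from most to least severe; unknown labels sort last."""
    unknown_rank = len(SEVERITY_RANK)
    # sorted() is stable, so findings with equal rank keep the model's order.
    return sorted(
        items,
        key=lambda item: SEVERITY_RANK.get(item.get("severity"), unknown_rank),
    )
```

Because the function returns a new list, it can be applied to "issues", "vulnerabilities", or "bugs_found" without mutating the parsed review.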

Wiring It Together: The CLI Tool

Now let's combine the prompts into a usable command-line tool. The CLI accepts a file path and review type, sends the code for review, and outputs formatted results.

import argparse
import sys

def format_issues(review: dict) -> str:
    """Format review results for terminal output."""
    output = []
    
    if "summary" in review:
        output.append(f"\n📋 Summary: {review['summary']}")
        output.append(f"   Score: {review.get('score', 'N/A')}/10\n")
    
    if "risk_level" in review:
        output.append(f"\n🔒 Risk Level: {review['risk_level'].upper()}\n")
    
    # Format issues/vulnerabilities/bugs
    items = review.get("issues", review.get("vulnerabilities", review.get("bugs_found", [])))
    
    severity_icons = {
        "critical": "🔴", "high": "🟠",
        "warning": "🟡", "probable": "🟡", "medium": "🟡",
        "suggestion": "🔵", "possible": "🔵", "low": "🔵"
    }
    
    for i, item in enumerate(items, 1):
        sev = item.get("severity", "info")
        icon = severity_icons.get(sev, "⚪")
        line = f" (line {item['line']})" if item.get("line") else ""
        output.append(f"{icon} #{i} [{sev.upper()}]{line}")
        output.append(f"   {item.get('description', '')}")
        fix = item.get("suggestion", item.get("fix", item.get("remediation", "")))
        if fix:
            output.append(f"   💡 Fix: {fix}")
        output.append("")
    
    # Positives
    for positive in review.get("positives", review.get("secure_practices_found", [])):
        output.append(f"✅ {positive}")
    
    return "\n".join(output)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="AI Code Reviewer")
    parser.add_argument("file", help="Path to the file to review")
    parser.add_argument(
        "--type", "-t",
        choices=["general", "security", "bugs"],
        default="general",
        help="Type of review to perform"
    )
    args = parser.parse_args()
    
    path = Path(args.file)
    if not path.is_file():
        sys.exit(f"Error: {args.file} is not a file")
    code = path.read_text(encoding="utf-8")
    print(f"Reviewing {args.file} ({args.type} review)...\n")
    
    result = review_code(code, args.file, args.type)
    print(format_issues(result))

Extending the Tool

Once the basic tool works, there are several ways to make it more powerful:

  • Run all three review types in parallel using Python's asyncio and aggregate the results
  • Add a --diff flag that reviews only changed lines from a git diff instead of the full file
  • Integrate with GitHub Actions to run the review automatically on every pull request
  • Add a --fix flag that applies suggested fixes automatically (with confirmation)
  • Cache results so re-running on unchanged files is instant
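The first extension, running all review types in parallel, might look roughly like the sketch below. Since review_code is a blocking function, the sketch hands each call to a worker thread with asyncio.to_thread (Python 3.9+) and collects the results with asyncio.gather. The fake_review function is a hypothetical stand-in so the example runs without an API key; in the real tool you would pass review_code as review_fn.

```python
import asyncio


def fake_review(code: str, filename: str, review_type: str) -> dict:
    """Hypothetical stand-in for review_code so the sketch runs without an API key."""
    return {"review_type": review_type, "filename": filename, "chars": len(code)}


async def review_all(code: str, filename: str, review_fn=fake_review) -> dict:
    """Run every review type concurrently and key the results by type."""
    review_types = ["general", "security", "bugs"]
    # asyncio.to_thread moves each blocking call onto a worker thread.
    tasks = [asyncio.to_thread(review_fn, code, filename, t) for t in review_types]
    # return_exceptions=True keeps one failed review from discarding the others.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return dict(zip(review_types, results))


if __name__ == "__main__":
    combined = asyncio.run(review_all("print('hi')\n", "hello.py"))
    for review_type, result in combined.items():
        print(review_type, result)
```

A failed review shows up as an exception object in the result dictionary, so the caller should check each value with isinstance before formatting it. An alternative design is to switch to the SDK's async client and await the calls directly, which avoids threads at the cost of rewriting review_code.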
Token cost matters for code review. A 500-line file is roughly 2,000-4,000 tokens, depending on line length and language, and each review type sends the whole file again. Running all three review types on a 10-file PR could cost roughly $0.10-0.50 with GPT-4o. Consider using a cheaper model for initial screening and reserving GPT-4o for flagged files.
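The arithmetic behind an estimate like this is simple enough to script. The sketch below assumes per-million-token pricing with separate input and output rates; the prices and the 500-token output size are placeholder assumptions for illustration, so substitute the current figures from your provider's pricing page.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Return the estimated cost in dollars for one API call."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Placeholder prices for illustration only (assumed, not authoritative).
INPUT_PRICE = 2.50    # dollars per million input tokens
OUTPUT_PRICE = 10.00  # dollars per million output tokens

files, review_types = 10, 3
# 3,000 input tokens is the midpoint of the 2,000-4,000 range for a 500-line file.
per_call = estimate_cost(3_000, 500, INPUT_PRICE, OUTPUT_PRICE)
total = per_call * files * review_types
print(f"per call: ${per_call:.4f}, whole PR: ${total:.3f}")
```

Under these assumed numbers the whole PR lands inside the $0.10-0.50 range quoted above, and the function makes it easy to see how the total moves when you swap in a cheaper model or review only diffs.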

Best Practices for AI Code Review

  1. Use structured JSON output — it makes results parseable and consistent across reviews
  2. Set temperature low (0.0-0.2) — you want precision, not creativity
  3. Include positive feedback — reviewers and authors both need to know what is going well
  4. Require specific line numbers — vague feedback like "consider improving error handling" is useless
  5. Validate the AI's findings — AI code review is a first pass, not the final word. Human review is still essential.
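Part of validating findings can be automated before a human looks at them. One cheap sanity check, sketched below, is to confirm that a reported line number exists in the file and that the quoted code_before snippet really appears in the source; a finding that fails either test is a candidate hallucination. The function name check_finding is hypothetical, and the exact substring match is a deliberate simplification.

```python
def check_finding(issue: dict, source: str) -> list:
    """Return reasons to distrust a finding; an empty list means it passed these checks."""
    reasons = []
    lines = source.splitlines()
    line_no = issue.get("line")
    if line_no is not None:
        # The schema allows null, so only validate line numbers that are present.
        if not isinstance(line_no, int) or not 1 <= line_no <= len(lines):
            reasons.append(f"line {line_no!r} is outside the file (1-{len(lines)})")
    snippet = (issue.get("code_before") or "").strip()
    if snippet and snippet not in source:
        # Exact match only: reformatted or re-indented snippets will also fail here.
        reasons.append("code_before does not appear in the source")
    return reasons
```

Passing these checks does not mean the finding is correct, only that it refers to code that exists, so human review of the remaining findings is still needed.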

Key Takeaways

  • AI code review augments human reviewers by catching obvious issues before a human ever looks at the code
  • Use specialized prompts for different review types (general, security, bugs) rather than one combined prompt
  • Require structured JSON output with line numbers and severity levels for parseable, actionable feedback
  • Set temperature low (0.0-0.2) for consistent, precise analysis
  • Always validate AI findings with human judgment — AI review is a first pass, not the final word
  • Consider token costs: review only changed lines (diffs) when possible, and use cheaper models for initial screening