
Accessibility Automation Guidelines: What to Automate vs. What Requires Human Testing

A practical priority matrix for accessibility testing automation. Learn which WCAG checks to automate first, which require manual testing, and how to balance both in your CI/CD pipeline.

One of the most common questions I get after my “Inclusive by Default” talks is: “Which accessibility checks should we automate first?”

The answer isn’t “automate everything”—and it’s definitely not “accessibility can’t be automated.” The truth is strategic: some checks are perfect for automation, others require human judgment, and knowing the difference will save you time while improving your coverage.

This guide provides a practical priority matrix for accessibility testing automation, helping you decide what to automate in your CI/CD pipeline versus what needs manual testing or user research.

The Automation Reality Check

Let’s start with an honest assessment:

Automated tools can catch approximately 30-40% of accessibility issues.

That sounds low, but here’s the key insight: that 30-40% includes the most common, most easily fixable issues—the low-hanging fruit that creates the foundation for accessibility.

The remaining 60-70% requires human judgment: Is this alt text meaningful? Is this navigation order logical? Does this interaction pattern make sense?

The Priority Matrix

I’ve organized accessibility checks into four quadrants based on two factors:

  1. Automation reliability (Can a tool reliably detect this?)
  2. Impact (How critical is this for users?)

Quadrant 1: Automate First (High Reliability + High Impact)

These checks are reliable, impactful, and should run on every build.

| Check | WCAG Criterion | Why Automate |
| --- | --- | --- |
| Missing accessible names | 4.1.2 | 100% detectable—element either has a name or doesn’t |
| Missing form labels | 1.3.1, 3.3.2 | Programmatically determinable |
| Color contrast ratios | 1.4.3 | Mathematical calculation |
| Invalid ARIA attributes | 4.1.2 | Validatable against spec |
| Duplicate IDs | 4.1.1 | Simple DOM check |
| Missing language attribute | 3.1.1 | Presence check on `<html>` |
| Missing page titles | 2.4.2 | Presence check on `<title>` |
| Broken ARIA references | 4.1.2 | ID existence validation |

Implementation:

# conftest.py - Run these on EVERY test
import pytest
from axe_selenium_python import Axe

@pytest.fixture(autouse=True)
def check_critical_accessibility(driver, request):
    """Automatically check critical accessibility on every test."""
    yield  # Run the test first

    # Only check page-level tests; others can opt out with @pytest.mark.skip_a11y
    if request.node.get_closest_marker('skip_a11y'):
        return

    axe = Axe(driver)
    axe.inject()

    # Check only the most reliable, high-impact rules
    results = axe.run(options={
        'runOnly': {
            'type': 'rule',
            'values': [
                'label',           # Form labels
                'button-name',     # Button accessible names
                'link-name',       # Link accessible names
                'image-alt',       # Image alt text presence
                'color-contrast',  # Color contrast
                'aria-valid-attr', # Valid ARIA attributes
                'duplicate-id',    # Duplicate IDs
                'html-has-lang',   # Language attribute
                'document-title',  # Page title
            ]
        }
    })

    violations = results['violations']
    if violations:
        # Fail the test with clear details
        violation_summary = "\n".join([
            f"- {v['id']}: {v['description']} ({len(v['nodes'])} instances)"
            for v in violations
        ])
        pytest.fail(
            f"Critical accessibility violations found:\n{violation_summary}"
        )

Quadrant 2: Automate with Caution (Medium Reliability + High Impact)

These can be automated but produce false positives. Use them for flagging, not failing builds.

| Check | WCAG Criterion | Challenge |
| --- | --- | --- |
| Heading hierarchy | 1.3.1 | Tool detects skipped levels, but some designs legitimately skip |
| Link purpose | 2.4.4 | Tool detects “click here,” but context matters |
| Focus order | 2.4.3 | Tool can trace order, but logic requires judgment |
| Keyboard traps | 2.1.2 | Detection is possible, but false positives in modals |
| Autocomplete attributes | 1.3.5 | Tool can check presence, but appropriateness varies |

Implementation:

# helpers/cautious_checks.py

def check_heading_hierarchy(driver) -> dict:
    """
    Check heading hierarchy - flags issues but explains context.

    Returns warnings, not failures, because skipped headings
    may be intentional in some designs.
    """
    script = """
    const headings = document.querySelectorAll('h1, h2, h3, h4, h5, h6');
    const levels = Array.from(headings).map(h => parseInt(h.tagName[1]));

    const issues = [];
    let prevLevel = 0;

    levels.forEach((level, index) => {
        // Check for skipped levels (h1 -> h3 without h2)
        if (level > prevLevel + 1 && prevLevel !== 0) {
            issues.push({
                type: 'skipped_level',
                from: prevLevel,
                to: level,
                element: headings[index].outerHTML.substring(0, 100),
                message: `Heading level skipped from h${prevLevel} to h${level}`
            });
        }
        prevLevel = level;
    });

    // Check for multiple h1s (usually wrong, but not always)
    const h1Count = levels.filter(l => l === 1).length;
    if (h1Count > 1) {
        issues.push({
            type: 'multiple_h1',
            count: h1Count,
            message: `Found ${h1Count} h1 elements (typically should be 1)`
        });
    }

    // Check for no h1 at all
    if (h1Count === 0) {
        issues.push({
            type: 'no_h1',
            message: 'No h1 element found on page'
        });
    }

    return {
        headings: levels,
        issues: issues,
        recommendation: issues.length > 0
            ? 'Review heading structure manually'
            : 'Heading structure appears correct'
    };
    """
    return driver.execute_script(script)


def warn_on_heading_issues(driver):
    """
    Issue warnings (not failures) for heading structure issues.

    Use in CI to flag for review without blocking deployment.
    """
    results = check_heading_hierarchy(driver)

    if results['issues']:
        import warnings
        for issue in results['issues']:
            warnings.warn(
                f"Accessibility Review Needed: {issue['message']}",
                UserWarning
            )

    return results
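
The same warn-but-don’t-fail pattern works for the link purpose check from the table above. Here is a minimal sketch along those lines; the list of generic phrases is illustrative, not exhaustive, and every hit still needs a human to judge the surrounding context:

# helpers/cautious_checks.py (continued)
import warnings

GENERIC_LINK_TEXT = {'click here', 'here', 'read more', 'more', 'learn more', 'link'}


def check_link_text(driver) -> list:
    """
    Flag links whose visible text is generic ("click here", "read more").

    Returns candidates for review, not failures: surrounding context may
    make the purpose clear, so each hit requires human judgment.
    """
    links = driver.execute_script("""
        return Array.from(document.querySelectorAll('a[href]')).map(a => ({
            text: a.textContent.trim(),
            href: a.getAttribute('href')
        }));
    """)
    return [link for link in links if link['text'].lower() in GENERIC_LINK_TEXT]


def warn_on_generic_links(driver):
    """Issue warnings (not failures) for generic link text."""
    for link in check_link_text(driver):
        warnings.warn(
            f"Accessibility Review Needed: generic link text "
            f"'{link['text']}' -> {link['href']}",
            UserWarning
        )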

Quadrant 3: Manual Testing Required (Low Reliability + High Impact)

These are critical but cannot be reliably automated. Schedule regular manual reviews.

| Check | WCAG Criterion | Why Manual |
| --- | --- | --- |
| Meaningful alt text | 1.1.1 | Tool checks presence, not meaning |
| Logical reading order | 1.3.2 | Requires understanding content |
| Consistent navigation | 3.2.3 | Requires cross-page comparison |
| Error identification | 3.3.1 | Requires understanding context |
| Input purpose | 1.3.5 | Requires understanding field purpose |
| Resize/reflow | 1.4.4, 1.4.10 | Requires visual inspection |
| Motion/animation safety | 2.3.1, 2.3.3 | Requires content understanding |

Implementation Strategy:

## Manual Accessibility Testing Checklist

### Before Each Release

Run through these checks on key user journeys:

#### Alt Text Quality (WCAG 1.1.1)

- [ ] Do images have alt text that describes their purpose?
- [ ] Are decorative images marked with `alt=""`?
- [ ] Do complex images have extended descriptions?

#### Reading Order (WCAG 1.3.2)

- [ ] Does content make sense when CSS is disabled?
- [ ] Is the DOM order logical for screen readers?
- [ ] Do modals/overlays announce in correct sequence?

#### Error Messages (WCAG 3.3.1)

- [ ] Are errors clearly described?
- [ ] Is it clear how to fix each error?
- [ ] Are errors associated with their fields?

#### Cognitive Load

- [ ] Is navigation consistent across pages?
- [ ] Are instructions clear and not overwhelming?
- [ ] Can users easily recover from mistakes?
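
Automation can still support these manual reviews by gathering the evidence a reviewer needs. As one example, here is a small sketch (assuming the same Selenium driver fixture used earlier) that exports every image's alt text so the quality review in the checklist above goes faster:

# helpers/manual_review_support.py
import json


def export_alt_text_inventory(driver, output_path='reports/alt-text-inventory.json'):
    """
    Collect every image's src and alt text so a human can judge the meaning.

    Automated tools only check that alt text exists; this export puts all
    of it in one place for the pre-release review.
    """
    images = driver.execute_script("""
        return Array.from(document.querySelectorAll('img')).map(img => ({
            src: img.getAttribute('src'),
            alt: img.getAttribute('alt'),  // null = missing, '' = decorative
            decorative: img.getAttribute('alt') === ''
        }));
    """)

    with open(output_path, 'w') as f:
        json.dump(images, f, indent=2)

    return images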

Quadrant 4: User Research Required (Low Reliability + Variable Impact)

These require testing with actual users who have disabilities.

| Check | WCAG Criterion | Why User Testing |
| --- | --- | --- |
| Screen reader experience | Multiple | Real-world usage patterns |
| Cognitive accessibility | 3.1.x, 3.3.x | User understanding varies |
| Motor accessibility | 2.1.x | Individual needs vary |
| Assistive technology compatibility | 4.1.x | AT behavior varies |

Implementation Strategy:

## User Testing Program

### Quarterly User Testing Sessions

Partner with users who rely on:

- Screen readers (JAWS, NVDA, VoiceOver)
- Voice control (Dragon, Voice Control)
- Switch devices
- Screen magnification

### Key Questions

1. Can you complete the core user journey?
2. Where did you encounter friction?
3. What was confusing or unclear?
4. What worked well?

### Metrics to Track

- Task completion rate
- Time to complete tasks
- Error rate
- Satisfaction score
- Qualitative feedback themes
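
To keep those numbers comparable from one quarter to the next, it helps to record every session in a consistent shape. A minimal sketch; the field names are illustrative, not a prescribed schema:

# helpers/user_testing_metrics.py
from dataclasses import dataclass, field


@dataclass
class UserTestingSession:
    """One participant's results from a quarterly testing session."""
    assistive_technology: str              # e.g. "NVDA + Firefox", "VoiceOver + Safari"
    tasks_attempted: int
    tasks_completed: int
    errors: int
    satisfaction_score: float              # e.g. on a 1-5 scale
    qualitative_notes: list[str] = field(default_factory=list)

    @property
    def completion_rate(self) -> float:
        """Task completion rate for this session (0.0 to 1.0)."""
        return self.tasks_completed / self.tasks_attempted if self.tasks_attempted else 0.0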

Building Your Automation Pipeline

Here’s how to structure your CI/CD pipeline with these quadrants in mind:

Stage 1: Build-Time Checks (Blocking)

# .github/workflows/accessibility.yml

name: Accessibility CI

on: [push, pull_request]

jobs:
  accessibility-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install dependencies
        run: npm ci

      - name: Run critical accessibility tests
        run: npm run test:a11y:critical
        # These tests FAIL the build:
        # - Missing form labels
        # - Missing button/link names
        # - Color contrast below 4.5:1
        # - Invalid ARIA
        # - Duplicate IDs

      - name: Run cautious accessibility checks
        run: npm run test:a11y:review
        continue-on-error: true # Don't fail, but report
        # These tests WARN but don't fail:
        # - Heading hierarchy
        # - Link text quality
        # - Focus order suggestions
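
If your suite is pytest-based rather than driven by npm scripts, the same blocking/warning split can be expressed with markers. A minimal sketch, using hypothetical marker names `a11y_critical` and `a11y_review`:

# conftest.py - register markers for the blocking / warning split

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "a11y_critical: blocking accessibility checks (fail the build)"
    )
    config.addinivalue_line(
        "markers", "a11y_review: cautious accessibility checks (report, never block)"
    )

# The two CI steps then become:
#   pytest -m a11y_critical            # blocking step
#   pytest -m a11y_review || true      # warning step, never fails the job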

Stage 2: Pre-Release Checks (Warning)

accessibility-review:
  runs-on: ubuntu-latest
  if: github.event_name == 'pull_request' && github.base_ref == 'main'
  steps:
    - name: Generate accessibility report
      run: npm run test:a11y:full

    - name: Upload report
      uses: actions/upload-artifact@v3
      with:
        name: accessibility-report
        path: reports/accessibility/

    - name: Comment on PR with findings
      uses: actions/github-script@v6
      with:
        script: |
          const report = require('./reports/accessibility/summary.json');
          const comment = `## Accessibility Review Required

          **Automated findings:** ${report.violations} issues
          **Manual review items:** ${report.reviewItems} items

          Please review the [full report](${report.reportUrl}) before release.`;

          github.rest.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: comment
          });

Stage 3: Scheduled Comprehensive Audits

weekly-audit:
  runs-on: ubuntu-latest
  if: github.event_name == 'schedule'
  steps:
    - name: Run full Lighthouse audit
      run: npx lhci autorun

    - name: Run axe comprehensive scan
      run: npm run test:a11y:comprehensive

    - name: Run Pa11y on all pages
      run: npm run test:pa11y:sitemap

    - name: Create tracking issue
      if: failure()
      uses: actions/github-script@v6
      with:
        script: |
          github.rest.issues.create({
            owner: context.repo.owner,
            repo: context.repo.repo,
            title: 'Weekly Accessibility Audit - Issues Found',
            body: 'See attached report for details.',
            labels: ['accessibility', 'automated-audit']
          });

The Decision Framework

When deciding what to automate, ask these questions:

1. Can the check be expressed as a binary pass/fail?

  • ✅ “Does this button have an accessible name?” → Automate
  • ❌ “Is this alt text meaningful?” → Manual

2. Is the check consistent across contexts?

  • ✅ “Is color contrast at least 4.5:1?” → Automate
  • ❌ “Is this navigation order logical?” → Manual (context-dependent)

3. What’s the false positive rate? (see the routing sketch after this framework)

  • < 5% false positives → Automate as blocking
  • 5-20% false positives → Automate as warning
  • > 20% false positives → Manual review only

4. What’s the cost of missing it?

  • Legal/compliance risk → Automate what you can, manual review the rest
  • User experience impact → Prioritize based on user journey importance
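
One way to encode the answers to questions 1-3 is to keep two rule lists and route each violation to a hard failure or a warning depending on how reliable the rule is. A minimal sketch along those lines; the rule IDs mirror the quadrants above, and the warning set is illustrative:

# helpers/violation_routing.py
import warnings

import pytest

# Quadrant 1: reliable, high impact -> fail the build
BLOCKING_RULES = {
    'label', 'button-name', 'link-name', 'image-alt',
    'color-contrast', 'aria-valid-attr', 'duplicate-id',
    'html-has-lang', 'document-title',
}

# Quadrant 2: useful but prone to false positives -> warn only
WARNING_RULES = {
    'heading-order', 'link-in-text-block', 'tabindex',
}


def route_violations(violations):
    """Split axe-core violations into blocking failures and review warnings."""
    blocking = [v for v in violations if v['id'] in BLOCKING_RULES]
    review = [v for v in violations if v['id'] in WARNING_RULES]

    for v in review:
        warnings.warn(
            f"Accessibility Review Needed: {v['id']} ({len(v['nodes'])} instances)",
            UserWarning
        )

    if blocking:
        summary = "\n".join(
            f"- {v['id']}: {v['description']}" for v in blocking
        )
        pytest.fail(f"Critical accessibility violations found:\n{summary}")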

Based on my experience, here’s the tool combination that gives the best coverage:

| Tool | Best For | Use In |
| --- | --- | --- |
| axe-core | Comprehensive rule set, low false positives | CI/CD blocking |
| Lighthouse | Performance + accessibility combined | Weekly audits |
| Pa11y | Page-level scanning, sitemap crawling | Scheduled audits |
| WAVE | Visual feedback during development | Developer tooling |
| IBM Equal Access | Enterprise compliance reporting | Quarterly audits |

Integration Example

# test_comprehensive_accessibility.py

import json
import subprocess
import warnings

import pytest
from axe_selenium_python import Axe
# Pa11y and Lighthouse are invoked through their CLIs via subprocess below


class TestComprehensiveAccessibility:
    """
    Comprehensive accessibility test suite.

    Run weekly or before major releases.
    """

    def test_axe_full_scan(self, driver):
        """Run full axe-core scan with all rules."""
        driver.get("https://yoursite.com")

        axe = Axe(driver)
        axe.inject()
        results = axe.run()

        # Separate critical from moderate issues
        critical = [v for v in results['violations']
                   if v['impact'] in ['critical', 'serious']]
        moderate = [v for v in results['violations']
                   if v['impact'] in ['moderate', 'minor']]

        # Fail on critical, warn on moderate
        if critical:
            pytest.fail(f"Critical accessibility issues: {len(critical)}")

        if moderate:
            warnings.warn(
                f"Moderate accessibility issues: {len(moderate)}", UserWarning
            )

    def test_lighthouse_accessibility_score(self, driver):
        """Check Lighthouse accessibility score meets threshold."""
        result = subprocess.run([
            'lighthouse',
            'https://yoursite.com',
            '--output=json',
            '--output-path=stdout',  # write the JSON report to stdout instead of a file
            '--only-categories=accessibility',
            '--chrome-flags=--headless'
        ], capture_output=True, text=True)

        report = json.loads(result.stdout)
        score = report['categories']['accessibility']['score'] * 100

        assert score >= 90, f"Lighthouse accessibility score {score} below 90"

    def test_pa11y_pages(self):
        """Run Pa11y on key pages."""
        pages = [
            'https://yoursite.com/',
            'https://yoursite.com/login',
            'https://yoursite.com/checkout',
            'https://yoursite.com/contact'
        ]

        for page in pages:
            result = subprocess.run([
                'pa11y', page, '--reporter', 'json'
            ], capture_output=True, text=True)

            issues = json.loads(result.stdout)
            errors = [i for i in issues if i['type'] == 'error']

            assert len(errors) == 0, f"Pa11y errors on {page}: {len(errors)}"

Summary: Your Accessibility Automation Strategy

  1. Automate the fundamentals (Quadrant 1) - Run on every build
  2. Flag the uncertain (Quadrant 2) - Warn but don’t block
  3. Schedule manual reviews (Quadrant 3) - Before each release
  4. Invest in user testing (Quadrant 4) - Quarterly minimum

The goal isn’t 100% automation—it’s efficient coverage that catches the detectable issues automatically while reserving human attention for what truly requires human judgment.


This article is part of my “Inclusive by Default” series on building accessibility into test automation. For the technical implementation of accessibility helpers, see Building Accessibility into Your Selenium Test Automation.

Ruby Jane Cabagnot

Accessibility Cloud Engineer

Building inclusive digital experiences through automated testing and AI-powered accessibility tools. Passionate about making the web accessible for everyone.

Related Topics:

#accessibility testing #test automation #WCAG #manual testing #CI/CD #quality assurance #testing strategy