ESG Automation System | Zachary Tipton

View on GitHub

Project Overview

Large enterprises manually process thousands of utility bills each year for ESG reporting, which means high labor costs, human error, and slow reporting cycles. AI-only alternatives are expensive ($10-20 per 1,000 bills). This system automates the workflow from bill upload to GRI-compliant PDF report, cutting processing time from hours to seconds per bill and AI processing costs by 95%.

Key Performance Metrics

Cost Reduction: 95% savings vs. traditional AI solutions ($0.50-1.00 per 1,000 bills, down from $10-20)
Processing Speed: 2-4 seconds per bill (vs. hours of manual entry)
Accuracy: 95%+ extraction accuracy with Claude Vision fallback
Local Processing: 95% of bills processed locally with Docling and Tesseract OCR, at zero API cost
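
The cost figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative and not taken from the project's code: the function name `api_cost_per_1000` is hypothetical, and the inputs are the figures quoted on this page (5% of bills reaching Claude Vision at roughly $0.01-0.02 each, against $10-20 per 1,000 bills for AI-only processing). It counts API spend only and ignores local compute.

```python
def api_cost_per_1000(cloud_share: float, cost_per_cloud_bill: float) -> float:
    """API spend per 1,000 bills when only `cloud_share` of them reach the paid tier."""
    return 1000 * cloud_share * cost_per_cloud_bill


# Inputs quoted in this write-up (not measured here).
low = api_cost_per_1000(0.05, 0.01)   # cheapest case: $0.01 per cloud bill
high = api_cost_per_1000(0.05, 0.02)  # most expensive case: $0.02 per cloud bill

# Compare like with like: low end against $10, high end against $20.
savings_low = 1 - low / 10.0
savings_high = 1 - high / 20.0

print(f"API cost per 1,000 bills: ${low:.2f} to ${high:.2f}")
print(f"Savings vs. AI-only: {savings_low:.0%} to {savings_high:.0%}")
```

Under these assumptions the result lands on the $0.50-1.00 range and the 95% saving quoted above; a different share of cloud-routed bills would shift both numbers proportionally.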

Technical Skills Demonstrated

AI Integration & Architecture

3-Tier Extraction Strategy: Designed intelligent fallback system using Docling (local) → Tesseract OCR (local) → Claude Vision API (cloud) to optimize cost and accuracy
Claude API Integration: Implemented production-grade API calls with error handling, rate limiting, and cost tracking
Vision AI: Leveraged Claude Vision for complex PDF layouts that defeat traditional OCR
Prompt Engineering: Crafted structured prompts for consistent data extraction and GRI-compliant report generation
Cost Optimization: Reduced AI processing costs by 95% through intelligent tier selection and local-first architecture

Data Engineering & Processing

PDF Processing Pipeline: Built multi-modal extraction system handling text-based PDFs, scanned documents, and complex layouts
Data Validation: Implemented comprehensive validation including hallucination detection, completeness checks, and rate sanity verification ($0.01-$5.00/kWh)
Unit Conversion: Automated kWh/MWh normalization and meter reading calculations
Batch Processing: Designed system to handle multiple bills simultaneously with parallel processing
Audit Trails: Complete extraction methodology logging for compliance verification
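
To make the validation and unit-conversion items above concrete, here is a minimal sketch of what such checks might look like. It is not the project's actual code: the function names, the record layout, and the `total_cost` field are assumptions made for illustration. Only the $0.01-$5.00/kWh bounds and the extracted field list come from this page.

```python
# Fields this page says are extracted; the snake_case key names are assumptions.
REQUIRED_FIELDS = ("utility_name", "account_number", "billing_period", "kwh_usage")
RATE_MIN, RATE_MAX = 0.01, 5.00  # plausible $/kWh bounds quoted above


def to_kwh(value: float, unit: str) -> float:
    """Normalize an energy reading to kWh (supports kWh and MWh only)."""
    unit = unit.strip().lower()
    if unit == "kwh":
        return float(value)
    if unit == "mwh":
        return float(value) * 1000.0
    raise ValueError(f"unsupported unit: {unit!r}")


def validate_bill(record: dict) -> list[str]:
    """Return a list of human-readable issues; an empty list means the record passed."""
    issues = []
    # Completeness check: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"missing field: {field}")
    # Rate sanity check: implied $/kWh must fall inside the plausible range.
    kwh = record.get("kwh_usage")
    total = record.get("total_cost")  # hypothetical field, not named in the source
    if kwh and total is not None:
        rate = total / kwh
        if not RATE_MIN <= rate <= RATE_MAX:
            issues.append(f"implausible rate: ${rate:.4f}/kWh")
    return issues


bill = {
    "utility_name": "Example Power",  # made-up sample data
    "account_number": "000-111",
    "billing_period": "2024-01",
    "kwh_usage": to_kwh(1.5, "MWh"),
    "total_cost": 180.0,
}
print(validate_bill(bill))
```

An implausible implied rate is one cheap signal that an extracted number may have been misread or hallucinated, which is why a check like this fits naturally ahead of the emissions calculation.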

Production Application Development

Streamlit Web Interface: Built production-grade UI with file upload, real-time processing, and interactive dashboards
Session Management: Implemented state management for multi-bill processing and cost tracking
Error Handling: Comprehensive exception handling with user-friendly error messages and automatic fallback logic
Cloud Deployment: Configured for Streamlit Cloud with automatic dependency installation (Tesseract, Poppler)
Environment Configuration: Secure API key management using environment variables and Streamlit secrets
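
As an illustration of the environment configuration item above, the following sketch resolves an API key from a secrets mapping first and an environment variable second. The helper name `load_api_key` and the lookup order are assumptions, not the project's code. `ANTHROPIC_API_KEY` is the variable name the Anthropic Python SDK reads by default.

```python
import os
from collections.abc import Mapping
from typing import Optional


def load_api_key(
    secrets: Optional[Mapping] = None,
    env: Mapping = os.environ,
    name: str = "ANTHROPIC_API_KEY",
) -> str:
    """Return the API key from `secrets` if present, else from `env`.

    Raises RuntimeError instead of returning an empty key, so a missing
    credential fails loudly at startup rather than on the first API call.
    """
    if secrets is not None and name in secrets:
        return secrets[name]
    key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} is not configured")
    return key


# Demonstration with explicit mappings (dummy values, never real keys).
print(load_api_key(secrets={"ANTHROPIC_API_KEY": "dummy-from-secrets"}, env={}))
print(load_api_key(env={"ANTHROPIC_API_KEY": "dummy-from-env"}))
```

In a Streamlit app, `st.secrets` could be passed as the `secrets` argument; keeping the lookup in one function means the key never needs to appear in source code or logs.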

ESG & Compliance Domain Knowledge

GRI Standards: Implemented GRI 305-2 (Energy indirect (Scope 2) GHG emissions) compliance reporting
EPA eGRID Integration: Accurate emission factor calculations using EPA 2023 regional data
Multi-Region Support: Configured for US Average, Arkansas, California, Texas, New York, and Florida emission factors
Professional Reporting: Generated publication-ready PDF reports with methodology documentation and validation statements
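
The emissions calculation itself is a unit conversion: electricity use multiplied by a regional emission factor. A minimal sketch follows, assuming the factor is expressed in lb CO2 per MWh (the unit eGRID uses for output emission rates). The function name is hypothetical, and the factor in the example is a placeholder, not a real eGRID value.

```python
KG_PER_LB = 0.45359237  # exact, by definition of the international pound


def scope2_emissions_tco2(kwh: float, factor_lb_per_mwh: float) -> float:
    """Convert electricity use to metric tons of CO2 using a lb/MWh emission factor."""
    if kwh < 0 or factor_lb_per_mwh < 0:
        raise ValueError("kwh and factor_lb_per_mwh must be non-negative")
    mwh = kwh / 1000.0                  # kWh -> MWh
    pounds = mwh * factor_lb_per_mwh    # MWh * lb/MWh -> lb CO2
    return pounds * KG_PER_LB / 1000.0  # lb -> kg -> metric tons


# PLACEHOLDER factor for illustration only; look up the real regional
# value in the EPA eGRID tables before reporting anything.
placeholder_factor = 800.0
tons = scope2_emissions_tco2(12_500, placeholder_factor)
print(f"{tons:.3f} t CO2")
```

Passing the factor in as a parameter, rather than hard-coding a table, keeps the arithmetic testable and makes the source of each factor explicit in the audit trail.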

Python & Software Engineering

Modular Architecture: Clean separation of concerns across extraction, calculation, validation, and reporting modules
PDF Generation: ReportLab integration for professional document creation
API Client Development: Anthropic API integration with proper error handling and response parsing
Document AI Libraries: Docling (IBM) and Tesseract OCR integration for local processing
Testing & Validation: Built comprehensive validation framework to detect extraction errors

System Architecture

The ESG Automation System uses an intelligent 3-tier extraction strategy that prioritizes cost-effectiveness while maintaining high accuracy:

Tier 1: Docling (Local Processing)

IBM's open-source document AI processes text-based PDFs locally at zero cost. Handles 85% of standard utility bills with 85-90% accuracy in 2-3 seconds.

Tier 2: Tesseract OCR (Local Processing)

Open-source OCR processes scanned/image PDFs locally at zero cost. Handles 10% of bills with 70-85% accuracy in 3-5 seconds.

Tier 3: Claude Vision API (Cloud Fallback)

Anthropic's Claude Vision API handles complex layouts when local methods fail. Processes 5% of bills at ~$0.01-0.02 per bill with 95%+ accuracy in 2-4 seconds.
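
The three tiers above form a fallback chain: try the cheapest extractor first and move on only when it fails or returns an unusable result. The sketch below shows that control flow with stub extractors standing in for Docling, Tesseract, and Claude Vision. None of it is the project's code: `extract_with_fallback`, the stubs, and the acceptance check are all hypothetical.

```python
from typing import Callable, Optional

Extractor = Callable[[bytes], Optional[dict]]


def extract_with_fallback(
    pdf_bytes: bytes,
    tiers: list[tuple[str, Extractor]],
    is_acceptable: Callable[[dict], bool],
) -> dict:
    """Try each tier in order; return the first acceptable result plus an attempt log."""
    attempts = []
    for name, extractor in tiers:
        try:
            result = extractor(pdf_bytes)
        except Exception as exc:  # a failing tier must not abort the chain
            attempts.append((name, f"error: {exc}"))
            continue
        if result is not None and is_acceptable(result):
            attempts.append((name, "ok"))
            return {"tier": name, "data": result, "attempts": attempts}
        attempts.append((name, "rejected"))
    raise RuntimeError(f"all tiers failed: {attempts}")


# Stubs standing in for the real extractors (hypothetical behavior).
def docling_stub(pdf: bytes) -> Optional[dict]:
    return None  # pretend no usable text layer was found


def tesseract_stub(pdf: bytes) -> Optional[dict]:
    raise RuntimeError("OCR engine unavailable")


def claude_stub(pdf: bytes) -> Optional[dict]:
    return {"utility_name": "Example Power", "kwh_usage": 1234.0}


outcome = extract_with_fallback(
    b"%PDF-dummy",
    [("docling", docling_stub), ("tesseract", tesseract_stub), ("claude_vision", claude_stub)],
    is_acceptable=lambda r: "kwh_usage" in r,
)
print(outcome["tier"], outcome["attempts"])
```

Returning the attempt log alongside the data records which tier produced each value and why earlier tiers were skipped, which is the kind of information an extraction audit trail needs.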

Processing Pipeline

Upload & Validation: PDF uploaded via Streamlit interface, validated for format and size
Intelligent Routing: System selects optimal extraction tier based on document characteristics
Data Extraction: Utility name, account number, billing period, kWh usage extracted with structured JSON output
Quality Validation: Completeness checks, rate sanity verification, hallucination detection
Emissions Calculation: EPA eGRID factors applied based on selected region
Report Generation: GRI 305-2 compliant PDF with full methodology documentation
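
The routing step in the pipeline above is described only as being based on document characteristics. One plausible heuristic, shown purely as an assumption, is to check whether the PDF already carries an embedded text layer: text-based PDFs start at Docling, image-only scans start at Tesseract, and Claude Vision is reached only through fallback. The function name and the 200-characters-per-page threshold are hypothetical.

```python
def choose_start_tier(
    embedded_text_chars: int,
    page_count: int,
    min_chars_per_page: int = 200,  # hypothetical threshold, tune on real bills
) -> str:
    """Pick the first extraction tier from how much embedded text the PDF has."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    chars_per_page = embedded_text_chars / page_count
    # Enough embedded text suggests a text-based PDF; otherwise assume a scan.
    if chars_per_page >= min_chars_per_page:
        return "docling"
    return "tesseract"


print(choose_start_tier(embedded_text_chars=5000, page_count=2))
print(choose_start_tier(embedded_text_chars=0, page_count=3))
```

A check like this costs almost nothing and avoids running a text-oriented parser on a scanned image, though the actual system may route on different signals.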

Business Impact

This system demonstrates practical application of AI to solve real enterprise problems:

95% Cost Reduction: From $10-20 per 1,000 bills (traditional AI) to $0.50-1.00 per 1,000 bills
Time Savings: Processing reduced from hours per bill to seconds per bill
Scalability: Handles batch processing for enterprise-scale operations
Accuracy: Replaces manual data entry, and the errors that come with it, with validated extraction while maintaining audit trails
Compliance: Ensures GRI 305-2 reporting standards are met consistently

Tools & Technologies

Anthropic Claude API: Vision-based PDF extraction, report generation, data validation
Docling (IBM): Local document AI for text-based PDF processing
Tesseract OCR: Local optical character recognition for scanned documents
Streamlit: Production web application framework
ReportLab: Professional PDF generation for GRI reports
Python: Core application development, data processing, API integration
EPA eGRID: Regional emission factor database

Key Takeaways

This project demonstrates several critical principles in production AI systems:

Cost-Effective AI Architecture: Strategic use of local processing (free) before cloud APIs (paid) reduces costs by 95% while maintaining quality
Intelligent Fallback Systems: Multi-tier extraction ensures reliability. When cheaper methods fail, more sophisticated (and more expensive) methods take over
Production-Ready Design: Comprehensive error handling, validation, and audit trails make this system enterprise-grade
Domain Knowledge Integration: Understanding ESG compliance requirements (GRI standards, EPA factors) is as important as technical implementation
User-Centric Development: Clean interface, real-time feedback, and clear documentation make complex AI systems accessible to non-technical users

Potential Enterprise Enhancements

If moving this system into production, key improvements would include:

Prompt Caching: Anthropic's prompt caching could cut the cost of repeated prompt content (such as a shared extraction prompt) by up to 90% on cached input tokens
Batch API: Process 100+ bills simultaneously with async requests
ERP Integration: SAP/SharePoint connectors for automatic bill ingestion
Extended Scope: Scope 1 (natural gas, fleet) and Scope 3 (supply chain) emissions
Multi-Standard Reporting: GRI, SASB, TCFD, CDP compliance
Anomaly Detection: AI-powered flagging of unusual consumption patterns
