Problem
University writing centers collect session data through platforms like Penji, but analyzing that data means handling FERPA-protected student records. Cloud-based tools introduce compliance risk. Manual analysis is slow and inconsistent. The Writing Studio needed a way to generate standardized analytics reports without student data ever leaving the machine.
Solution
A self-contained Streamlit application that:
- Ingests Penji CSV/Excel exports and produces multi-section PDF reports with anonymized datasets
- Detects and removes all personally identifiable information automatically
- Provides an optional AI chat interface powered by a local LLM (no cloud APIs)
- Ships as a portable package — no Python installation required, runs from a USB drive or shared folder
Key Features
Report Generation
- Dual-mode support: Separate cleaning pipelines and report templates for 40-minute scheduled appointments and walk-in sessions
- Auto-detection: Identifies session type from column patterns in the uploaded file (see the sketch after this list)
- 11-section PDF reports (scheduled) covering booking behavior, attendance outcomes, student satisfaction, tutor workload, session content, incentive analysis, data quality, and semester comparisons
- Anonymized CSV export alongside the PDF for further analysis
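As a rough illustration of the auto-detection bullet above, the session type can be inferred from which columns appear in the upload. The marker column names below are placeholders, not the actual Penji export schema:

```python
import pandas as pd

# Hypothetical marker columns; the real detection keys off the project's own
# Penji column patterns, which may differ.
SCHEDULED_MARKERS = {"Appointment Time", "Booking Created"}
WALKIN_MARKERS = {"Check-In Time"}

def detect_session_type(df: pd.DataFrame) -> str:
    """Guess whether an upload is a scheduled-appointment or walk-in export."""
    cols = set(df.columns)
    if cols & SCHEDULED_MARKERS:
        return "scheduled"
    if cols & WALKIN_MARKERS:
        return "walkin"
    return "unknown"
```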
Privacy & Security
- Two-layer PII detection: Exact column name matching + regex pattern scanning for emails, SSO IDs, and names
- SHA256-based anonymization: Deterministic anonymous IDs (STU_04521, TUT_0842) with collision handling (see the sketch after this list)
- Encrypted codebook: Optional reverse-lookup file using PBKDF2-HMAC-SHA256 key derivation (100,000 iterations) and Fernet symmetric encryption. Requires a 12+ character password to decrypt, allowing supervisors to identify flagged individuals when needed
- Response filtering: AI chat responses are scanned for PII leakage before display
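A minimal sketch of the anonymization pieces above (deterministic IDs plus the password-protected codebook), using illustrative function names; the real implementation also handles hash collisions and ties into the PII scanner:

```python
import base64
import hashlib
import json
import os

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def anonymize_id(email: str, prefix: str = "STU") -> str:
    """Deterministic anonymous ID: the same email always maps to the same ID."""
    digest = hashlib.sha256(email.strip().lower().encode()).hexdigest()
    return f"{prefix}_{int(digest[:8], 16) % 100_000:05d}"

def encrypt_codebook(mapping: dict, password: str) -> bytes:
    """Encrypt the reverse-lookup codebook with a password-derived Fernet key."""
    if len(password) < 12:
        raise ValueError("password must be at least 12 characters")
    salt = os.urandom(16)
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=100_000)
    key = base64.urlsafe_b64encode(kdf.derive(password.encode()))
    # Salt is prepended so the decryption side can re-derive the same key.
    return salt + Fernet(key).encrypt(json.dumps(mapping).encode())
```

Because the ID is a pure function of the normalized email, re-running the pipeline on a new export yields the same IDs, which is what makes longitudinal comparison possible without storing raw identifiers.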
AI Chat Assistant
- Model: Gemma 3 4B Instruct (Q4_0 quantized GGUF, ~3 GB)
- Inference: llama-cpp-python running locally on CPU, with optional CUDA/Metal GPU acceleration (see the loading sketch after this list)
- Safety: Weighted input scoring system with 78 off-topic patterns, 43 jailbreak detection regexes, and post-generation response filtering
- Code execution: LLM can generate and run pandas/DuckDB queries against the loaded dataset in a restricted sandbox
- Query logging: All prompts logged with timestamps and accept/reject status to logs/queries.log
- Graceful degradation: If the model file isn't downloaded, the tab shows a one-click download button with a progress bar. If the inference library isn't installed, the tab explains what's needed. The rest of the app works regardless.
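For context, loading a local GGUF model with llama-cpp-python looks roughly like this; the file name, context size, and sampling settings are assumptions, not the app's actual configuration:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/gemma-3-4b-it-q4_0.gguf",  # placeholder path
    n_ctx=4096,       # context window
    n_threads=8,      # CPU threads
    n_gpu_layers=0,   # >0 offloads layers when a CUDA/Metal build is available
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Which day of the week has the most no-shows?"}],
    max_tokens=256,
    temperature=0.2,
)
print(response["choices"][0]["message"]["content"])
```

Everything happens inside the local process; no request ever leaves the machine.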
Analytics Engine
Over 1,300 lines of metrics calculations covering:
- Booking behavior: Lead time distributions (same-day through 7+ days), temporal patterns by day and hour, day/time heatmaps
- Attendance: Completion, cancellation, and no-show rates — overall, by location (in-person vs. Zoom), by day of week, and by semester
- Student satisfaction: Pre/post confidence distributions, confidence change deltas, satisfaction ratings (1-7 scale), tutor rapport and session quality scores
- Tutor workload: Sessions per tutor, average session length, balance analysis
- Statistical outlier detection: IQR method with NaN preservation — outliers are removed from calculations but legitimate missing data is kept
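The IQR-with-NaN-preservation rule in the last bullet fits in a few lines of pandas; this is a simplified sketch, not the project's exact helper:

```python
import pandas as pd

def mask_outliers_iqr(series: pd.Series) -> pd.Series:
    """Mask IQR outliers as NaN while leaving genuine missing values untouched."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values outside the fences become NaN; pre-existing NaNs pass through unchanged.
    return series.where(series.between(lower, upper) | series.isna())
```

Downstream aggregations can then rely on pandas' default NaN-skipping, so outliers drop out of the calculations without shrinking the dataset.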
Tech Stack
| Layer | Technology | Role |
|---|---|---|
| UI | Streamlit | Web interface, file uploads, tab navigation |
| Data | pandas, NumPy, DuckDB | Cleaning, transformation, fast SQL-like queries |
| Visualization | Matplotlib, Seaborn, Altair | Charts for PDF reports and interactive display |
| Reporting | ReportLab | Multi-page PDF report generation with embedded charts |
| Privacy | cryptography, hashlib | Anonymization and encrypted codebook |
| AI | llama-cpp-python, Gemma 3 4B | Local LLM inference, no external API calls |
| Distribution | Embedded Python 3.11, batch scripts | Portable package, no installation required |
Architecture
The project follows a modular structure with clear separation of concerns:
```
app.py                          Entry point (Streamlit app, ~900 lines)
src/
  core/
    data_cleaner.py             10-step scheduled session cleaning pipeline
    walkin_cleaner.py           Walk-in session cleaning pipeline
    metrics.py                  Scheduled session analytics (1,384 lines)
    walkin_metrics.py           Walk-in specific metrics
    privacy.py                  PII detection, SHA256 anonymization, Fernet encryption
    location_metrics.py         Location-based analytics
  ai_chat/
    chat_handler.py             Orchestrates queries, validation, code execution
    llm_engine.py               Gemma model loading and inference wrapper
    safety_filters.py           Input validation + response filtering
    code_executor.py            Sandboxed pandas/DuckDB code execution
    query_engine.py             Natural language to SQL translation via DuckDB
    prompt_templates.py         System prompts built from data context
    setup_model.py              Model download and system requirements check
  visualizations/
    report_generator.py         Scheduled session PDF report (11 sections)
    walkin_report_generator.py  Walk-in session PDF report (8 sections)
    charts.py                   Reusable chart functions
  utils/
    academic_calendar.py        Semester detection from dates
```
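As an example of how the code_executor.py sandbox might expose the dataset to LLM-generated queries, a read-only DuckDB path could look like the following; the function and table names are illustrative, not the module's real API:

```python
import duckdb
import pandas as pd

def run_readonly_sql(df: pd.DataFrame, sql: str) -> pd.DataFrame:
    """Run a single SELECT statement against the in-memory dataset."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    con = duckdb.connect(database=":memory:")
    try:
        con.register("sessions", df)       # expose the cleaned DataFrame as a view
        return con.execute(sql).fetchdf()  # results come back as a pandas DataFrame
    finally:
        con.close()
```

Registering the DataFrame as a view keeps the query entirely in memory; no temporary files are written.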
Design Decisions
Local-only inference
Student data never leaves the machine. The LLM runs on CPU via llama-cpp-python rather than calling an external API, which eliminates FERPA concerns around data transmission.
Deterministic anonymization
SHA256 hashing produces the same anonymous ID for the same email across runs, so longitudinal analysis works without storing raw identifiers.
Graceful degradation
Every optional feature (AI chat, walk-in mode, GPU acceleration) disables itself cleanly if its dependencies are missing, rather than crashing. The app always works at its core: upload data, get a report.
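The pattern is ordinary feature-flagging at import time; a simplified Streamlit-flavored sketch with illustrative names:

```python
import streamlit as st

try:
    from llama_cpp import Llama  # noqa: F401
    LLM_AVAILABLE = True
except ImportError:
    LLM_AVAILABLE = False

def render_ai_tab() -> None:
    """Show the chat UI only when the optional dependency is present."""
    if not LLM_AVAILABLE:
        st.info("AI chat is disabled: llama-cpp-python is not installed. "
                "All analytics and reporting features still work.")
        return
    # ... normal chat UI ...
```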
Portable Python
The target user has no development tools and limited admin access on a university-managed machine. Embedding Python and all dependencies in the distribution folder avoids installation entirely.
Weighted safety scoring
Rather than a simple keyword blocklist, the AI input validator uses a scoring system (+2 for data-relevant terms, -3 for off-topic, -5 for harmful) so that a query mentioning both data terms and an incidental flagged word isn't falsely rejected.
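A stripped-down version of that scoring idea, with made-up term lists and weights standing in for the real 78 off-topic and 43 jailbreak patterns:

```python
import re

DATA_TERMS = [r"\bsession", r"\btutor", r"\bno[- ]?show", r"\bcancellation"]
OFF_TOPIC = [r"\brecipe\b", r"\bweather\b"]
HARMFUL = [r"ignore (all|previous) instructions"]

def score_query(text: str) -> int:
    """Net relevance score: data terms add, off-topic and harmful terms subtract."""
    text = text.lower()
    score = 2 * sum(bool(re.search(p, text)) for p in DATA_TERMS)
    score -= 3 * sum(bool(re.search(p, text)) for p in OFF_TOPIC)
    score -= 5 * sum(bool(re.search(p, text)) for p in HARMFUL)
    return score

def is_allowed(text: str, threshold: int = 0) -> bool:
    return score_query(text) >= threshold
```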
Distribution
The app ships as a portable folder (~600 MB zipped, no model) that runs on any Windows machine without Python installed:
```
WritingStudioAnalytics/
  python/       Embedded Python 3.11 + all dependencies
  src/          Source code
  models/       Empty — user downloads AI model here (~3 GB, optional)
  app.py        Main application
  launch.bat    Double-click to start
```
The AI model is hosted on S3 and can be downloaded from within the app with a single button click and a progress bar. The full analytics and reporting features work without it.
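A sketch of what the in-app download might look like with requests and Streamlit's progress bar; the URL, path, and chunk size are placeholders:

```python
import os

import requests
import streamlit as st

MODEL_URL = "https://example-bucket.s3.amazonaws.com/gemma-3-4b-it-q4_0.gguf"  # placeholder
DEST = "models/gemma-3-4b-it-q4_0.gguf"

def download_model() -> None:
    """Stream the model file to disk while updating a progress bar."""
    os.makedirs(os.path.dirname(DEST), exist_ok=True)
    resp = requests.get(MODEL_URL, stream=True, timeout=30)
    resp.raise_for_status()
    total = int(resp.headers.get("content-length", 0))
    progress, done = st.progress(0.0), 0
    with open(DEST, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            fh.write(chunk)
            done += len(chunk)
            if total:
                progress.progress(min(done / total, 1.0))
```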
Key Takeaways
This project demonstrates several critical technical capabilities:
- Privacy Engineering: Implementing robust PII detection, anonymization, and encryption to ensure FERPA compliance while maintaining data utility
- System Architecture: Designing a modular, maintainable codebase with clear separation between data processing, visualization, AI components, and utilities
- End-User Focus: Creating a solution that runs without installation or technical knowledge, addressing real user constraints and institutional environments
- Local AI Integration: Successfully integrating local LLM inference with safety systems, code execution capabilities, and graceful degradation patterns
- Analytics Depth: Building a comprehensive analytics engine with 1,300+ lines of metrics calculations covering diverse dimensions of writing center operations