In this blog post, Extracting Structured Data with OpenAI for Real-World Pipelines, we will turn unstructured content into trustworthy, structured JSON you can store, query, and automate against.

Whether you process invoices, support emails, resumes, or contracts, the goal is the same: capture key fields accurately and repeatably. We’ll start with a high-level view of how it works, then move into practical steps, robust prompting, and production-ready code patterns in JavaScript and Python.

What structured extraction means and why it matters

Structured extraction converts messy text (PDFs, emails, chat logs) into a predictable shape (think JSON). For example, from an invoice you might extract vendor name, invoice number, dates, totals, and line items. From a support ticket, you might extract customer, product, category, and severity.

Why it matters:

  • Search and analytics: Query fields directly instead of scraping text every time.
  • Automation: Trigger workflows when a field changes (e.g., auto-create a payment).
  • Data quality: Validate fields, enforce types, and catch anomalies early.

The technology behind it

OpenAI’s models are strong at reading context and following instructions. Two capabilities make structured extraction reliable:

  • Tool (function) calling: You define a function with a JSON Schema-like set of parameters. The model chooses to “call” that function by returning a JSON payload that conforms to your schema. This gives you typed, structured outputs.
  • JSON-only responses: You can instruct the model to return only JSON, making parsing straightforward. Pair this with validation and you get both flexibility and control.

In short, the model interprets the text, fills in your schema, and you validate the result. Deterministic structure, non-deterministic content—safely harnessed.

Design your schema first

Start by deciding exactly what you want to capture. Keep it minimal, typed, and explicit about unknowns.

Guidelines:

  • Use nullable fields when values may be missing.
  • Keep numbers as numbers, dates as dates (ISO 8601), and enums tight.
  • Avoid optional fields that you don’t need—the more focused, the better.
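
As a concrete starting point, here is a minimal invoice schema that follows these guidelines. The field names and the currency enum are illustrative; trim or extend them for your own documents.

```json
{
  "type": "object",
  "properties": {
    "vendor_name": { "type": ["string", "null"] },
    "invoice_number": { "type": ["string", "null"] },
    "invoice_date": { "type": ["string", "null"], "description": "ISO 8601 date, e.g. 2025-03-14" },
    "currency": { "type": ["string", "null"], "enum": ["AUD", "USD", "EUR", null] },
    "total": { "type": ["number", "null"] },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" }
        },
        "required": ["description", "quantity", "unit_price"]
      }
    }
  },
  "required": ["vendor_name", "invoice_number", "invoice_date", "currency", "total", "line_items"]
}
```

Nullable types make "not present" explicit, the enum keeps currencies tight, and amounts stay numeric so you can validate and aggregate them later.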

Prompting patterns that work

Great extraction is 50% schema, 50% instruction. A reliable pattern:

  • Tell the model what the document is and what you need.
  • State handling for unknown or conflicting information (return null, not guesses).
  • Insist on valid JSON only—no extra text.
  • Provide 1–2 short examples if your data is quirky.
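
As an illustration, an instruction block for the invoice case could read something like the following; adjust the wording (and add an example document) to match your own schema.

```text
You are extracting fields from a single supplier invoice.
Fill every field defined in the schema.
If a value is missing, ambiguous, or contradicted elsewhere in the document, return null. Do not guess.
Dates must be ISO 8601 (YYYY-MM-DD). Amounts must be plain numbers with no currency symbols.
Return only the structured output, with no extra text.
```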

Code example with tool calling (JavaScript)

This pattern uses Chat Completions with a function tool so the model returns typed arguments.
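
Here is a minimal sketch, assuming the official openai Node SDK (v4+), an OPENAI_API_KEY environment variable, and a trimmed version of the invoice schema from earlier. Adapt the tool name and fields to your own documents.

```javascript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Trimmed invoice schema; reuse the full schema you designed earlier.
const invoiceParameters = {
  type: "object",
  properties: {
    vendor_name: { type: ["string", "null"] },
    invoice_number: { type: ["string", "null"] },
    invoice_date: { type: ["string", "null"], description: "ISO 8601 (YYYY-MM-DD)" },
    total: { type: ["number", "null"] },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          quantity: { type: "number" },
          unit_price: { type: "number" },
        },
        required: ["description", "quantity", "unit_price"],
      },
    },
  },
  required: ["vendor_name", "invoice_number", "invoice_date", "total", "line_items"],
};

export async function extractInvoice(text) {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "Extract invoice fields. Return null for anything missing or ambiguous. Do not guess.",
      },
      { role: "user", content: text },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "record_invoice",
          description: "Record structured fields extracted from an invoice.",
          parameters: invoiceParameters,
        },
      },
    ],
    // Force the model to call our function so we always get structured arguments back.
    tool_choice: { type: "function", function: { name: "record_invoice" } },
  });

  const call = response.choices[0].message.tool_calls?.[0];
  if (!call) throw new Error("Model did not return a tool call");
  return JSON.parse(call.function.arguments); // typed fields, ready for validation
}
```

Forcing tool_choice to this single function keeps the output path predictable; if you leave it on auto, the model may answer with plain text instead.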

Code example with tool calling (Python)
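
The same pattern in Python, sketched with the official openai package (v1+) and the same trimmed schema:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Trimmed invoice schema; reuse the full schema you designed earlier.
INVOICE_PARAMETERS = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": ["string", "null"]},
        "invoice_number": {"type": ["string", "null"]},
        "invoice_date": {"type": ["string", "null"], "description": "ISO 8601 (YYYY-MM-DD)"},
        "total": {"type": ["number", "null"]},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description", "quantity", "unit_price"],
            },
        },
    },
    "required": ["vendor_name", "invoice_number", "invoice_date", "total", "line_items"],
}


def extract_invoice(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Extract invoice fields. Return null for anything missing or ambiguous. Do not guess.",
            },
            {"role": "user", "content": text},
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "record_invoice",
                    "description": "Record structured fields extracted from an invoice.",
                    "parameters": INVOICE_PARAMETERS,
                },
            }
        ],
        # Force the model to call our function so we always get structured arguments back.
        tool_choice={"type": "function", "function": {"name": "record_invoice"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)
```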

Validate and post-process

Always validate model output before using it. A schema validator helps you catch mistakes early.

JavaScript validation with Ajv
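
A sketch with Ajv, assuming the same JSON Schema you passed to the model as tool parameters (shortened here) and the extractInvoice helper above:

```javascript
import Ajv from "ajv";

// Shortened copy of the schema given to the model as tool parameters.
const invoiceSchema = {
  type: "object",
  properties: {
    vendor_name: { type: ["string", "null"] },
    invoice_number: { type: ["string", "null"] },
    total: { type: ["number", "null"] },
  },
  required: ["vendor_name", "invoice_number", "total"],
};

const ajv = new Ajv({ allErrors: true }); // collect every violation, not just the first
const validateInvoice = ajv.compile(invoiceSchema);

export function checkExtraction(data) {
  if (!validateInvoice(data)) {
    // Errors name the failing path and keyword, e.g. "/total must be number".
    throw new Error("Invalid extraction: " + ajv.errorsText(validateInvoice.errors));
  }
  return data;
}
```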

Python validation with jsonschema
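
The equivalent check with the jsonschema package, again assuming the same schema used for the tool parameters:

```python
from jsonschema import Draft202012Validator

# Shortened copy of the schema given to the model as tool parameters.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": ["string", "null"]},
        "invoice_number": {"type": ["string", "null"]},
        "total": {"type": ["number", "null"]},
    },
    "required": ["vendor_name", "invoice_number", "total"],
}

validator = Draft202012Validator(INVOICE_SCHEMA)


def check_extraction(data: dict) -> dict:
    errors = list(validator.iter_errors(data))  # collect every violation, not just the first
    if errors:
        details = "; ".join(f"{'/'.join(str(p) for p in e.path)}: {e.message}" for e in errors)
        raise ValueError(f"Invalid extraction: {details}")
    return data
```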

Post-processing tips:

  • Normalize dates to YYYY-MM-DD.
  • Round currency to two decimals; verify totals equal sum(line_items).
  • Use regexes to cross-check fields like invoice numbers or ABNs.
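
A small post-processing pass covering these checks might look like the sketch below. The INV- pattern is a made-up example; swap in whatever format your invoice numbers actually follow.

```javascript
export function postProcess(invoice) {
  // Normalize the date to YYYY-MM-DD (naive: assumes the model returned something Date can parse).
  if (invoice.invoice_date) {
    invoice.invoice_date = new Date(invoice.invoice_date).toISOString().slice(0, 10);
  }

  // Round currency amounts to two decimals.
  const round2 = (n) => Math.round(n * 100) / 100;
  if (invoice.total !== null) invoice.total = round2(invoice.total);

  // Cross-check: the stated total should match the sum of line items within a cent.
  const itemsTotal = round2(
    (invoice.line_items ?? []).reduce((sum, li) => sum + li.quantity * li.unit_price, 0)
  );
  if (invoice.total !== null && Math.abs(invoice.total - itemsTotal) > 0.01) {
    throw new Error(`Total ${invoice.total} does not match line items ${itemsTotal}`);
  }

  // Regex cross-check on formats (hypothetical INV-1234 pattern).
  if (invoice.invoice_number && !/^INV-\d+$/i.test(invoice.invoice_number)) {
    console.warn("Unexpected invoice number format:", invoice.invoice_number);
  }

  return invoice;
}
```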

Handling long or messy documents

  • Chunking: Split large docs by sections (headings, pages). Extract per chunk, then reconcile.
  • Priority zones: Use heuristics (e.g., sections near “Invoice”, “Total”) to bias extraction.
  • Vision/OCR: If you have images or scanned PDFs, run OCR first, or use a multimodal model that can read images and text.
  • Conflict resolution: If chunks disagree, prefer the most recent date or the chunk with higher confidence (e.g., presence of currency and totals together).
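
As a rough sketch of chunk-then-reconcile, assuming the extractInvoice helper from earlier (the splitting heuristic and the "most fields filled wins" rule are deliberately naive):

```javascript
async function extractLongDocument(text) {
  // Split on form feeds or large gaps as a rough stand-in for page/section breaks.
  const chunks = text.split(/\f|\n{3,}/).filter((c) => c.trim().length > 40);

  const results = [];
  for (const chunk of chunks) {
    results.push(await extractInvoice(chunk));
  }

  // Naive reconciliation: prefer the chunk that filled the most non-null fields.
  const filled = (r) => Object.values(r).filter((v) => v !== null).length;
  results.sort((a, b) => filled(b) - filled(a));
  return results[0] ?? null;
}
```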

Reliability patterns

  • Few-shot grounding: Include a minimal example of input → output to reduce ambiguity.
  • Null over guess: Encourage nulls when uncertain; better for data quality.
  • Retries with variation: On validation failure, retry with a nudge (e.g., “Total must equal sum of items”); a sketch of this loop follows the list.
  • Human-in-the-loop: Route edge cases to review; log diffs between model and human corrections to improve prompts.
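
Here is what a validate-and-retry loop can look like, assuming the extractInvoice and checkExtraction helpers sketched earlier:

```javascript
async function extractWithRetries(text, maxAttempts = 3) {
  let hint = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const input = hint ? `${hint}\n\n${text}` : text;
      return checkExtraction(await extractInvoice(input));
    } catch (err) {
      // Feed the validation failure back as a corrective nudge for the next attempt.
      hint = `A previous attempt failed validation: ${err.message}. Fix this and return only valid fields.`;
    }
  }
  return null; // give up and route to human review
}
```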

Cost and model choices

  • Start small: For extraction, lighter models like gpt-4o-mini are often sufficient and cost-effective.
  • Upgrade when needed: If you see frequent nulls or errors on complex docs, try a stronger model (e.g., gpt-4o).
  • Batching and concurrency: Process documents in parallel within rate limits, and retry rate-limited calls with exponential backoff and jitter (a sketch follows this list).
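
A generic backoff wrapper might look like this; it assumes the thrown error exposes an HTTP status as err.status, as the official Node SDK's API errors do:

```javascript
async function withBackoff(fn, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Only retry rate-limit errors, and only up to maxRetries times.
      if (attempt >= maxRetries || err?.status !== 429) throw err;
      const base = Math.min(30_000, 500 * 2 ** attempt);
      const delay = base * (0.5 + Math.random() / 2); // jitter: 50-100% of the base delay
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: const invoice = await withBackoff(() => extractInvoice(rawText));
```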

Security and governance

  • Redact PII you don’t need before sending to the model (a minimal regex sketch follows this list).
  • Log only what’s required; mask secrets in observability tools.
  • If data residency matters, consider a regional deployment option that meets your compliance needs (for Australian workloads, ensure your provider supports AU regions).
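
For the redaction point, a very rough regex pass is better than nothing, but treat it as a placeholder for a proper PII detection step:

```javascript
// Rough redaction before text leaves your system; regexes only catch obvious patterns.
function redactPii(text) {
  return text
    .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "[EMAIL]")
    .replace(/\+?\d[\d\s().-]{7,}\d/g, "[PHONE/ID]");
}
```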

Putting it all together

  1. Define a tight JSON schema.
  2. Write a clear, firm prompt (nulls over guesses, JSON only).
  3. Use tool calling to enforce structure.
  4. Validate outputs; add post-processing rules.
  5. Handle long docs with chunking and reconciliation.
  6. Monitor quality, add retries, and include human review for edge cases.
  7. Optimize cost and ensure security/compliance.

Conclusion

Structured extraction doesn’t have to be fragile. With a focused schema, crisp instructions, and OpenAI’s tool calling, you can turn unstructured text into reliable JSON and wire it into your operational systems. Start small, validate everything, and iterate toward the accuracy your business needs.

