
DOCX, PDFs Were Not Built for AI. This New Open Standard Wants to Change That

The spec looks to simplify how AI systems read and process documents under a vendor-neutral umbrella.

The LF AI & Data Foundation has announced the formation of the DocLang Specification Working Group, kicking off a collaborative effort to build an open, AI-native document format standard.

The working group operates under the Joint Development Foundation's vendor-neutral governance model, ensuring that no single company controls the roadmap.

The founding members are IBM, NVIDIA, Red Hat, ABBYY, and HumanSignal. The spec documentation also credits Forgis as a founding member, though the announcement does not mention them.

DocLang is not the only piece in play here. By pairing this open document format specification with Docling, IBM's open source document processing toolkit that is also hosted under LF AI & Data, the initiative aims to build a more complete open source document AI stack under one roof.

Together, the two cover the full pipeline from document ingestion and parsing through standardized representation and downstream consumption by language models and agentic AI systems.

As for the specification itself, it is already at v0.6, is available under the Apache 2.0 License, and covers document structure and semantics, geometric layout, pagination, and complex components like tables, charts, formulas, and code blocks.

There's also native support for audio, image, and video content, and governance metadata like privacy flags and model training constraints are embedded directly in the document rather than stored in a separate file.

Who is it for?

The primary target is enterprises running generative AI and agentic workflows on large document sets. Formats like PDF, DOCX, and JPEG were designed for human consumption, not machine interpretation.

When such files are fed into AI pipelines, their reading order gets mangled, tables flatten into plain text, and figures disappear entirely. As a result, document quality becomes the bottleneck, not the model itself.
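To see why flattened tables are a problem, here is a minimal Python sketch. It is illustrative only: the `table` dictionary and the `flatten` and `lookup` helpers are made up for this example and are not the DocLang schema or any tool's API.

```python
# Illustrative only: a generic structured table, NOT the DocLang format.
table = {
    "header": ["Quarter", "Revenue", "Cost"],
    "rows": [
        ["Q1", "120", "80"],
        ["Q2", "", "95"],  # the Revenue cell is empty for Q2
    ],
}


def flatten(table):
    """Mimic naive text extraction: join the non-empty cells of each row with spaces."""
    lines = [table["header"]] + table["rows"]
    return "\n".join(" ".join(cell for cell in line if cell) for line in lines)


def lookup(table, row_label, column):
    """With the structure intact, a cell is addressed by row label and column name."""
    col = table["header"].index(column)
    for row in table["rows"]:
        if row[0] == row_label:
            return row[col]
    raise KeyError(row_label)


flat_text = flatten(table)
print(flat_text)
# The flattened Q2 line is "Q2 95": nothing says whether 95 is Revenue or Cost.

print(repr(lookup(table, "Q2", "Cost")))     # the structured lookup finds the Cost cell
print(repr(lookup(table, "Q2", "Revenue")))  # and reports Revenue as empty, not as 95
```

Once the empty cell is dropped during flattening, a model reading the text has to guess which column the remaining number belongs to. A structured representation keeps that information explicit.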

DocLang is meant to fix that by giving pipelines a single, unambiguous representation where the same document always produces the same output regardless of which tool processed it.

It is also relevant to anyone building with LLMs and vision-language models on real-world content. Docling and ABBYY FineReader Engine already support DocLang output natively, so existing pipelines can adopt the standard without overhauling their tooling.

You can go through the specification for DocLang on GitHub.


About the author
Sourav Rudra


A nerd with a passion for open source software, custom PC builds, motorsports, and exploring the endless possibilities of this world.
