DOCX, PDFs Were Not Built for AI. This New Open Standard Wants to Change That

The LF AI & Data Foundation has announced the formation of the DocLang Specification Working Group, kicking off a collaborative effort to build an open, AI-native document format standard.

The working group operates under the Joint Development Foundation's vendor-neutral governance model, ensuring that no single company controls the roadmap.

The founding members are IBM, NVIDIA, Red Hat, ABBYY, and HumanSignal. The spec documentation also credits Forgis as a founding member, though the announcement did not mention them.

DocLang is not the only piece in play here. By pairing the open document format specification with Docling, IBM's open source document processing toolkit that is also hosted under LF AI & Data, the initiative aims to build a more complete open source document AI stack under one roof.

Together, the two cover the full pipeline from document ingestion and parsing through standardized representation and downstream consumption by language models and agentic AI systems.

As for the specification itself, it is already at v0.6, is available under the Apache 2.0 License, and covers document structure and semantics, geometric layout, pagination, and complex components like tables, charts, formulas, and code blocks.

There's also native support for audio, image, and video content, and governance metadata like privacy flags and model training constraints are embedded directly in the document rather than stored in a separate file.
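
To make the idea of embedded governance metadata concrete, here is a minimal Python sketch. It does not use DocLang's actual syntax or schema: the `governance` key, the `contains_personal_data` and `allow_model_training` flags, and the `may_train_on` helper are all hypothetical names chosen for illustration. The point is only that when the flags live inside the document, they survive a save-and-load round trip along with the content.

```python
import json

# Hypothetical document: field names are illustrative, NOT DocLang's schema.
document = {
    "content": [{"type": "paragraph", "text": "Quarterly results improved."}],
    "governance": {
        "contains_personal_data": False,
        "allow_model_training": False,
    },
}


def may_train_on(doc):
    """Return True only if the document explicitly permits model training.

    A missing flag is treated as "not allowed", which is the cautious default.
    """
    return bool(doc.get("governance", {}).get("allow_model_training", False))


# Serialize and reload: the governance flags travel with the content,
# so there is no separate sidecar file to lose or forget.
restored = json.loads(json.dumps(document))
print(may_train_on(restored))
```

With a separate metadata file, the same check would depend on that file being copied along with the document at every step of a pipeline.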

Who is it for?

The primary target is enterprises running generative AI and agentic workflows on large document sets. Formats like PDF, DOCX, and JPEG were designed for human consumption, not machine interpretation.

When such files are fed into AI pipelines, their reading order gets mangled, tables flatten into plain text, and figures disappear entirely. As a result, document quality, not the model, becomes the bottleneck.
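
The table problem is easy to reproduce. The following Python sketch uses a made-up table and a deliberately naive `flatten` function (both are assumptions for illustration, not the behavior of any particular extraction tool) to show how an empty cell becomes ambiguous once the structure is gone.

```python
# A small table kept as structured data: a header row plus data rows.
table = {
    "header": ["Region", "Q1", "Q2"],
    "rows": [
        ["North", "120", "135"],
        ["South", "", "98"],  # Q1 is missing for South
    ],
}


def flatten(table):
    """Mimic a naive text extractor: join the non-empty cells with spaces."""
    lines = [table["header"]] + table["rows"]
    return "\n".join(" ".join(cell for cell in line if cell) for line in lines)


def lookup(table, row_label, column):
    """Answer a cell query using the preserved row and column structure."""
    col = table["header"].index(column)
    for row in table["rows"]:
        if row[0] == row_label:
            return row[col]
    raise KeyError(row_label)


flat = flatten(table)
print(flat)
# Region Q1 Q2
# North 120 135
# South 98

print(repr(lookup(table, "South", "Q1")))  # ''
print(repr(lookup(table, "South", "Q2")))  # '98'
```

In the flattened text, the line `South 98` no longer says whether 98 belongs to Q1 or Q2. The structured version still answers both questions correctly.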

DocLang is meant to fix that by giving pipelines a single, unambiguous representation where the same document always produces the same output regardless of which tool processed it.

It is also relevant to anyone building with LLMs and vision-language models on real-world content. Docling and ABBYY FineReader Engine already support DocLang output natively, so existing pipelines can adopt the standard without overhauling their tooling.

You can read the DocLang specification on GitHub.

Sourav Rudra
