
“Enterprises today work across a fragmented landscape of document formats, including PDFs, JPEGs, and other file types built primarily for human consumption rather than AI interpretation,” the group said in its launch announcement.
“This disconnect can introduce complexity, raise costs, and reduce reliability when extracting meaning from business documents,” as organizations increasingly rely on generative AI and agentic systems, it said.
Mark Collier, executive director of LF AI & Data, said the goal of the DocLang Specification Working Group is to “develop a vendor-neutral, interoperable standard that helps organizations prepare document data for AI more reliably, transparently, and at scale.”
DocLang defines a structured, machine-readable format for documents of any type, like JSON for data, that any tool can implement and any pipeline can consume. It builds on DocLing, a document processing toolkit hosted by LF AI & Data that can transform human-readable PDFs, word processor documents or spreadsheets into structured data.

