Compart - Document- and Output-Management

Development and Technology

Convert High-volume Document Batches and AFP Data Streams to PDF/UA


Learn how it works automatically

Anyone who wants to make archived documents accessible faces the challenge of converting millions of documents and data streams to PDF/UA in batches. Compart offers a solution that automates this process, regardless of how old the archived documents are or what format they are in.

Below we explain how this process works.

1. Classify documents

Companies with a high volume of customer communication, such as insurance companies, banks or utility companies, have one thing in common: many of their existing documents have a static layout. An invoice, an account statement or an insurance policy usually follows a fixed content and graphic structure. Based on this principle, the documents can be classified and a separate set of rules can be developed for each document class.
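To illustrate the idea, here is a minimal sketch of rule-based classification. It assumes the text of the first page has already been extracted; the class names, keywords and the `classify` function are hypothetical and are not part of any Compart product.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DocumentClass:
    name: str
    keywords: Tuple[str, ...]  # every keyword must appear on the first page


# Hypothetical document classes; real rules would be derived from sample documents.
CLASSES = (
    DocumentClass("invoice", ("invoice number", "total amount")),
    DocumentClass("account_statement", ("account statement", "balance")),
    DocumentClass("insurance_policy", ("policy number", "insured")),
)


def classify(first_page_text: str, classes=CLASSES) -> Optional[str]:
    """Return the name of the first class whose keywords all occur, else None."""
    text = first_page_text.lower()
    for doc_class in classes:
        if all(keyword in text for keyword in doc_class.keywords):
            return doc_class.name
    return None
```

Documents that return `None` match no known class and would need a new set of rules or a manual review.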

2. Create a set of rules and save them in templates

Using a graphical user interface, a template designer works on a sample document to index the content, assign semantic tags to it and combine all elements into a coherent document structure tree with a logical reading order.

2.1 Indexing the content

  • Decorative and unimportant information in the document is marked as artifacts. Typical artifacts are:
      • Purely decorative elements such as background graphics or decorative lines
      • Print marks
      • Page numbers
  • Telephone numbers, e-mail addresses and links are identified
  • Images are recognized
  • Tables with header and line content are indexed
  • Logos, images or other graphic elements are marked

Screenshot shows: Classification of a table as a content element.
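The indexing rules above could be approximated in code roughly as follows. This is only a sketch under simplifying assumptions: the `Element` type, the role names and the regular expressions are hypothetical, and a real template would also use position and layout information.

```python
import re
from dataclasses import dataclass

# Simplified patterns, chosen for illustration only.
PAGE_NUMBER = re.compile(r"page\s+\d+(\s*(of|/)\s*\d+)?", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
URL = re.compile(r"https?://\S+")


@dataclass
class Element:
    kind: str                 # "text", "image" or "line"
    text: str = ""
    decorative: bool = False  # set by the designer for background graphics etc.


def index_element(element: Element) -> str:
    """Return a role for the element: "artifact", "image", "link" or "content"."""
    if element.kind == "line" or element.decorative:
        return "artifact"
    if element.kind == "image":
        return "image"
    text = element.text.strip()
    if PAGE_NUMBER.fullmatch(text):
        return "artifact"
    if EMAIL.fullmatch(text) or URL.fullmatch(text):
        return "link"
    return "content"
```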
 

2.2 Tagging the content

  • The indexed content elements are provided with semantically suitable tags (standard PDF 1.7 tags)
  • Fixed alternative texts are stored for images or links
  • For footers and headers you can create rules describing the content, so that it is tagged accordingly for every page
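As a rough sketch of the tagging step, the mapping below assigns standard structure types such as `P`, `H1`, `Table`, `Figure` and `Link` to indexed roles and attaches a fixed alternative text to images. The role names, the `alt_texts` lookup and the `tag_element` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical mapping from indexed roles to standard PDF structure types.
ROLE_TO_TAG = {
    "heading": "H1",
    "paragraph": "P",
    "table": "Table",
    "image": "Figure",
    "link": "Link",
}


@dataclass
class TaggedElement:
    tag: str
    alt_text: Optional[str] = None


def tag_element(role: str, element_id: str, alt_texts: Dict[str, str]) -> TaggedElement:
    """Tag one indexed element; images without alternative text are rejected."""
    tag = ROLE_TO_TAG[role]
    alt_text = alt_texts.get(element_id)
    if tag == "Figure" and not alt_text:
        raise ValueError(f"image {element_id!r} has no alternative text")
    return TaggedElement(tag, alt_text)
```

Rejecting an image without alternative text at template design time is a deliberate choice in this sketch: the gap is found once, in the template, rather than in every generated document.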
     

2.3 Create the reading order

  • All elements are automatically added to a document structure tree that specifies the reading order
  • The reading order can be adjusted and optimized via drag and drop
  • The reading order does not have to correspond to the semantics. PDF/UA validators ignore it, but it is important for a good-quality PDF/UA document
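In a tagged PDF, the logical reading order follows a depth-first walk through the structure tree. The sketch below models that with a hypothetical `Node` type; `move_child` stands in for what a drag-and-drop operation does to the tree.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    tag: str
    label: str = ""
    children: List["Node"] = field(default_factory=list)


def reading_order(node: Node) -> Iterator[str]:
    """Yield leaf labels in depth-first order, i.e. the logical reading order."""
    if not node.children:
        yield node.label
        return
    for child in node.children:
        yield from reading_order(child)


def move_child(parent: Node, old_index: int, new_index: int) -> None:
    """Reorder one child in place, as a drag-and-drop in a tree view would."""
    parent.children.insert(new_index, parent.children.pop(old_index))
```

Moving a footer paragraph from the first to the last position under the document root, for example, changes only the order in which assistive technology reads the content, not its visual placement on the page.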

Screenshot shows: "Tagged" content elements in the document structure tree. Optimization of the reading order via drag-and-drop.
 

3. Integrate PDF/UA into the document processes

All these rules are saved for each document class and exported as a template. An automatic workflow can then apply the template in batch to any number of documents of the same class. The result is legally compliant PDF/UA documents that can be processed further without additional conversion.
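Conceptually, such a batch run is a loop that looks up the template for each document's class. The sketch below shows only this control flow; `apply_template` is a placeholder callable for whatever conversion engine is used, and all names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_batch(
    documents: Iterable[Tuple[str, str]],        # (document path, document class)
    templates: Dict[str, str],                   # document class -> template path
    apply_template: Callable[[str, str], None],  # placeholder for the converter
) -> List[str]:
    """Apply the matching template to each document; return paths without one."""
    skipped = []
    for path, doc_class in documents:
        template = templates.get(doc_class)
        if template is None:
            skipped.append(path)  # no rules exist yet for this class
            continue
        apply_template(path, template)
    return skipped
```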

Established test procedures

Two tools have become established for checking the conformity of a PDF/UA document:

1. PDF Accessibility Checker (PAC) 2024

  • Checks according to the complete Matterhorn protocol and WCAG 2.1
  • Checks are carried out via a GUI
  • Includes a display of the tag tree and a screen reader preview
  • https://pac.pdf-accessibility.org/

2. veraPDF

  • Checks according to the complete Matterhorn protocol
  • Testing can be carried out either via a GUI application or via a command line application
  • Java library for integration into existing applications
  • https://verapdf.org/
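For automated checks, the veraPDF command line application can be called from a script. The sketch below assumes that the executable is on the PATH as `verapdf` and that it accepts the options `--flavour ua1` and `--format text`; both are assumptions that should be verified against `verapdf --help` for the installed version.

```python
import subprocess
from typing import List


def build_verapdf_command(pdf_path: str, executable: str = "verapdf") -> List[str]:
    """Build the argument list for a PDF/UA-1 check (option names are assumed)."""
    return [executable, "--flavour", "ua1", "--format", "text", pdf_path]


def run_verapdf(pdf_path: str) -> str:
    """Run veraPDF and return its report text without interpreting the result."""
    result = subprocess.run(
        build_verapdf_command(pdf_path),
        capture_output=True,
        text=True,
        check=False,  # do not raise on a non-zero exit code; inspect the report instead
    )
    return result.stdout
```

The report is returned as plain text on purpose: how exit codes and report fields map to pass or fail depends on the veraPDF version and should be taken from its documentation.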

PDF standard 1.7

PDF/UA is based on the PDF standard 1.7. If your archive still contains older files, for example PDF 1.1 documents or PDF/A-1 documents (which are based on the PDF standard 1.4), the first step is to upgrade them to a higher standard such as PDF/A-2 or PDF/A-3, both of which are based on PDF 1.7. Compart can also support you here.
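A quick way to find candidates for such an upgrade is to read the version from the file header, which has the form `%PDF-1.4`. The following is a minimal sketch with hypothetical function names. The header is only a first indication: the document catalog can declare a later version, and PDF/A conformance is recorded in the XMP metadata, not in the header.

```python
import re
from typing import Optional, Tuple

HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the PDF header, or None if no header is found."""
    match = HEADER.search(data[:1024])  # the header sits at or near the file start
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_upgrade(data: bytes) -> bool:
    """True if the header declares a version below 1.7."""
    version = header_version(data)
    if version is None:
        raise ValueError("no PDF header found")
    return version < (1, 7)
```

In practice only the first kilobyte of each archived file needs to be read for this check, for example with `open(path, "rb").read(1024)`.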