Create and Read Metadata
in Digitization and Processing Operations
XMP - Master of All Data
The digitization of document processing is evolving rapidly. Companies are sending, receiving and processing more and more invoices, damage reports, contracts, customer confirmations and other correspondence electronically. As a result, input and output management are beginning to converge. Automating as many processes as possible is especially important in this endeavor. Metadata, or the information about the document (type, creation date, sender, reference to other processes), drives those processes.
Metadata is hardly new; it has long been a familiar part of everyday work. In a typical input management scenario, documents are scanned, converted to text via optical character recognition (OCR), and then usually stored in an archive or document management system (DMS). At the same time, the metadata must be identified via barcodes, rules, and practical methods, then stored so that the documents are correctly assigned and categorized and can be retrieved at any time. A lot has to be checked manually afterwards, which is naturally quite tedious.
Paper vs. XML
Today there are two basic variants of document exchange: traditional paper, or strictly electronic in an XML or EDIFACT format, which is not printable per se. Current developments such as ZUGFeRD combine the two extremes in a single format, giving the recipient both a readable document and machine-processable data. But if individual PDF files are generated as virtual by-products from the traditional print data stream and sent to the recipient, a gap emerges. In spite of electronic transfer, there is usually too little metadata to process the document automatically. So how do we further automate non-standardized document exchange?
XMP: Bridge Between Physical And Electronic Exchange
This is where XMP enters the picture (see dossier). XMP defines the guidelines for embedding metadata as an XML packet not only in PDF documents -- surely the most frequent use case -- but also in PostScript, JPG, PNG, TIFF, HTML and AFP files. XMP packets have a major advantage: they contain a unique marker and are, where possible, always saved in plain text, so that even an application that does not understand the specific data format can still extract the XMP metadata. Because the packet is stored in plain text, caution is advised where confidential information is concerned.
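The "unique marker" idea can be illustrated with a minimal Python sketch that scans raw bytes for an XMP packet without interpreting the surrounding file format. The function name `find_xmp_packet` is made up for this illustration, and the sketch assumes the packet is stored uncompressed and in UTF-8; packets in other encodings or inside compressed streams would not be found this way.

```python
import re
from typing import Optional

# Fixed identifier carried in the xpacket header of an XMP packet.
XMP_PACKET_ID = b"W5M0MpCehiHzreSzNTczkc9d"

# Header processing instruction, packet body, trailer processing instruction.
_PACKET_RE = re.compile(
    rb"<\?xpacket\s+begin=[^>]*?id=[\"']" + XMP_PACKET_ID + rb"[\"'][^>]*?\?>"
    rb"(.*?)"
    rb"<\?xpacket\s+end=[\"'][rw][\"']\s*\?>",
    re.DOTALL,
)


def find_xmp_packet(data: bytes) -> Optional[bytes]:
    """Return the body of the first XMP packet found in data, or None.

    Hypothetical helper: it only looks for the plain-text marker and
    knows nothing about PDF, TIFF, or any other container format.
    """
    match = _PACKET_RE.search(data)
    return match.group(1).strip() if match else None
```

Calling `find_xmp_packet(open(path, "rb").read())` on a file would return the RDF/XML body as bytes if a matching packet is present, which a regular XML parser can then process.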
XMP defines a set of core properties that can be used universally (title, creator, topic, date, unique identifier, language, description). XMP falls back on existing methods and standards to describe metadata (ontologies) such as Dublin Core, IPTC, and Exif (see dossier). XMP also allows the definition of individual attributes such as customer and policy number, document validity, invoice due date, and name/version of a document template.
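To make the mix of core and individual properties concrete, here is a minimal Python sketch that builds an XMP packet using only the standard library. The Dublin Core and XMP basic namespaces are the real ones; the `inv` namespace URI, the property names `CustomerNumber` and `DueDate`, and the function `build_xmp` are assumptions invented for this illustration, not part of any standard.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    # Hypothetical company-specific namespace for individual attributes.
    "inv": "http://example.com/ns/invoice/1.0/",
}
for _prefix, _uri in NS.items():
    ET.register_namespace(_prefix, _uri)

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(prefix: str, local: str) -> str:
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{NS[prefix]}}}{local}"


def build_xmp(title: str, create_date: str, customer_no: str, due_date: str) -> str:
    """Return an XMP packet (header, RDF/XML body, trailer) as a string."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # Core property: dc:title is a language alternative (rdf:Alt of rdf:li).
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # Core property from the XMP basic namespace.
    ET.SubElement(desc, q("xmp", "CreateDate")).text = create_date

    # Individual (custom) properties in the hypothetical namespace.
    ET.SubElement(desc, q("inv", "CustomerNumber")).text = customer_no
    ET.SubElement(desc, q("inv", "DueDate")).text = due_date

    body = ET.tostring(xmpmeta, encoding="unicode")
    return (
        '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        f"{body}\n"
        '<?xpacket end="w"?>'
    )
```

For example, `build_xmp("Invoice 4711", "2024-01-31", "C-1001", "2024-02-29")` (made-up values) yields a packet that a recipient could parse to route the invoice by customer number and due date. Embedding the packet into a PDF or image file is a separate step that depends on the tooling in use and is not shown here.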
XMP Awareness Is Still Lacking
Currently XMP is used primarily in PDF/A. The ISO standard itself makes recommendations for defining and saving metadata. It mandates the XMP packet for PDF/A documents, for example, and recommends using a unique identifier. It also recommends carrying document origin information through the entire process, especially when conversions are done. In PDF/A files, all the individual properties must be described via an embedded schema. That could be a reason why XMP and the extensive use of metadata are still being neglected. Yet this step is far less complicated than it appears.
The topic is hardly new. After all, it's not as if we only just started thinking about how to automate the steps of document processing. What is new is the buzz being generated by the progressive convergence of input and output processes due to electronic exchange. Output management used to focus on producing and sending one's own documents efficiently and reliably. Now we're forced to consider, already on the output side, what data will make processing the document easier for the recipient. One important thing to remember is that every instance of media discontinuity results in a loss of data that has to be tediously restored downstream.
Data Quality Is Critical
Using the right data saves time and cost -- perhaps not primarily for output management, but certainly for archiving and, of course, for the recipient. In the long run, accurate metadata also raises general awareness, ultimately benefitting one's own input management. The fact is that a minimum set of meaningful information can considerably simplify electronic processing. Against this backdrop, XMP is certainly an important step on the path to full automation on both the input and output sides. Of course, metadata can still be stored in documents without XMP, but its scope and processing quality are then limited. In any case, there is currently no good alternative for PDF documents.