XMP - Master of All Data
The digitization of document processing is proceeding apace. Companies are sending, receiving and processing more and more invoices, damage reports, contracts, customer confirmations and other correspondence electronically. As a result, input and output management are beginning to converge. Automating as many processes as possible is especially important in this endeavor. Metadata, or the information about the document (type, creation date, sender, reference to other processes), drives those processes.
Metadata is hardly new, having become a fairly familiar element in the workaday world. In a typical input management scenario, documents are scanned, converted to text via optical character recognition (OCR), and then usually stored in an archive or document management system (DMS). At the same time, the metadata must be identified via barcodes, rules, and heuristics and then stored so that the documents can be correctly assigned, categorized, and retrieved at any time. Much of this has to be checked manually afterwards, which is naturally quite costly.
Paper versus XML
Today there are two basic variants of document exchange: traditional paper or strictly electronic in an XML or EDIFACT format, which is not printable per se. Current developments such as ZUGFeRD combine the two extremes in a single format, which is ultimately the best approach. But if individual PDF files are generated as virtual by-products of the traditional print data stream and sent to the recipient, a gap emerges. Although the transfer is electronic, there is usually too little metadata to process the document automatically. So how do we further automate non-standardized document exchange?
XMP: Bridge between physical and electronic exchange
This is where XMP enters the picture (see dossier). XMP defines the guidelines for embedding metadata as an XML packet, not only in PDF documents -- surely the most frequent use case -- but also in PostScript, JPG, PNG, TIFF, HTML and AFP files. XMP packets have a major advantage: they contain a unique marker and are, where possible, always saved in plain text, so that even an application that fails to understand the specific data format can still extract the XMP metadata. Because the packet is stored in plain text, caution should prevail where confidential information is concerned.
Example of an XMP packet:
<?xpacket begin="﻿" id="W5M0MpCehiHzreSzNTczkc9d"?>
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
<rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about="">
<rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="">
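Because the packet carries a fixed marker and is stored as plain text, it can be located by scanning the raw bytes of a file without parsing the host format. The following is a minimal sketch of that idea, not a complete XMP reader. It assumes the packet is stored uncompressed and UTF-8 encoded, and that it ends with the `<?xpacket end="w"?>` (or `end="r"`) trailer described in the XMP specification; the function name `extract_xmp_packet` is made up for illustration.

```python
import re
from typing import Optional

# Header and trailer are XML processing instructions. The id value is the
# fixed marker shown in the example packet above. The trailer pattern is an
# assumption based on the XMP specification; it does not appear in the excerpt.
_PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?id=[\"']W5M0MpCehiHzreSzNTczkc9d[\"'].*?\?>"
    rb"(.*?)"
    rb"<\?xpacket end=[\"'][rw][\"']\?>",
    re.DOTALL,
)


def extract_xmp_packet(data: bytes) -> Optional[str]:
    """Return the payload of the first XMP packet in raw file bytes, or None."""
    match = _PACKET_RE.search(data)
    if match is None:
        return None
    # Assumes UTF-8; undecodable bytes are replaced rather than raising.
    return match.group(1).decode("utf-8", errors="replace").strip()


# Usage: the surrounding bytes stand in for an arbitrary binary host format.
sample = (
    b"%PDF-1.7 binary\x00\x01"
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF/></x:xmpmeta>\n'
    b'<?xpacket end="w"?>'
    b"\x00trailer"
)
payload = extract_xmp_packet(sample)
```

The same scan would not find a packet that sits inside a compressed or encrypted stream, which is one reason the packet is kept in plain text where possible.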
XMP defines a set of core properties that can be used universally (title, creator, topic, date, unique identifier, language, description). To describe metadata, XMP draws on existing methods and standards (ontologies) such as Dublin Core, IPTC, and Exif (see dossier). XMP also allows the definition of individual attributes such as customer and policy number, document validity, invoice due date, and name/version of a document template.
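How core properties and individual attributes can sit side by side is easiest to see in a small example. The sketch below builds an RDF description with Python's standard `xml.etree.ElementTree` module. The namespace `http://example.com/ns/docmeta/1.0/`, the prefix `doc`, the property names `CustomerNumber` and `InvoiceDueDate`, and all values are hypothetical. Only simple text-valued properties are shown; some Dublin Core properties (title and creator, for example) are stored in XMP as array containers and are left out here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
XMP_NS = "http://ns.adobe.com/xap/1.0/"
# Hypothetical company namespace for individual attributes (an assumption).
CUSTOM_NS = "http://example.com/ns/docmeta/1.0/"

# Register readable prefixes so the output uses rdf:, dc:, xmp: and doc:.
for prefix, uri in (("rdf", RDF_NS), ("dc", DC_NS), ("xmp", XMP_NS), ("doc", CUSTOM_NS)):
    ET.register_namespace(prefix, uri)


def build_rdf(properties: dict) -> str:
    """Serialize {(namespace, name): value} as one rdf:Description."""
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""})
    for (namespace, name), value in properties.items():
        ET.SubElement(desc, f"{{{namespace}}}{name}").text = value
    return ET.tostring(rdf, encoding="unicode")


# Sample values are invented for illustration.
xml_text = build_rdf({
    (DC_NS, "identifier"): "INV-2024-0001",
    (XMP_NS, "CreateDate"): "2024-03-01T09:30:00+01:00",
    (CUSTOM_NS, "CustomerNumber"): "4711",
    (CUSTOM_NS, "InvoiceDueDate"): "2024-03-31",
})
```

A recipient that knows the custom namespace can read the customer number and due date directly; one that does not can still fall back on the core properties.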
XMP awareness is still lacking
Currently XMP is used primarily in PDF/A. The ISO standard itself makes recommendations for defining and saving metadata. It mandates the XMP packet for PDF/A documents, for example, and recommends using a unique identifier. It also recommends carrying document origin information through the entire process, especially when conversions are done. In PDF/A files, all the individual properties must be described via an embedded schema. That could be a reason why XMP and the extensive use of metadata are still being neglected. Yet this step is far less complicated than it appears.
The topic is hardly new. After all, it's not as if we only just started thinking about how to automate the steps of document processing. What is new is the buzz being generated by the progressive convergence of input and output processes due to electronic exchange. Output management used to focus on producing and sending one's own documents efficiently and reliably. Now, on the output side, we are also forced to consider what data will make processing the document easier for the recipient. One important thing to remember is that every instance of media discontinuity results in a loss of data that has to be tediously restored downstream.
Data quality is critical
Using the right data saves time and cost -- perhaps not primarily for output management, but certainly for archiving and, of course, the recipient. Taking the long view, accurate metadata also raises general awareness, ultimately benefitting one's own input management. The fact is that a minimum set of meaningful information can considerably simplify electronic processing. Against this backdrop, XMP is certainly an important step on the path to full automation on both the input and output sides. Of course, metadata can still be stored in documents without XMP, but its scope and processing quality are limited. In any case, there is currently no good alternative for PDF documents.
XMP (Extensible Metadata Platform):
- Standard for embedding metadata in digital files
- Published by Adobe in 2001 and first integrated in Acrobat Reader 5
- February 2012: Publication of core XMP specification as ISO standard 16684-1
XMP is based on open standards and embeds metadata, expressed in the formal RDF (Resource Description Framework) published by the World Wide Web Consortium, in binary data. Metadata is integrated in different applications according to a uniform schema, thus allowing other programs to read it. The format is supported by all Adobe products, software from other manufacturers, and suppliers of editing systems.
Among other things, XMP defines:
- the language of the document (one of the most important properties; especially important for the sight-impaired/reading aloud of the document via a screen reader in the correct language)
- the creation date
- author/company name (origin of the document)
RDF (Resource Description Framework)
RDF is a technical approach used to describe Web resources (object, position, person) and their relationships to one another. RDF was originally conceived by the World Wide Web Consortium (W3C) as the standard for defining metadata. In the meantime, RDF has become the fundamental component of the “Semantic Web.” RDF resembles classic conceptual modeling approaches such as UML class diagrams and the entity-relationship model.
Standardization was aimed at grouping frequently used statements about an object into so-called ontologies that are identified by a namespace URI (Uniform Resource Identifier). This allows programs to display data logically to viewers.
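An RDF statement can be read as a triple of resource, property, and value, with the property identified by its namespace URI plus a local name. The sketch below illustrates this reading with Python's standard `xml.etree.ElementTree` module. It handles only simple text-valued properties inside `rdf:Description` elements and is not a full RDF parser; the subject `urn:example:invoice-1` and the sample values are invented for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def iter_triples(rdf_xml: str):
    """Yield (subject, predicate, object) for simple text-valued properties."""
    root = ET.fromstring(rdf_xml)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about", "")
        for prop in desc:
            if not prop.tag.startswith("{"):
                continue  # skip elements without a namespace
            # ElementTree stores qualified names as "{namespace-uri}local";
            # joining both parts gives the full predicate URI.
            namespace, _, local = prop.tag[1:].partition("}")
            yield subject, namespace + local, (prop.text or "").strip()


# Sample data: two Dublin Core statements about one hypothetical document.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="urn:example:invoice-1">
    <dc:identifier>INV-2024-0001</dc:identifier>
    <dc:format>application/pdf</dc:format>
  </rdf:Description>
</rdf:RDF>"""

triples = list(iter_triples(SAMPLE))
```

Because each predicate is a full URI, two programs that agree on the Dublin Core namespace interpret `identifier` and `format` the same way, regardless of which prefix the file happens to use.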
Ontology
- A group of concepts used to define metadata such as title, author, topic, description, date, identifier, language, camera type (for photos/pictures) and where taken
- Commonly used ontologies are Dublin Core, IPTC, and Exif
ZUGFeRD
- Uniform format for electronic invoices developed by the German Forum for Electronic Invoicing (FeRD)
- Combination of the visual representation of a document and its raw data in a single PDF/A-3 file to avoid manual interventions in the automatic processing chain