Read and Extract Metadata from Documents

Compart | July 20, 2020

What Does Traditional Document Processing Look Like Nowadays?

Typically, a composition platform first creates a template for the communication. Once the template is formed, it can be populated with data to create variable content. Sometimes this data is readily available; other times it needs to be converted from another application or mapped from various sources. Sometimes the conversion becomes quite complicated: scanning, then analysis and text recognition via optical character recognition (OCR), "de-formatting" the document, and finally extracting the raw data. Often, this cumbersome approach to analog conversion results in content consisting of “pixel clouds”. Not only is this process clumsy, it also results in the loss of semantic structural data that could be valuable for future use.

Without Metadata, Structural Information is Lost

Often, the problem with traditional document processing is that the approach is oriented to letter-sized page formats, which is fine for a print, fax or archive file, but not for mobile devices or the Web. It could be improved if only the raw data were transferred. In other words, document creation and delivery must occur outside of the specific application. The page size and output channel are then not selected in the application, but much later than is generally practiced today.

Is PDF Delivery Still Up to Date?

Of course the adoption of the now ubiquitous electronic PDF is an important step to shortening the cycle described above. But it is just a beginning. After all, what good is a PDF document if it has no metadata for multi-channel-capable processing? Technologies like XMP were, in fact, developed specifically for storing metadata in an electronic document for automatic read-out on the recipient side and transfer into the given application (ERP, CRM, etc.).

This certainly advances automation in document processing, but it is by no means the end of the road. For one thing, PDF is also page size based, which means tedious “de-formatting” for delivery on mobile end devices. The gain is marginal, considering that processes like de-formatting and decomposition are complex and usually require expensive tools.

Data Hub

So what does document processing look like in the future? Without a doubt, the most elegant method is to create an interface for the pure data, independent of page format, layout and channel. That is really the only way to efficiently prepare documents of all types and formats for digital and physical communication routes. For companies, this means separating document creation from delivery and setting up a central document and output management instance. This hub uses defined rules and criteria from the different departments (e.g., sales, marketing, service) to determine the data, layout, format and output channel, always tuned, of course, to the recipient.

Centralization not only benefits the processor, who is free to concentrate on his or her core business. It also provides a reliable overview of which documents left the company in a given time period. Other criteria can also be monitored, of course, an advantage not to be underestimated: many firms lack an accurate picture of just how much is printed, faxed, and sent electronically. What document management lacks is the 360-degree view.

Recipient and Operation Determine the Channel

Strictly speaking, multi-channel communication means breaking away from a specific page format so that every document can be output on any channel without expensive workarounds such as de-formatting.

After all, customers today communicate with companies via a number of channels. Mr. X, for example, still wants his insurance policy in hard copy, but would prefer his monthly debit notification as an e-mail attachment, or better yet, sent directly to his smartphone. In other words, a delivery medium is chosen for each and every business process. But that is possible only through central processing where all document-related communication pathways converge, particularly if adding new channels is straightforward.
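The Mr. X example can be sketched as a small lookup that a central hub might consult at delivery time. This is only an illustration: the class names, the channel labels and the "print" fallback are assumptions made for the sketch, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preference:
    """One recipient's preferred channel for one type of document."""
    recipient: str
    doc_type: str
    channel: str


class ChannelRouter:
    """Hypothetical central lookup: (recipient, document type) -> channel."""

    def __init__(self, preferences, default="print"):
        # Assumption: recipients without a stored preference get print.
        self._default = default
        self._table = {(p.recipient, p.doc_type): p.channel for p in preferences}

    def channel_for(self, recipient, doc_type):
        return self._table.get((recipient, doc_type), self._default)


router = ChannelRouter([
    Preference("Mr. X", "insurance_policy", "print"),
    Preference("Mr. X", "debit_notification", "email"),
])

print(router.channel_for("Mr. X", "debit_notification"))  # email
print(router.channel_for("Mrs. Y", "invoice"))            # print (fallback)
```

Because the channel is just a value in the table, adding a new channel in this sketch means adding new preference entries rather than changing the application that created the document.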

In this context, HTML5 has certainly paved the way toward modern document processing. The text-based markup language is already setting the tone with mobile platforms such as the iPhone, iPad and Android devices. And it’s no wonder: HTML5 content can be easily processed for any electronic output channel, be it a smartphone or a Web site. And if print is your preference, it's still an option. Conversion to PDF files is also possible.

HTML5 is currently the most intelligent format for the creation and display of documents, regardless of size or output channel. It allows dynamic, size-dependent display, e.g., from letter-sized to smartphone, conversion from any layout to text-oriented formats, extraction of individual data (including retrieval of invoice items) and building tables of contents and index lists.
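To illustrate the extraction of individual data from a text-based format, here is a minimal sketch that pulls invoice items out of an HTML table using only Python's standard library. The markup convention (rows marked with class="item") and the sample document are assumptions made for the example; real documents would need their own selection rules.

```python
from html.parser import HTMLParser


class InvoiceItemParser(HTMLParser):
    """Collects the cell texts of every <tr class="item"> row."""

    def __init__(self):
        super().__init__()
        self.items = []        # one tuple of cell texts per item row
        self._row = None       # cells of the row being read, or None
        self._in_cell = False  # True while inside a <td> of an item row

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and ("class", "item") in attrs:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.items.append(tuple(cell.strip() for cell in self._row))
            self._row = None


def extract_invoice_items(html):
    parser = InvoiceItemParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical sample document.
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>Amount</th></tr>
  <tr class="item"><td>Premium</td><td>120.00</td></tr>
  <tr class="item"><td>Fee</td><td>4.50</td></tr>
</table>
"""

print(extract_invoice_items(SAMPLE_HTML))
```

No OCR or de-formatting step is involved here: the structure is already present in the markup, so the data can be read directly.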

What Is Data, What Is a Document?

The fact is that in these multi-channel times, "painting" a letter-sized page using page composition tools is the wrong approach, because the target layout can be anything from 2 to 24 inches. Instead, companies need to invest in document logistics capable of taking data from a given application and preparing it specific to the recipient and output channel.

Output Management System

Solution: A central hub for all customer communication

What is needed is information technology that maps the entire document management cycle in a central system, and specifically for all applications that generate documents. Clearly defined rules for corporate design, output formats, and handling of metadata are stored based on business logic. This makes the question "what is data and what is a document?" even more important. The boundary is not always clear, but one thing is certain. The further downstream in the document logistics process the output channel is chosen and the more strictly the business process remains separate from document creation, the more flexible the company is.

Create and Read Metadata in Digitization and Processing Operations

XMP - Master of All Data

The digitization of document processing is evolving rapidly. Companies are sending, receiving and processing more and more invoices, damage reports, contracts, customer confirmations and other correspondence electronically. As a result, input and output management are beginning to converge. Automating as many processes as possible is especially important in this endeavor. Metadata, or the information about the document (type, creation date, sender, reference to other processes), drives those processes.

Metadata is hardly new, having become a fairly familiar element of everyday working life. In a typical input management scenario, documents are scanned, converted to text via optical character recognition (OCR), and then usually stored in an archive or data management system (DMS). At the same time, the metadata must be identified via barcodes, rules, and practical methods, then stored so that the documents are correctly assigned and categorized and can be retrieved at any time. A lot has to be checked manually afterwards, which is naturally quite tedious.

Paper vs. XML

Today there are two basic variants of document exchange: traditional paper or strictly electronic in an XML or EDIFACT format, which is not printable per se. Current developments such as ZUGFeRD combine the two extremes in a single format, which ultimately is best. But if individual PDF files are generated as virtual by-products from the traditional print data stream and sent to the recipient, a gap emerges. In spite of electronic transfer, there is usually too little metadata to automatically process the document. So how do we further automate non-standardized document exchange?

XMP: Bridge Between Physical And Electronic Exchange

This is where XMP enters the picture (see dossier). XMP defines how metadata is embedded, as an XML packet, not only in PDF documents -- surely the most frequent use case -- but also in PostScript, JPG, PNG, TIFF, HTML and AFP files. XMP packets have a major advantage: they contain a unique marker and are, where possible, always saved in plain text, so that even an application that fails to understand the specific data format can still extract the XMP metadata. As for the use of plain text, caution should prevail where confidential information is concerned.
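The plain-text marker makes a simple byte scan possible. The following is a minimal sketch of that idea: it searches a file's raw bytes for the xpacket header and trailer and returns whatever lies between them, without parsing the surrounding file format. It assumes the packet is stored uncompressed, unencrypted and in UTF-8; the function name is made up for the illustration.

```python
import re

# An XMP packet is wrapped in two processing instructions:
#   <?xpacket begin="..." id="W5M0MpCehiHzreSzNTczkc9d"?>  ...  <?xpacket end="w"?>
XPACKET_RE = re.compile(
    rb"<\?xpacket begin=[^>]*\?>(.*?)<\?xpacket end=[^>]*\?>",
    re.DOTALL,
)


def scan_xmp_packets(data: bytes) -> list[bytes]:
    """Return the body of every XMP packet found in the raw bytes."""
    return [match.group(1).strip() for match in XPACKET_RE.finditer(data)]


def scan_file(path: str) -> list[bytes]:
    """Convenience wrapper: scan a file on disk, whatever its format."""
    with open(path, "rb") as handle:
        return scan_xmp_packets(handle.read())
```

A scan like this is a fallback rather than a replacement for a format-aware reader: a file may contain more than one packet, and packets that are compressed or encrypted will not be found.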

XMP defines a set of core properties that can be used universally (title, creator, topic, date, unique identifier, language, description). XMP falls back on existing methods and standards to describe metadata (ontologies) such as Dublin Core, IPTC, and Exif (see dossier). XMP also allows the definition of individual attributes such as customer and policy number, document validity, invoice due date, and name/version of a document template.
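As a rough illustration of how core and individual properties sit side by side, the sketch below reads a small XMP packet with Python's standard XML parser. The sample packet, the "pol" namespace URI and its property names are hypothetical, invented for this example. The reader only handles properties written as child elements; the attribute shorthand and structured properties that XMP also allows are ignored.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",      # Dublin Core
    "pol": "http://example.com/ns/policy/1.0/",    # hypothetical custom schema
}
PREFIX = {uri: prefix for prefix, uri in NS.items()}

# Hypothetical sample packet body (without the xpacket wrapper).
SAMPLE_XMP = """\
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:pol="http://example.com/ns/policy/1.0/">
      <dc:title>
        <rdf:Alt><rdf:li xml:lang="x-default">Insurance policy</rdf:li></rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq><rdf:li>Example Insurance Ltd.</rdf:li></rdf:Seq>
      </dc:creator>
      <pol:CustomerNumber>C-1001</pol:CustomerNumber>
      <pol:PolicyNumber>P-2002</pol:PolicyNumber>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
"""


def qname(tag: str) -> str:
    """Turn '{namespace-uri}local' into 'prefix:local' where the URI is known."""
    uri, _, local = tag[1:].partition("}")
    return f"{PREFIX.get(uri, uri)}:{local}"


def read_properties(xmp: str) -> dict[str, list[str]]:
    """Collect each property's values; array items (rdf:li) become list entries."""
    props = {}
    root = ET.fromstring(xmp)
    for description in root.iter(f"{{{NS['rdf']}}}Description"):
        for prop in description:
            items = [li.text or "" for li in prop.iter(f"{{{NS['rdf']}}}li")]
            props[qname(prop.tag)] = items or [(prop.text or "").strip()]
    return props


for name, values in read_properties(SAMPLE_XMP).items():
    print(name, values)
```

With properties such as a customer or policy number available in this form, a receiving application could assign the document to the right process without OCR or manual keying.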

XMP Awareness Is Still Lacking

Currently XMP is used primarily in PDF/A. The ISO standard itself makes recommendations for defining and saving metadata. It mandates the XMP packet for PDF/A documents, for example, and recommends using a unique identifier. It also recommends carrying document origin information through the entire process, especially when conversions are done. In PDF/A files, all the individual properties must be described via an embedded schema. That could be a reason why XMP and the extensive use of metadata are still being neglected. Yet this step is far less complicated than it appears.

The topic is hardly new. After all, it's not as if we only just started thinking about how to automate the steps of document processing. What is new is the buzz being generated by the progressive convergence of input and output processes due to electronic exchange. Output management used to focus on producing and sending one's own documents efficiently and reliably. Now we're forced to consider the receiving side as well, and what data will make processing the document easier for the recipient. One important thing to remember is that every instance of media discontinuity results in a loss of data that has to be tediously restored downstream.

Data Quality Is Critical

Using the right data saves time and cost -- if not primarily for output management, then certainly for archiving and, of course, the recipient. Taking the long view, accurate metadata raises general awareness, ultimately benefitting one's own input management. The fact is that a minimum set of meaningful information can considerably simplify electronic processing. Against this backdrop, XMP is certainly an important step on the path to full automation on both the input and output sides. Of course, metadata can still be stored in documents without XMP, but its scope and processing quality are limited. In any case, there is currently no good alternative for PDF documents.

Background Knowledge

Metadata

Metadata or metainformation is structured data that contains information about characteristics of other data.

The data described by metadata are often larger data collections such as documents, books, databases or files. Information about the properties of a single object (for example, "person's name") is likewise referred to as its metadata.

Source: Wikipedia

XMP

XMP (Extensible Metadata Platform)

Standard for embedding metadata in digital files
Published by Adobe in 2001 and first integrated in Acrobat Reader 5
February 2012: Publication of core XMP specification as ISO standard 16684-1

Among other things, XMP defines:

the language of the document (one of the most important properties; especially important for the sight-impaired/reading aloud of the document via a screen reader in the correct language)
the creation date
author/company name (origin of the document)
keywords

RDF

RDF (Resource Description Framework)

RDF is a technical approach used to describe Web resources (object, position, person) and their relationships to one another. RDF was originally conceived by the World Wide Web Consortium (W3C) as the standard for defining metadata. In the meantime, RDF has become the fundamental component of the “Semantic Web.” RDF resembles classic conceptual modeling approaches such as UML class diagrams and the entity-relationship model.

Standardization was aimed at summarizing frequently used statements about an object into so-called ontologies that are identified by a namespace URI (Uniform Resource Identifier). This allows programs to display data logically to viewers.

Ontology

is a group of concepts used to define metadata such as title, author, topic, description, date, identifier, language, camera type (for photos/picture) and where taken
conventional ontologies are Dublin Core, IPTC, Exif
