Data: Eyes and Ears of the AI
The next stage of digitization in customer communication is in full swing - and means the extensive automation of all document-relevant processes. The basis for this is structured, consistent and centrally available data.
The following article highlights the different facets of professional information handling in document processing.
The global volume of data continues to grow strongly. Above all, unstructured data in the form of photos, audio files and videos as well as presentations and text documents will grow disproportionately - according to the market research institute IDC by an average of 62 percent annually. By 2022, this data type is expected to account for around 93 percent of the total volume.¹
According to a Gartner definition, unstructured data includes "all content that does not correspond to a specific, predefined data model. It's usually human-generated and person-related content that doesn't fit well into databases." Yet such data often contains valuable customer and behavioural information whose evaluation can be important for well-founded decisions.
In addition, in-depth analysis of unstructured data forms the basis for better and expanded services, which can even lead to completely new business models. IDC expects companies that analyze all relevant data by 2020 to achieve a productivity gain of $430 billion over less analytically oriented competitors.²
Currently, companies are still looking for truly efficient solutions to convert unstructured data into structured data. They face a number of challenges, ranging from the question of geographic location, the type of data storage and governance, to securing and analyzing this information in local and cloud environments. So it is hardly surprising that the MIT Sloan Group classifies 80 percent of all data as untrustworthy, inaccessible or not analyzable. IDC estimates that by 2020 the "digital universe" will contain up to 37 percent of information that could be valuable if analyzed.³
Digitization Means Automation
One thing is certain: Structured and analyzable data are the basic prerequisite for the next stage of digitization in customer communication. This refers to the extensive automation and standardization of processes, so that "human intervention" is less and less necessary ("dark processing"). Routine tasks such as service invoicing, confirmation of address and tariff changes or appointment agreements are already taken over by software solutions, language assistants and chatbots based on AI algorithms (self-learning systems).
What's more, even content with a high creative share, such as technical essays, will sooner or later be generated by AI systems. Programs already exist that can produce simple Wikipedia articles with simple syntax and grammar: you define certain reference points (structure, keywords) - for a text about a city, for example, the number of inhabitants, year of foundation, town twinnings and geographical data - and the system retrieves the necessary data from Wikidata, fills in the corresponding stored text modules, which follow a simple grammar (subject - predicate - object), and merges everything into a finished text.
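The principle behind such generators can be sketched in a few lines. The fact dictionary and text modules below are invented for illustration; a real system would pull the values from Wikidata rather than from a hard-coded dictionary.

```python
# Minimal sketch of template-based text generation from structured data.
# All facts here are hard-coded sample values, not a live Wikidata query.
city_facts = {
    "name": "Mainz",
    "inhabitants": 218_578,
    "founded": -13,          # 13 BC, simplified as a negative year
    "twin_towns": ["Dijon", "Watford", "Zagreb"],
}

def render_city_article(facts: dict) -> str:
    """Fill fixed subject-predicate-object text modules with data."""
    sentences = [
        f"{facts['name']} has {facts['inhabitants']:,} inhabitants.",
        f"The city was founded in the year {abs(facts['founded'])} "
        + ("BC." if facts["founded"] < 0 else "AD."),
        f"{facts['name']} maintains town twinnings with "
        + ", ".join(facts["twin_towns"]) + ".",
    ]
    return " ".join(sentences)

print(render_city_article(city_facts))
```

The "intelligence" in such a pipeline lies less in the sentence templates than in selecting and retrieving the right structured facts.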
Many still remember the appearance of Google CEO Sundar Pichai at the I/O developer conference in May last year, when he introduced the language assistant "Duplex": the chatbot is able to make telephone calls independently without the called person noticing that they are dealing with an "artificial intelligence".⁴
With other processes, on the other hand - such as the cancellation of an insurance policy or the release of an invoice for more than 50,000 euros - it is certain, partly due to regulatory requirements, that a clerk will continue to be involved for the time being. But it is only a matter of time before such sensitive areas are also automated. The more reliable the systems become, the higher the threshold for automated processing can ultimately be set. However, this requires correct handling of the data.
Harald Grumser, founder and CEO of Compart AG, puts it in a nutshell: "Digital processes need access to the content of documents, and artificial intelligence also needs eyes and ears. It is therefore becoming increasingly important to obtain the data required for automation right from the start, to provide it with a structure and to store it correctly."
Documents Are the Human-readable Representation of Data
This applies in particular to document and output management as the interface between classic (paper-based) and electronic communication. Typically, digital data is converted into analog data on the output side (e.g. when printing, but also when transforming text content into audio files ("text-to-speech")). In the inbox (input management), exactly the opposite happens: analog data is converted into electronic documents (e.g. when scanning, but also when converting audio/video files into readable content) - albeit not necessarily in high quality.
The challenge now is to transform the information and data generated in all areas of inbound and outbound communication into a structured form and store it in the right "data pots" so that it is available for all processes of document and output management - from the capture of incoming messages (input management) to the creation and processing of documents and their output. It is irrelevant on which digital or analog medium a document is sent or displayed: it is always about the data, because a document is ultimately only its representation in a form readable by humans - whereby a distinction must be made here between non-coded and coded documents.
In this context, two major trends should be mentioned, which are becoming more and more important and have largely displaced other developments:
- XML (Extensible Markup Language) as a markup language for complex, hierarchical data, and
- JSON (JavaScript Object Notation) as a lightweight format for exchanging structured data (see also the "Glossary").
Both technologies have proven themselves for the description and definition of structured data and will certainly play an even greater role.
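To make the contrast concrete, the following sketch renders the same customer record once as JSON and once as XML using Python's standard library; the record and its field names are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same customer record expressed in both formats (illustrative data).
record = {"customer_id": "4711", "name": "Erika Mustermann", "tariff": "basic"}

# JSON: a direct serialization of the key-value structure.
as_json = json.dumps(record, indent=2)

# XML: the same data as a small element hierarchy.
root = ET.Element("customer", {"id": record["customer_id"]})
ET.SubElement(root, "name").text = record["name"]
ET.SubElement(root, "tariff").text = record["tariff"]
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
```

Both outputs carry identical information; which format fits better depends on whether deep hierarchy and validation (XML) or lightweight exchange (JSON) is the priority.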
Data Must Be Checked, Transferred and Stored Correctly
To ensure that structured data is actually available for automated processing, it is important that it is stored correctly. Here, non-relational databases such as NoSQL systems (including the subcategories graph database and RDF store) now offer new possibilities. Their great advantage over relational databases is that they can manage data even in very complex contexts and thus enable very specific queries (see also the "Glossary"). One of the best-known applications is Wikidata, the knowledge database of the online encyclopedia Wikipedia, in which tens of millions of facts are now stored. If, for example, you want to know how many Bundesliga players who were born in Berlin are married to Egyptian women, you will certainly find what you are looking for here. Certainly a very unusual example, but one that makes the significance of the subject clear. The aim is to gain new connections and knowledge from structured data by means of algorithms (ontologies). This is where artificial intelligence (AI) comes into play, which can then be used to formulate complex queries (see the "Glossary").
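The kind of query described above can be illustrated with a toy triple store: facts are held as subject-predicate-object tuples, and the query chains several graph hops. All names and facts below are invented; a real system would run a query language such as SPARQL against Wikidata.

```python
# A toy triple store: facts as (subject, predicate, object) tuples.
# Names and relationships are invented purely for illustration.
triples = {
    ("player_a", "plays_in", "Bundesliga"),
    ("player_a", "born_in", "Berlin"),
    ("player_a", "married_to", "spouse_1"),
    ("spouse_1", "citizen_of", "Egypt"),
    ("player_b", "plays_in", "Bundesliga"),
    ("player_b", "born_in", "Munich"),
}

def objects(subject, predicate):
    """All objects linked to a subject via a given predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# "Bundesliga players born in Berlin whose spouse is an Egyptian citizen"
matches = [
    s for s, p, o in triples
    if p == "plays_in" and o == "Bundesliga"
    and "Berlin" in objects(s, "born_in")
    and any("Egypt" in objects(sp, "citizen_of")
            for sp in objects(s, "married_to"))
]
print(matches)  # → ['player_a']
```

The query is essentially a chain of graph traversals, which is exactly what graph and RDF databases are optimized for.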
A further important topic in this context: stored, structured data must also be checked - something that is often neglected today. The XML schema, for example, is a proven method of guaranteeing the correctness and completeness of an XML file. Errors caused by unchecked data can be very serious.
Consistent data verification is therefore essential. Last but not least, it must also be possible to convert data from one format into another by means of rules. There are many possibilities for this today; one of the best known is certainly the programming language XSLT (see also the "Glossary"). But other sets of rules exist as well.
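A minimal sketch of both steps - checking incoming data and converting it by rule - using only Python's standard library. A production system would validate against an XML Schema and transform with an XSLT stylesheet; the element names here are invented.

```python
import xml.etree.ElementTree as ET

raw = "<invoice><number>2023-001</number><amount>149.90</amount></invoice>"

# Step 1: check the data. A real system would validate against an XML
# Schema; this sketch only verifies well-formedness and required fields.
doc = ET.fromstring(raw)
for required in ("number", "amount"):
    if doc.find(required) is None:
        raise ValueError(f"missing element: {required}")

# Step 2: convert by rule - here XML to a flat key-value line, standing
# in for what an XSLT stylesheet would express declaratively.
flat = ";".join(f"{child.tag}={child.text}" for child in doc)
print(flat)  # → number=2023-001;amount=149.90
```

Pushing the check to the earliest possible point keeps malformed data from propagating into downstream automated processes.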
Instead of Destroying Content...
Anyone who wants to further increase the degree of automation of processes in customer communication in the sense of the next stage of digitization must ensure structured, consistent and centrally available data. For document and output management, this means preserving the content of documents as completely as possible right from the start instead of destroying it - as is often observed in the electronic inbox of companies, for example.
The problem here: in many companies, incoming e-mails are still converted into an image format, only to make parts of the document content interpretable again afterwards by means of OCR technology - the "deepest document Middle Ages", so to speak. This wastes resources unnecessarily, especially when you consider that e-mail attachments today can be quite complex documents running to dozens of pages.
Above all, however, this media discontinuity amounts to a data disaster: electronic documents (e-mails), which in themselves could be read and processed by IT systems, are first converted into TIFF, PNG or JPG files. Content thus turns into "pixel clouds": it is first rasterized into images and then laboriously made "readable" again using Optical Character Recognition (OCR). This is accompanied by the loss of the semantic structural information needed for later reuse.
How nice would it be, for example, if you could convert e-mail attachments of any type into structured PDF files immediately after receipt? This would lay the foundation for long-term, revision-proof archiving; after all, the conversion from PDF to PDF/A is only a small step.
...Preserve It as the Basis for Further Automation
The following example: A leading German insurance group receives tens of thousands of e-mails daily via a central electronic mailbox, both from end customers and from external and internal sales partners. Immediately after receipt, the system automatically "triggers" the following processes:
- Conversion of the actual e-mail ("body") to PDF/A
- Individual conversion of the e-mail attachment (e.g. various Office formats, image files such as TIFF, JPG, etc.) to PDF/A
- Merging of the e-mail body with the corresponding attachments and generation of a single PDF/A file per business transaction
- At the same time, all important information is read from the file (extracted) and stored centrally for downstream processes (e.g. generation of reply letters on an AI basis, case-closing processing, archiving).
Everything runs automatically and without media discontinuity. The clerk receives the document in a standardized format, without having to worry about preparation (classification, making legible).
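The inbound steps listed above can be sketched with Python's standard `email` module. The sample message is constructed in place of a real mailbox feed, and the actual PDF/A conversion is only stubbed, since it requires a dedicated converter component.

```python
from email.message import EmailMessage

# Build a sample message in place of a real mailbox feed.
msg = EmailMessage()
msg["Subject"] = "Tariff change"
msg.set_content("Please change my tariff to premium.")
msg.add_attachment(b"%PDF-1.7 dummy", maintype="application",
                   subtype="pdf", filename="contract.pdf")

def process(message):
    """Sketch of the inbound pipeline: collect body and attachments per
    business transaction. The PDF/A conversion itself would be done by a
    converter component and is only stubbed here."""
    parts = [("body", message.get_body(preferencelist=("plain",))
              .get_content().strip())]
    for att in message.iter_attachments():
        # Stub: hand att.get_content() to a PDF/A converter here.
        parts.append((att.get_filename(), att.get_content_type()))
    return parts

print(process(msg))
```

Because the message is never rasterized, body text and attachment metadata stay machine-readable throughout the pipeline.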
The insurer could still "split" the workflow into dark and interactive processing. During dark processing, every incoming e-mail plus attachment is automatically converted into a PDF/A file, transferred to the clerk and finally archived.
Interactive processing, on the other hand, involves the "intelligent" compilation of e-mail documents of different file formats into an electronic dossier (customer file/process). The clerk first opens the e-mail and the attachment in their mail client (Outlook, Lotus Notes, etc.) or their specialized case-handling application and decides what needs to be processed. The normal workflow then applies as with dark processing: conversion - forwarding - processing - archiving.
The interactive variant is particularly useful if not all documents have to be archived. Modern input management systems are now capable of automatically recognizing all common formats of e-mail attachments and converting them into a predefined standard format (e.g. PDF/A or PDF/UA). At the same time, they extract all necessary data from the documents and store it centrally.
Such scenarios can be implemented, for example, with systems such as DocBridge® Conversion Hub, whose linchpin is a central conversion instance. Its core is a kind of "dispatcher" that analyses every incoming message (e-mail, fax, SMS, messenger service, letter/paper) and decides which format is optimal for the document in question and how it is to be processed further. DocBridge® Conversion Hub also includes an Optical Character Recognition (OCR) function for extracting content and metadata.
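The dispatcher idea can be illustrated as a simple routing table from channel to converter. This is a hypothetical sketch; the channel names and handler functions are invented and do not reflect the actual product's API.

```python
# Simplified dispatcher, loosely modelled on the idea of a central
# conversion instance. Channels and handlers are illustrative only.
def convert_email(payload):  return f"email->PDF/A ({len(payload)} bytes)"
def convert_fax(payload):    return f"fax-image->PDF/A ({len(payload)} bytes)"
def convert_scan(payload):   return f"scan->OCR->PDF/A ({len(payload)} bytes)"

HANDLERS = {"email": convert_email, "fax": convert_fax, "paper": convert_scan}

def dispatch(channel: str, payload: bytes) -> str:
    """Route each incoming message to the converter for its channel."""
    try:
        return HANDLERS[channel](payload)
    except KeyError:
        raise ValueError(f"no converter registered for channel {channel!r}")

print(dispatch("email", b"raw message bytes"))
```

New channels can be supported by registering another handler, without touching the routing logic itself.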
¹ CIO online, 09/23/2019 ("KI ebnet den Weg zu unstrukturierten Informationen" - "AI paves the way to unstructured information")
⁴ The example of arranging a hairdresser's appointment showed the new dimension of intelligent speech systems such as "Duplex": previous systems can usually be recognized as "robots" within a few words (unnatural-sounding voice, wrong emphasis, choppy sentences, wrong or no response to queries). Not so AI tools of the new generation: they are quite capable of capturing content with complex syntax and "talk" so skilfully with people that the latter do not notice who or what their counterpart is.
Wikidata, NoSQL, XML and more - a small overview of important terms related to structured data and data storage
Wikidata is a freely accessible and jointly maintained knowledge database which, among other things, aims to support the online encyclopedia Wikipedia. The project was launched in 2012 by Wikimedia Deutschland e.V., a non-profit organisation for the dissemination of free knowledge, and provides a common source of certain types of data for Wikimedia projects (e.g. birth dates, universal data) that can be used in all Wikimedia articles.
Wikidata structures the knowledge of the world in language-independent data objects that can be enriched with various information. People as well as machines and IT systems can access this treasure trove of data and generate new knowledge. Wikibase, the software behind Wikidata, is also available as free and open software for all people.
One of the many examples of an open data project created with Wikibase is Lingua Libre. This directory of free audio recordings aims to preserve the sound of the world's languages and the pronunciation of their words in the form of structured data and make it available to everyone. The project originated in France, where the initiators were keen to promote endangered regional languages. One advantage of Lingua Libre is that interested users can add to the recordings - be it a few words, proverbs or entire sentences. Thus even people who are not familiar with phonetic transcription can hear how individual words are pronounced at the click of a mouse. With the launch of the Lingua Libre Wikibase installation in 2018, around 100,000 audio files in 46 languages were added to the directory.
Meanwhile, up to 1,200 recordings per hour can be made via the online application and uploaded directly to the free media archive Wikimedia Commons. Via the connection to Wikidata, the recorded sounds enrich Wikimedia projects such as Wikipedia and the free dictionary Wiktionary in particular - but they also support linguists in their research.
Since its launch, Wikidata has recorded a comparatively strong growth in content pages, with over 60 million data objects now available (as of September 2019).
Relational databases are used for electronic data management in computer systems and are based on a table-based relational database model. The basis of their concept is the relation. It represents a mathematical description of a table and is a well-defined term in the mathematical sense. Operations on these relations are determined by relational algebra.
The associated database management system is called RDBMS (Relational Database Management System). The SQL (Structured Query Language) language, whose theoretical basis is relational algebra, is predominantly used for querying and manipulating the data. The relational database model was first proposed in 1970 by Edgar F. Codd and is still an established standard for databases despite some criticism.
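A minimal example of the relational model in practice, using Python's built-in SQLite driver: two tables linked by a foreign-key relation and an SQL join across them. Table and column names are illustrative.

```python
import sqlite3

# An in-memory relational database: two tables, a relation between them,
# and an SQL query joining the two.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE contracts (id INTEGER PRIMARY KEY, "
            "customer_id INTEGER REFERENCES customers(id), tariff TEXT)")
con.execute("INSERT INTO customers VALUES (1, 'Erika Mustermann')")
con.execute("INSERT INTO contracts VALUES (10, 1, 'premium')")

rows = con.execute(
    "SELECT c.name, k.tariff FROM customers c "
    "JOIN contracts k ON k.customer_id = c.id"
).fetchall()
print(rows)  # → [('Erika Mustermann', 'premium')]
```

The join operation shown here is exactly the construct from relational algebra that NoSQL systems, discussed below, try to avoid for scalability reasons.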
Ontologies in computer science are mostly linguistic and formally ordered representations of a set of concepts and the relations existing between them in a certain subject area. They are used to exchange "knowledge" in digital and formal form between application programs and services. Knowledge includes both general knowledge and knowledge about very specific topics and processes.
Ontologies serve as a means of structuring and exchanging data in order to
- merge already existing knowledge,
- search and edit existing knowledge, and
- generate new instances from types of knowledge.
Ontologies contain inference and integrity rules, i.e. rules on conclusions and on ensuring their validity. They have experienced an upswing with the idea of the semantic web in recent years and are thus part of the representation of knowledge in the field of artificial intelligence. In contrast to a taxonomy, which forms only a hierarchical subdivision, an ontology represents a network of information with logical relations.
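A tiny inference rule of this kind can be sketched directly: below, the invented predicate `located_in` is treated as transitive, so computing the closure derives a fact that was never stored explicitly.

```python
# A minimal inference rule over a toy ontology: "located_in" is treated
# as transitive, so new facts can be derived from the stored ones.
facts = {
    ("Mainz", "located_in", "Rhineland-Palatinate"),
    ("Rhineland-Palatinate", "located_in", "Germany"),
}

def close_transitively(facts, predicate):
    """Repeatedly apply: (a p b) and (b p c) => (a p c)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for a, p1, b in list(derived):
            for b2, p2, c in list(derived):
                if p1 == p2 == predicate and b == b2:
                    if (a, predicate, c) not in derived:
                        derived.add((a, predicate, c))
                        changed = True
    return derived

inferred = close_transitively(facts, "located_in")
print(("Mainz", "located_in", "Germany") in inferred)  # → True
```

This is the essential difference from a taxonomy: the rule operates on logical relations in the network, not just on a hierarchy.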
NoSQL ("Not only SQL") refers to databases that follow a non-relational approach and thus break with the long history of relational databases. These data stores do not require fixed table schemata and try to avoid join operations. They scale horizontally. In the academic environment they are often referred to as "structured storage".
Relational databases typically suffer from performance problems with data-intensive applications such as indexing large volumes of documents, high-load websites, and streaming media applications. Relational databases are only efficient if they are optimized for frequent but small transactions or for large batch transactions with infrequent write access. However, they cannot cope well with high data requirements and frequent data changes at the same time.
NoSQL, on the other hand, handles many simultaneous read/write requests quite well. NoSQL implementations usually support distributed databases with redundant data storage on numerous servers, for example using a distributed hash table. This allows the systems to be easily expanded and to withstand server failures.
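The distribution idea can be sketched as simple hash-based partitioning: each key is mapped deterministically to one of several nodes (the node names are illustrative). Real systems typically use consistent hashing, so that adding a node moves only a fraction of the keys; this sketch omits that refinement.

```python
import hashlib

# Sketch of hash-based partitioning - the idea behind spreading a
# key-value store across several servers. Node names are illustrative.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key: str) -> str:
    """Map a key deterministically to one of the nodes."""
    digest = hashlib.sha256(key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# One dictionary per node stands in for the per-server storage.
store = {node: {} for node in NODES}

def put(key, value):
    store[node_for(key)][key] = value

def get(key):
    return store[node_for(key)].get(key)

put("customer:4711", {"name": "Erika Mustermann"})
print(get("customer:4711"))
```

Because every client computes the same mapping, reads and writes for a key always land on the same node without any central coordinator.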
RDF (Resource Description Framework) describes a technical approach on the Internet to formulate logical statements about arbitrary things (resources). Originally, RDF was designed by the World Wide Web Consortium (W3C) as a standard for describing metadata.
Meanwhile RDF is regarded as a fundamental building block of the "semantic web". RDF is similar to the classical methods for modeling concepts (UML class diagrams, entity relationship model). In the RDF model, each statement consists of the three units subject, predicate, and object, whereby a resource is described in more detail as a subject with another resource or a value (literal) as an object.
With another resource as the predicate, these three units form a "triple". In order to have globally unique identifiers for resources, these are by convention formed analogously to URLs. URLs for commonly used descriptions (e.g. for metadata) are known to RDF developers and can therefore be used worldwide for the same purpose, which among other things enables programs to display the data meaningfully for humans.
The Extensible Markup Language (XML) is a markup language used to represent hierarchically structured data in the format of a text file that can be read by both humans and machines.
XML is also used for the platform- and implementation-independent exchange of data between computer systems, especially via the Internet, and was published by the World Wide Web Consortium (W3C) on February 10, 1998. The current version is the fifth edition dated November 26, 2008. XML is a meta language on the basis of which application-specific languages are defined by structural and content restrictions. These restrictions are expressed either by a Document Type Definition (DTD) or by an XML Schema. Examples of XML languages are: RSS, MathML, GraphML, XHTML, XAML, Scalable Vector Graphics (SVG), GPX, but also the XML Schema itself.
XSL Transformation, or XSLT for short, is a programming language for transforming XML documents. It is part of the Extensible Stylesheet Language (XSL) and is Turing complete.
XSLT was published as a recommendation by the World Wide Web Consortium (W3C) on October 8, 1999. XSLT is based on the logical tree structure of an XML document and is used to define conversion rules. XSLT programs, so-called XSLT stylesheets, are themselves structured according to the rules of the XML standard.
The stylesheets are read by special software, the XSLT processors, which use these instructions to convert one or more XML documents into the desired output format. XSLT processors are also integrated into many modern web browsers, such as Opera (version 9 or higher), Firefox and Internet Explorer (version 5 or higher; with full XSLT 1.0 support from version 6). XSLT is a subset of XSL, along with XSL-FO and XPath.
The "Semantic Web" extends the Internet in such a way as to make data more exchangeable between computers and easier for them to use; for example, the term "Bremen" can be supplemented in a web document with information as to whether a ship, family or city name is meant here. This additional information explicates the otherwise unstructured data. Standards for the publication and use of machine-readable data (especially RDF) are used for implementation.
While people can infer such information from the given context (from the text as a whole, the kind of publication or its category, pictures, etc.) and unconsciously build up such links, machines must first be taught this context; for this purpose, the contents are linked with further information.
The "Semantic Web" conceptually describes a "Giant Global Graph". All things of interest are identified and, provided with a unique address, created as "nodes", which in turn are connected to each other by "edges" (also uniquely named). Individual documents on the Web then describe a series of edges, and the totality of all these edges corresponds to the global graph.
JSON was originally specified by Douglas Crockford. Currently it is defined by two competing standards: RFC 8259 from the IETF and ECMA-404 from Ecma International.
Source: Wikipedia; ITWissen.info