One of the best-known applications for this is Wikidata, the knowledge base behind the online encyclopedia Wikipedia, which now stores tens of millions of facts. If, for example, you want to know how many Bundesliga players born in Berlin are married to Egyptian women, this is the place to look. Admittedly a very unusual example, but one that illustrates the significance of the subject: the aim is to derive new connections and knowledge from structured data by means of algorithms and ontologies. This is where artificial intelligence (AI) comes into play, which can then be used to formulate complex queries (see the "Glossary").
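To make this concrete, here is a minimal sketch of how such a query could be sent to the public Wikidata SPARQL endpoint from Python using the requests library. The property and item identifiers used (P19 = place of birth, Q64 = Berlin, P26 = spouse, P27 = country of citizenship, Q79 = Egypt) cover only part of the example above; the Bundesliga condition is omitted, and all identifiers should be verified against Wikidata before use.

```python
# Minimal sketch: counting people born in Berlin whose spouse holds Egyptian
# citizenship, queried against the public Wikidata SPARQL endpoint.
# IDs (assumed): P19 = place of birth, Q64 = Berlin, P26 = spouse,
# P27 = country of citizenship, Q79 = Egypt. The additional "Bundesliga
# player" condition would need the correct league items and is omitted.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT (COUNT(DISTINCT ?person) AS ?count) WHERE {
  ?person wdt:P19 wd:Q64 .     # born in Berlin
  ?person wdt:P26 ?spouse .    # has a spouse ...
  ?spouse wdt:P27 wd:Q79 .     # ... with Egyptian citizenship
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "structured-data-demo/0.1"},
    timeout=30,
)
response.raise_for_status()
count = response.json()["results"]["bindings"][0]["count"]["value"]
print(f"Matches found: {count}")
```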
Another important topic in this context is that structured data must be validated against its structure, something that is often neglected today. XML Schema, for example, is a proven means of ensuring the correctness and completeness of an XML file. Errors caused by unchecked data can have very serious consequences.
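As an illustration, here is a minimal sketch of such a check in Python with the lxml library, validating a document against an XML Schema; the file names are placeholders.

```python
# Minimal sketch: validating an XML document against an XML Schema (XSD)
# with lxml before the data enters downstream processing.
# "invoice.xsd" and "invoice.xml" are placeholder file names.
from lxml import etree

schema = etree.XMLSchema(etree.parse("invoice.xsd"))
document = etree.parse("invoice.xml")

if schema.validate(document):
    print("Document is valid - safe to process further.")
else:
    # error_log lists every violation with line numbers
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")
```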
Consistent data validation is therefore essential. Last but not least, data must also be convertible from one format into another by means of rules. Many options exist for this today; one of the best known is certainly the transformation language XSLT (see also the "Glossary"), but other rule-based approaches are available as well.
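A correspondingly minimal sketch of a rule-based conversion, again assuming lxml; the source document and stylesheet names are placeholders.

```python
# Minimal sketch: converting one XML structure into another with an XSLT
# stylesheet via lxml. "source.xml" and "to-target.xslt" are placeholders.
from lxml import etree

source = etree.parse("source.xml")
stylesheet = etree.parse("to-target.xslt")

transform = etree.XSLT(stylesheet)
result = transform(source)

# The result can be serialized or handed to the next processing step.
print(etree.tostring(result, pretty_print=True, encoding="unicode"))
```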
Instead of Destroying Content...
Anyone who wants to further increase the degree of automation in customer communication documents, in the sense of the next stage of digitization, must ensure structured, consistent and centrally available data. For automated document processing and output management, this means preserving the content of documents as completely as possible right from the start instead of destroying it, as can often be observed in companies' electronic inboxes.
The problem here: in many companies, incoming e-mails are still rasterized, i.e. converted into an image format, only to make parts of the document content machine-interpretable again afterwards by means of OCR technology. This is the "deepest document Middle Ages". It wastes resources unnecessarily, especially when you consider that e-mail attachments today can be quite complex documents running to dozens of pages.
Above all, however, this media discontinuity amounts to a data disaster: electronic documents (e-mails) that could in principle be read and processed directly by IT systems are first converted into TIFF, PNG or JPG files. Content is thus turned into "pixel clouds". In other words, the actual content is first rasterized into images and then laboriously made "readable" again using Optical Character Recognition (OCR). This is accompanied by the loss of the semantic structural information needed for later reuse.
How much better would it be, for example, if e-mail attachments of any type could be converted into structured PDF files immediately upon receipt? This would lay the foundation for long-term, audit-proof archiving; after all, the conversion from PDF to PDF/A is only a small step.
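As a sketch of the first step under these assumptions, the following Python snippet parses an incoming e-mail with the standard library's email module and writes its attachments out unchanged instead of rasterizing them. The actual PDF/A conversion is left as a placeholder, since it depends on the converter in use.

```python
# Minimal sketch: reading an incoming e-mail as structured data and saving
# its attachments unchanged (no rasterization, no OCR). The conversion of
# each attachment to PDF/A is only a placeholder, because the concrete
# converter depends on the environment.
from email import policy
from email.parser import BytesParser
from pathlib import Path


def convert_to_pdfa(path: Path) -> Path:
    """Placeholder for the actual PDF/A conversion step."""
    raise NotImplementedError("plug in the converter of your choice")


def extract_attachments(eml_file: str, out_dir: str = "attachments") -> None:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)

    with open(eml_file, "rb") as fh:
        message = BytesParser(policy=policy.default).parse(fh)

    for part in message.iter_attachments():
        filename = part.get_filename() or "unnamed"
        target = out / filename
        target.write_bytes(part.get_payload(decode=True))
        print(f"Saved attachment: {target}")
        # convert_to_pdfa(target)  # next step in the chain


if __name__ == "__main__":
    extract_attachments("incoming.eml")  # placeholder file name
```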
...Preserve It as the Basis for Further Automation
Consider the following example: a leading German insurance group receives tens of thousands of e-mails daily via a central electronic mailbox, both from end customers and from external and internal sales partners. Immediately after receipt, the system automatically triggers the following processes:
- Conversion of the actual e-mail ("body") to PDF/A
- Individual conversion of each e-mail attachment (e.g. various Office formats, image files such as TIFF, JPG, etc.) to PDF/A
- Merging of the e-mail body with the corresponding attachments and generation of a single PDF/A file per business transaction
- At the same time, all important information is extracted from the file and stored centrally for downstream processes (e.g. AI-based generation of reply letters, end-to-end case processing, archiving).
Everything runs automatically and without media discontinuity. The clerk receives the document in a standardized format without having to worry about preparation steps such as classification or making the content legible.
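The merging step of this chain could look roughly like the following sketch, assuming the pypdf library. The preceding conversions to PDF/A and the downstream extraction are represented only by comments and placeholder names.

```python
# Minimal sketch of the merging step: combine the converted e-mail body and
# its converted attachments into one PDF per business transaction.
# Assumptions: the pypdf library is available; the input files have already
# been converted to PDF/A by an upstream step; all paths are illustrative.
from pathlib import Path
from pypdf import PdfWriter


def merge_case_file(body_pdf: Path, attachment_pdfs: list[Path], out: Path) -> None:
    """Merge the e-mail body and its attachments into one file per case."""
    writer = PdfWriter()
    writer.append(str(body_pdf))           # body first ...
    for attachment in attachment_pdfs:     # ... then each attachment
        writer.append(str(attachment))
    with open(out, "wb") as fh:
        writer.write(fh)
    # Downstream steps (metadata extraction, archiving) would start here.


# Usage (paths are illustrative):
# merge_case_file(Path("body.pdf"),
#                 [Path("attachment_1.pdf"), Path("attachment_2.pdf")],
#                 Path("case_file.pdf"))
```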