AFP to PDF Conversion

DBV-Winterthur

From AFP to PDF at 70 pages a second

The decision by the board of DBV-Winterthur Versicherungen in 2002 initially sounded unspectacular: agents and employees should have access to the company’s archive over the Web. Whether processes, correspondence, or contracts, they should be able to search AFP documents in the company’s FileNet archive with their browser and Acrobat Reader. New rules and standards had to be set and implemented to achieve this goal: in view of the large number of users and archive accesses, as well as the need to keep intranet response times short, the request for proposal specified a translation rate of at least 70 AFP document pages a second on a Sun Sparc-3.

The Challenge

Employees of DBV-Winterthur write up to 20,000 letters a day to their customers. In addition, an IBM mainframe produces tens of thousands of circular letters to be sent to insured clients. The requirement was not just to print these letters, but also to file them simultaneously in the FileNet archive. In this way, not just employees but also agents would soon be able to search the documents. Due to the huge number of documents in the archive, a decision was made to use AFP as the archiving format: because of AFP's resource concept, documents in this format are very small. Additionally, it is to be expected that AFP will continue to be supported by IBM in the future, ensuring that this format can be used to archive and reproduce documents long-term.

Due to the mix of users (employees, agents and clients), users should be able to access documents using a standard Internet browser. Presentation in standard AFP format was discounted, since downloading and installing an AFP plugin for the browser was not considered acceptable to external users. This constraint meant that requested documents had to be displayable by all users without any additional prerequisites, implying conversion into a universally readable format before transmission.

For these mostly multi-page documents, only PDF came into question as the target format, since it can nowadays be assumed that every user has Acrobat Reader installed on his PC. The real challenge therefore was how to convert AFP documents to PDF on the fly fast enough to meet the short response-time requirements for user access, even for the 1,000 simultaneous users targeted at project completion.

For this reason DBV-Winterthur set these guidelines for bidders:

  • The solution must be capable of converting at least 70 pages a second (the exact requirements were: 5 documents of 15 pages per second or 8 documents of 9 pages per second respectively) on a 750 MHz Sun Solaris Sparc-3 with two processors.
  • The total access time for displaying an archived AFP document on the Internet may not exceed one second.
  • The appearance of the PDF document should not deviate from the original AFP document.
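
The two load profiles in the first guideline can be checked with simple arithmetic. The short sketch below assumes the floor of 70 pages a second named in the headline and the introduction; the variable and function names are chosen here purely for illustration.

```python
# Floor taken from the headline and introduction (pages per second).
REQUIRED_PAGES_PER_SECOND = 70

# Load profiles from the guideline: (documents per second, pages per document).
profiles = {
    "5 documents of 15 pages": (5, 15),
    "8 documents of 9 pages": (8, 9),
}


def pages_per_second(docs_per_second, pages_per_doc):
    """Page throughput implied by a document rate and a document length."""
    return docs_per_second * pages_per_doc


rates = {name: pages_per_second(*profile) for name, profile in profiles.items()}

for name, rate in rates.items():
    meets = rate >= REQUIRED_PAGES_PER_SECOND
    print(f"{name}: {rate} pages/s, meets floor: {meets}")
```

Both profiles land slightly above the floor (75 and 72 pages a second), which is why the requirement can be summarized as "at least 70".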

The Solution

Fulfilling these requirements was in reality non-trivial. After much discussion and testing, the project team decided that not just a high-performance conversion product was required, but also a concept reaching beyond a standard solution. The initial tests showed DBV-Winterthur that Compart’s DocBridge Toolkit was most convincing in terms of throughput, stability, reproduction fidelity, and architecture. Initially the product could only attain the required throughput on a fast Windows machine, but the Compart development department soon identified ways to reach the required throughput on a Sun Solaris machine of the type specified.

The goal was accomplished with two measures: firstly, the product code was optimized to improve the internal resource caching; secondly, Compart recommended the use of an improved version of their AFP Resource Manager for resource analysis prior to archiving.

Further tests showed that the target throughput was obtainable if resources referenced in AFP documents were quickly accessible by the server. Each opening of resource data meant an additional I/O access, which slowed the whole conversion process critically. The number of resource files to be opened should therefore be kept as small as possible. This is best achieved by holding all of the resources referenced in an AFP document in a single resource file. In this case the resource file used to resolve references is to be found in the cache of the conversion server and can therefore be read very quickly by the processor.

The throughput should be achievable not just for a single document, but also for successive documents. These may differ from previous documents in the following ways:

  • An additional resource appears:
    One with a new, as yet unused, resource name. In this case it makes sense to extend the resource file with this extra resource, even if the new document does not reference some other resources in the resource file.
  • A resource is changed:
    It has the same name as the one it replaces, but has different contents. In this case, a new resource file must be made, so that the reference with the same name can correctly refer to the different associated resources.

This process of resource enhancement also ensures that adding a new resource does not require the creation of a new resource file, which must be opened when retrieving.
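
The two cases above can be sketched as a small decision rule. This is a minimal illustration, not Compart's implementation: the `ResourceFile` class, the `assign_resource_file` function, and the file names are hypothetical, and the choice to start a new resource file as a copy of the old one with the changed content replaced is an assumption the text does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceFile:
    """Hypothetical model of a resource file: resource name -> binary content."""
    name: str
    resources: dict = field(default_factory=dict)


def assign_resource_file(current, doc_resources, next_name):
    """Pick the resource file for a document, following the two cases above.

    current:       resource file used by the previous document
    doc_resources: {resource name: bytes} referenced by the new document
    next_name:     name for a new resource file, should one be needed
    """
    # Case "a resource is changed": same name, different binary content.
    changed = any(
        name in current.resources and current.resources[name] != content
        for name, content in doc_resources.items()
    )
    if changed:
        # Assumption: the new file copies the old one and overrides changes.
        merged = dict(current.resources)
        merged.update(doc_resources)
        return ResourceFile(next_name, merged)

    # Case "an additional resource appears": extend the existing file.
    for name, content in doc_resources.items():
        current.resources.setdefault(name, content)
    return current


rf1 = ResourceFile("job-a.v1", {"LOGO": b"old-logo"})
same = assign_resource_file(rf1, {"LOGO": b"old-logo", "FORM1": b"overlay"}, "job-a.v2")
newer = assign_resource_file(rf1, {"LOGO": b"new-logo"}, "job-a.v2")
print(same.name, sorted(same.resources))
print(newer.name, newer.resources["LOGO"])
```

In the usage lines, the first document only adds a resource, so the existing file is extended and reused; the second document changes `LOGO`, so a new file is created while the old one stays valid for documents archived earlier.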

It was also recommended to use the Compart AFP Resource Manager to analyze the resources of each print job. Considered on its own, a print job generally contains similar documents, all of which refer back to the same AFP resources. Given that documents in different print jobs seldom share common resources or, even if they do, use different reference names, it makes sense for documents in different print jobs to use separate resource files. In this way a changed resource only leads to a new resource file for a specific print job, while the resource files for other print jobs remain unaffected. Because a resource file for a print job is much smaller than a resource file for all print jobs together, the size increase from additional resource files is significantly smaller.

This method is a very effective way of limiting the memory reserved in the conversion server for caching resource files. It also benefits from the fact that the resources for AFP documents at DBV-Winterthur do not often change. Unlike other AFP resources, the AFP font resources are collected in a separate resource file used for all AFP documents. They are relatively large and as a rule do not change over many years, so it makes sense to keep them in a separate resource file that is held permanently in the server cache. The figure on page 14 shows schematically how resources at DBV-Winterthur are collected and read.
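
The caching idea described here can be illustrated with a small sketch: per-job resource files live in a bounded cache, while the font resource file is pinned and never evicted. The class name, the least-recently-used eviction policy, and the file names are assumptions made for this example; the source does not say how the conversion server's cache is organized internally.

```python
from collections import OrderedDict


class ResourceFileCache:
    """Illustrative cache: pinned files stay forever, others are LRU-evicted."""

    def __init__(self, load, capacity, pinned=()):
        self._load = load          # callable: file name -> bytes (one disk read)
        self._capacity = capacity  # maximum number of non-pinned files kept
        self._pinned = {name: load(name) for name in pinned}
        self._lru = OrderedDict()
        self.disk_reads = len(self._pinned)

    def get(self, name):
        if name in self._pinned:
            return self._pinned[name]
        if name in self._lru:
            self._lru.move_to_end(name)    # mark as most recently used
            return self._lru[name]
        data = self._load(name)            # cache miss: one extra I/O access
        self.disk_reads += 1
        self._lru[name] = data
        if len(self._lru) > self._capacity:
            self._lru.popitem(last=False)  # evict the least recently used file
        return data


# Stand-in for the server's disk; contents are placeholders.
fake_disk = {"fonts.res": b"F", "job-a.res": b"A", "job-b.res": b"B"}
cache = ResourceFileCache(fake_disk.__getitem__, capacity=1, pinned=["fonts.res"])

# Four documents: three from print job A, one from print job B.
for job_file in ["job-a.res", "job-a.res", "job-a.res", "job-b.res"]:
    cache.get("fonts.res")
    cache.get(job_file)

print(cache.disk_reads)
```

Because documents from the same print job share one resource file, the four conversions need only three disk reads in total: one for the fonts and one per print job.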

This is how the throughput target was achieved at DBV-Winterthur: in the final test the solution attained a speed of 116 pages a second (6.2 documents of 18.7 pages) or 58 pages per second (18 documents of 3.27 pages) respectively, converted from AFP to PDF, and DBV-Winterthur decided in favor of the Compart solution.

Implementing the Solution

To realize the complete solution the following two applications had to be implemented:

  • As recommended by Compart, the AFP Resource Manager had to be built into the AFP archiving of the FileNet archive, to create the required resource files.
  • For document search and display in a web browser it was necessary to program a web retrieval application.

“For resource extraction and creation of resource files, the AFP Resource Manager was implemented to work with the FileNet AFParchiver,” says Klaus-Dieter Häuser, Responsible Official of the DMS-System at DBV-Winterthur. “For this it analyzes the resources and compares them, byte for byte, with the resources of the previous resource file.” This check was performed for each newly archived document or document spool:

  • If an additional resource with a new name is in the data stream: the resource is added to the corresponding resource file.
  • If a resource has the same name as a resource already in the resource file: a binary compare is done. If the content has changed, a new resource file is created; otherwise an existing resource file with identical resources is assigned to the document.

After testing all resources for an AFP spool, the AFP Resource Manager inserts TLEs into each AFP document (TLE = Tag Logical Element: a non-printable element in an AFP data stream) giving the names of the two resource files in which all resources for the document are to be found: the resource file not containing the font resources and the font resource file. In this way an application can extract from a document the names of the resource files in which the referenced resources can be found.

Furthermore, the AFP Resource Manager provides the names of both resource files to the FileNet archive for each document. The archive reserves an additional index per document for the resource file name, so that the name can be read without having to analyze the AFP document data stream.

The AFP Resource Manager writes the font resource to a separate font resource file. The resource files are given to the FileNet AFParchiver for archiving and the conversion server for filing on one of its hard disks, for faster access during conversion.

The client retrieval application differs by user group. In-house users select documents on their PC using a servlet. (Information for experts: the servlet is started by an ActiveX component in Internet Explorer, which obtains the document ID by screen-scraping a CICS application.) The servlet calls an Enterprise JavaBean, which gets the selected document (if necessary including the resource file) from the FileNet archive, converts it to PDF using the DocBridge Toolkit and displays it to the user. For display by an agent, on the other hand, a Java application is planned that will provide the user with a search template in his browser and send his search request over a web application to the host case folder archive. This archive returns a results list from which the user selects the desired document.

The application server asks the archive server for the selected AFP document and its resource indexes. “The Java application on this server integrates the functions of the DocBridge Toolkit: it reads in the AFP document and analyzes the references to the resources in the data stream,” says Häuser. “To resolve the AFP references, it opens the resource file, the name of which is to be found in the resource index, unless the resource file is already in cache from a previous access and therefore quickly to hand. The document put together this way is finally converted by DocBridge Toolkit into PDF.” The PDF is sent to the Web server, which in turn passes it to the browser of the requesting user. There the document is displayed in the browser’s Acrobat Reader window.
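
The retrieval sequence Häuser describes can be outlined roughly as follows. The real application is written in Java and calls the DocBridge Toolkit; this Python sketch only mirrors the order of the steps. Every name in it (`retrieve_as_pdf`, `FakeArchive`, `get_document`, `get_resource_file`, `fake_convert`) is hypothetical, and the converter is a stub, not the toolkit's API.

```python
def retrieve_as_pdf(doc_id, archive, resource_cache, convert):
    """Hypothetical flow: fetch the document, resolve resources, convert.

    archive:        object offering get_document() and get_resource_file()
    resource_cache: dict mapping resource file name -> bytes, filled on a miss
    convert:        callable(afp_bytes, resource_blobs) -> pdf_bytes (stub)
    """
    # The archive returns the AFP data plus the resource file names
    # stored in the document's resource index.
    afp, resource_names = archive.get_document(doc_id)

    blobs = []
    for name in resource_names:
        if name not in resource_cache:  # open the file only on a cache miss
            resource_cache[name] = archive.get_resource_file(name)
        blobs.append(resource_cache[name])

    return convert(afp, blobs)


class FakeArchive:
    """Stand-in for the archive server; counts resource file reads."""

    def __init__(self):
        self.resource_reads = 0

    def get_document(self, doc_id):
        return b"AFP:" + doc_id.encode(), ["job-a.res", "fonts.res"]

    def get_resource_file(self, name):
        self.resource_reads += 1
        return name.encode()


def fake_convert(afp, blobs):
    """Placeholder for the real AFP-to-PDF conversion."""
    return b"%PDF-stub " + afp


archive, cache = FakeArchive(), {}
first = retrieve_as_pdf("4711", archive, cache, fake_convert)
second = retrieve_as_pdf("4712", archive, cache, fake_convert)
print(archive.resource_reads)
```

In this usage, the second request finds both resource files in the cache, so the resource files are read only once each across the two conversions.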

The implementation of the Web application was undertaken by ELSAG from Villingen-Schwenningen. More than 80 million documents are now accessible via the Web from the archive.