Converting documents

SeedDMS is for storing documents, like any other document management system (DMS). Converting documents into other formats may seem like an extra requirement that isn't really needed. But to make real use of a DMS, converting documents is inevitable. The most obvious use is the creation of preview images. Users expect those images for each document, because they make it much easier to scan quickly through a list of documents and identify the one they are searching for. Hence, there must be some conversion to turn a document into an image format known by the browser (e.g. png). Users also expect to be able to search for phrases in the content of a document. Hence, there must be some conversion to extract the text from a document and feed it into an index. Users may also want to print documents without much hassle. Hence, there must be some conversion to pdf, which most printers nowadays can process.

SeedDMS can do all that, based on a very generic conversion mechanism, able to convert any type of document into any other type of document, if a so-called conversion service can be implemented. A conversion service contains one or more conversions. Each conversion defines the mime type of the source and the target document and any number of additional parameters, e.g. the width of a created preview image, the language used for OCR, etc.
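The relation between services and conversions described above can be modelled in a few lines. This is only a minimal sketch in Python, not the actual SeedDMS API: the class names, the service names, and the helper find_services are all hypothetical and merely illustrate that a service bundles conversions, each defined by a source mime type, a target mime type, and optional parameters.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Conversion:
    """One conversion: source mime type, target mime type, extra parameters."""
    source: str
    target: str
    params: tuple = ()  # names of accepted parameters, e.g. ("width",)


@dataclass
class ConversionService:
    """A named service bundling one or more conversions (hypothetical model)."""
    name: str
    conversions: list = field(default_factory=list)

    def supports(self, source, target):
        return any(c.source == source and c.target == target
                   for c in self.conversions)


def find_services(services, source, target):
    """Return the names of all services able to convert source into target."""
    return [s.name for s in services if s.supports(source, target)]


# Made-up example services, only for illustration.
services = [
    ConversionService("image-preview", [
        Conversion("image/jpeg", "image/png", ("width",)),
        Conversion("image/gif", "image/png", ("width",)),
    ]),
    ConversionService("text-extract", [
        Conversion("text/html", "text/plain"),
    ]),
]

print(find_services(services, "image/jpeg", "image/png"))
```

Looking up a pair of mime types for which no service is registered simply returns an empty list, which corresponds to a document type that cannot be converted.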

The oldest, and for a long time the only, conversion service in SeedDMS (though it wasn’t called a conversion service) is the ‘Exec’ conversion service. It just runs an external command which performs the conversion. On Linux, there are many such commands available, e.g. pdftotext, docx2txt, or convert. Traditionally, those commands are configured in the settings of SeedDMS. There are three sections for commands converting documents into

text/plain (used for the fulltext search),
image/png (used for preview images),
application/pdf (used for a full preview of the document and for printing)

The file doc/README.Converter.md shipped with SeedDMS has a long list of example commands. They have all been tested on Linux and most Linux distributions contain them. By the way, this is one reason Linux is the preferred operating system to run SeedDMS.
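The idea behind running an external command for a conversion can be sketched as follows. This is a Python illustration under stated assumptions, not the code SeedDMS uses: the function run_exec_conversion, the dictionary TEXT_COMMANDS, and the {infile} placeholder are all made up for this example, and the real placeholder syntax in the SeedDMS settings may differ. To keep the sketch runnable without extra tools, the demo uses the Python interpreter itself as a stand-in command.

```python
import shlex
import subprocess
import sys

# Hypothetical settings: one command template per source mime type.
# "{infile}" is this sketch's own placeholder, not SeedDMS syntax.
TEXT_COMMANDS = {
    "application/pdf": "pdftotext -nopgbrk {infile} -",
}


def run_exec_conversion(command_template, infile, timeout=30):
    """Run an external command and return whatever it writes to stdout."""
    # Split first, substitute afterwards, so file names containing
    # spaces or quotes stay a single argument.
    args = [part.replace("{infile}", infile)
            for part in shlex.split(command_template)]
    result = subprocess.run(args, capture_output=True, text=True,
                            timeout=timeout, check=True)
    return result.stdout


# Stand-in command so the sketch runs without extra tools installed:
# the Python interpreter just prints the file name in upper case.
demo_template = shlex.join(
    [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", "{infile}"])
print(run_exec_conversion(demo_template, "my report.pdf"))
```

With pdftotext installed, calling run_exec_conversion(TEXT_COMMANDS["application/pdf"], "some.pdf") would return the extracted text instead. The timeout and check=True are design choices of this sketch: a hanging or failing converter raises an exception rather than silently producing an empty result.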

In principle, the ‘Exec’ conversion service could be used for any kind of conversion, because a command can even be a shell script implementing conversions not available as a single command. Still, there are cases where a more dedicated conversion service implemented in PHP is more suitable and often easier to set up. These services can be implemented as an extension.

SeedDMS already includes some conversion services for the most basic conversions:

conversion from various image formats into png (for preview)
conversion from pdf, tiff and svg into png (requires php-imagick)
extracting iptc data from jpeg images (for fulltext search)
conversion of various text formats (html, markdown, rst) into text

To get a list of conversion services available in your SeedDMS installation, you need to turn on the debug mode in the settings and then open the page ‘List of conversion services’ in the ‘Debug’ menu within the admin area. The first list on that page just contains the mime types of documents currently available in your installation and whether those types can be converted into preview images, text, and pdf documents. The second list contains all available conversions, especially those based on external commands.

SeedDMS itself does not use any other conversions than those into image/png, text/plain, and application/pdf, but there are actually no restrictions on the input and output format.

Some examples

When first thinking about converting a document from one format into another, one would expect the output to be an equivalent of the input, e.g. an office document converted into pdf should look identical without losing any information. In many cases this isn’t required and sometimes it may even be undesired, e.g. if the document is to be added to the full text index.

Audio files

Let’s have a look at an audio file which should have a recognizable preview image and a text representation to be fed into the fulltext index. It appears impossible to turn an audio file into an image or a text, but it’s actually not that difficult if you do not consider that conversion as creating a perfect copy, but as creating something that makes the document easy to recognize and quick to find with the full text search. How about taking the album cover of that song, if it has one? You may as well create an image visualizing the frequencies over time. The purpose of the text conversion is always to be able to easily find that document in a later search. So, how about taking the metadata of an mp3 file or even using the lyrics, if they exist?
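As a rough illustration of turning audio metadata into text, the following Python sketch reads an ID3v1 tag, the simple fixed-width block stored in the last 128 bytes of some mp3 files. It is only an example of the idea and not part of SeedDMS; the function name id3v1_to_text and the sample data are made up. Many files carry the richer ID3v2 tag at the start of the file instead, which this sketch does not handle.

```python
def id3v1_to_text(data):
    """Return a searchable text built from an ID3v1 tag, or '' if none exists.

    ID3v1 stores fixed-width fields in the last 128 bytes of the file:
    'TAG', title (30), artist (30), album (30), year (4), comment (30),
    genre (1).
    """
    if len(data) < 128 or data[-128:-125] != b"TAG":
        return ""
    tag = data[-128:]
    # title, artist, album, year, comment (the genre byte is ignored)
    fields = [tag[3:33], tag[33:63], tag[63:93], tag[93:97], tag[97:127]]
    texts = []
    for raw in fields:
        # Fields are padded with NUL bytes (sometimes with spaces).
        value = raw.split(b"\x00", 1)[0].decode("latin-1").strip()
        if value:
            texts.append(value)
    return "\n".join(texts)


def _pad(value, width):
    """Encode a string and pad it with NUL bytes to a fixed width."""
    return value.encode("latin-1").ljust(width, b"\x00")


# Made-up sample: a few bytes standing in for audio data, then the tag.
sample = (b"\xff\xfb" + b"\x00" * 64
          + b"TAG" + _pad("Example Song", 30) + _pad("Example Artist", 30)
          + _pad("Example Album", 30) + _pad("2001", 4) + _pad("", 30)
          + b"\x00")
print(id3v1_to_text(sample))
```

The returned text contains one field per line and could be fed into the fulltext index like the output of any other text conversion.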

Problems

Well, talking of problems may not be entirely right, but some conversions cause unexpected headaches and you just need to be aware of them.

Converting PDF to Text

Converting pdf documents into plain text appears simple, and it often is, but it can be quite complicated. There are various tools that extract the text from a pdf with a single command. So, what could possibly go wrong? This pretty much depends on what your pdf document consists of. Is it just one image embedded into a pdf, or several images, or does it contain text as well?

extracting text from a scanned image returns no text unless the document was run through OCR software adding a text layer to the document
ocrmypdf can add such a text layer to a scanned document, but by default it will only do so if the document does not contain text already
if a document contains text and images, the text extraction will only extract the text but disregard any text on the images
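One way to notice the cases listed above is to check which pages yield no text at all. The following Python sketch assumes the extracted text separates pages with form feed characters, which is how pdftotext marks page ends by default; the function name pages_without_text and the sample string are made up for illustration. A page reported here is a candidate for OCR, but the check cannot detect text hidden in images on a page that also has a text layer.

```python
def pages_without_text(extracted, min_chars=1):
    """Return 1-based numbers of pages whose extracted text is (almost) empty.

    Assumes pages are separated by form feed characters.
    """
    pages = extracted.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()  # drop the empty chunk after the final form feed
    return [number for number, page in enumerate(pages, start=1)
            if len(page.strip()) < min_chars]


# Made-up sample: page 2 is a scan without a text layer.
sample = "First page with a text layer.\f\fThird page, also text.\f"
print(pages_without_text(sample))
```

Raising min_chars treats pages with only a few stray characters, such as a page number, as pages without text as well.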

Problems with hyphenation

Another potential problem which isn’t obvious at first sight is hyphenated words. Text in pdf documents is just a sequence of letters. There is no notion of paragraphs or sentences and not even words. A hyphenated word falls apart into two words which are indexed separately. E.g. ‘Computer’ may become ‘Com-’ and ‘puter’, and searching for ‘Computer’ will not find either of those.
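A simple heuristic against this is to join words that were split by a hyphen at the end of a line before the text is indexed. The Python sketch below shows the idea; the function name join_hyphenated is made up, and the approach is a heuristic, not something SeedDMS is known to do. Its obvious weakness is that it also joins genuine compounds broken at their hyphen, e.g. ‘well-’ and ‘known’ become ‘wellknown’.

```python
import re

# A word character, a hyphen at the end of a line, then a word character
# at the start of the next line (optional indentation is skipped).
_HYPHEN_BREAK = re.compile(r"(\w)-[ \t]*\r?\n[ \t]*(\w)")


def join_hyphenated(text):
    """Join words split by a hyphen at a line break."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


print(join_hyphenated("A Com-\nputer stores docu-\n  ments."))
```

Hyphens inside a line are left untouched, so only words broken across lines are affected.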

Headers and footers in Office Documents

Converting Office Documents into text for full text indexing is actually easy, e.g. with docx2txt. But if you compare different conversions, you will realize that headers and footers are treated differently. Sometimes they are skipped completely, but they may as well appear once for every page. This makes a difference when searching, if the word frequency is taken into account by the full text search.
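If repeated headers and footers distort the word frequency, they can be filtered out before indexing. The following Python sketch assumes the converted text is already available page by page, which not every converter provides; the function name drop_repeated_lines, the threshold, and the sample pages are made up for illustration. It drops every line that occurs on most pages.

```python
from collections import Counter


def drop_repeated_lines(pages, threshold=0.8):
    """Remove lines that occur on at least `threshold` of all pages."""
    if len(pages) < 2:
        return list(pages)
    seen_on = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        for line in {l.strip() for l in page.splitlines() if l.strip()}:
            seen_on[line] += 1
    limit = threshold * len(pages)
    repeated = {line for line, count in seen_on.items() if count >= limit}
    return ["\n".join(l for l in page.splitlines()
                      if l.strip() not in repeated)
            for page in pages]


# Made-up sample with a header and a footer on every page.
pages = [
    "ACME Annual Report\nRevenue grew.\nConfidential",
    "ACME Annual Report\nCosts fell.\nConfidential",
    "ACME Annual Report\nOutlook is stable.\nConfidential",
]
print(drop_repeated_lines(pages))
```

Footers containing a page number differ from page to page and would survive this filter; handling them would need an additional normalization step.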