Full text index

Storing documents in a Document Management System like SeedDMS helps you organize your scanned paperwork, digital media, or any other electronic resources you would like to keep in a central place and possibly share with other people. When setting up such a system you have to define a schema for organizing the documents. Since SeedDMS offers folders just like a regular file system, this is quite often the preferred way, considering that users are well acquainted with such a hierarchical folder structure. Quite often such a structure is sufficient to quickly find certain documents, even among several thousand. Still, one of the key features of any DMS is searching for documents to quickly get what you are looking for. SeedDMS supports two kinds of searching:

  • Database search
  • Full text search

Database search is based on the meta data of each document and folder stored in the database of SeedDMS. Hence, documents are searched by their title, comment, owner, keywords, custom attributes, etc., but not by their content. Nothing needs to be configured; it works out of the box, and search results always reflect the current state of the database.

Full text search takes not only the meta data but also the content of a document into account. It needs to be configured and the index must be updated regularly. Folders are also added to the full text index, but of course they do not have any content. Full text search is much faster than database search because of the optimized way the index is stored. On the other hand, it often returns many more hits, making it difficult to find a particular document.

Full text search in SeedDMS requires two steps in the settings: first turning it on, and second selecting a full text search engine. There are two engines shipped with SeedDMS:

  • Zend Lucene (obsolete)
  • SqliteFTS

Lucene is still around, but it is based on the very old Zend Lucene library, which has been obsolete for some time now. SqliteFTS is currently the way to go. It is also faster than Zend Lucene. There is also an extension using Solr, provided by MMK GmbH, which is not freely available.
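The difference between the two kinds of searching can be illustrated with a minimal sketch using Python's sqlite3 module and SQLite's FTS5 extension. This is not the schema SeedDMS uses; the table names, columns and sample documents are made up for illustration, and the sketch assumes your SQLite build includes FTS5.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Meta data in a regular table: this is all a database search can see.
con.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, comment TEXT)")
# Hypothetical full text index: meta data plus the extracted plain text.
con.execute("CREATE VIRTUAL TABLE docs_fts USING fts5(title, comment, content)")


def add_document(doc_id, title, comment, content):
    """Store the meta data and add the document to the full text index."""
    con.execute("INSERT INTO documents VALUES (?, ?, ?)", (doc_id, title, comment))
    con.execute(
        "INSERT INTO docs_fts (rowid, title, comment, content) VALUES (?, ?, ?, ?)",
        (doc_id, title, comment, content),
    )


def database_search(term):
    """Search the meta data only, the way a plain database search would."""
    pattern = f"%{term}%"
    rows = con.execute(
        "SELECT id FROM documents WHERE title LIKE ? OR comment LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return [row[0] for row in rows]


def fulltext_search(term):
    """Search meta data and content through the full text index."""
    rows = con.execute(
        "SELECT rowid FROM docs_fts WHERE docs_fts MATCH ? ORDER BY rowid", (term,)
    )
    return [row[0] for row in rows]


add_document(1, "Invoice 2023-17", "paid", "Two office chairs and one standing desk")
add_document(2, "Rental contract", "garage", "The tenant rents the garage for one year")

print(database_search("chairs"))  # the word appears only in the content
print(fulltext_search("chairs"))
```

The word "chairs" occurs only in the content of the first document, so only the full text search finds it, while "garage" is part of a comment and is found by both.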

Once the full text search engine is set and full text search is enabled, the full text search index (often just called the index) must be built. Depending on the number of documents and folders this can take minutes to hours. It can be done while SeedDMS is in use, though it may impact the responsiveness of your server. While the index is being built, full text search can already be used, but of course the results can be incomplete.

The crucial point about full text search is the content of the documents. It may appear obvious what the content of a document is, but at second glance it no longer is.

What is meant by the content of a document

For a document like this one, it is quite clear what its content is. It is the text starting at the title all the way down to the end of the document, disregarding any formatting and keeping just the plain text. In this particular case it is written in markdown, which is already close to plain text, but there are still some formatting instructions which must not be indexed. A LibreOffice Writer document contains text as well, but extracting it is not that simple anymore, and it also raises questions like how to handle headers and footers if they exist. Are they part of the extracted plain text? Other files, e.g. an audio file, do not contain any text, but wouldn’t it be great if you could find your favorite songs by their lyrics? For scanned images containing text, OCR is required to extract that text.

Considering the enormous number of different file types, it is impossible for SeedDMS to know how to extract the indexable text from each of them. That is why SeedDMS just implements a conversion service which takes care of turning a file into plain text. This is the same service that converts files into PNG (for preview images) and PDF. It could actually be used to convert from any mime type into any other mime type. The conversion service could do its work fully in PHP, but quite often it is just a wrapper calling an external command, e.g. pdftotext.
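The idea of a converter that merely wraps an external command can be sketched in a few lines. SeedDMS itself is written in PHP, so the following Python code is not its conversion service; the function name, the `{infile}` placeholder and the mapping are assumptions made for this illustration. The pdftotext call uses its `-nopgbrk` option and `-` to write the text to standard output.

```python
import shlex
import subprocess

# Hypothetical mapping from mime type to a command template.
# '{infile}' is a placeholder invented for this sketch.
CONVERTERS = {
    "application/pdf": "pdftotext -nopgbrk {infile} -",
    "text/plain": "cat {infile}",
}


def to_plain_text(path, mimetype, converters=CONVERTERS, timeout=30):
    """Return the plain text of a file, or None if it cannot be converted."""
    template = converters.get(mimetype)
    if template is None:
        return None  # no converter configured for this mime type
    # Split first and substitute afterwards, so paths with spaces stay intact.
    argv = [part.replace("{infile}", path) for part in shlex.split(template)]
    try:
        result = subprocess.run(argv, capture_output=True, timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired):
        return None  # command not installed or it took too long
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

A file whose mime type has no entry in the mapping yields None, which corresponds to a document being indexed by its meta data only.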

For a good and complete full text index, you should have a converter into text/plain for each of the mime types in your SeedDMS. This can either be an extension or an external command configured in the settings. There are lots of examples in the file doc/README.Converters shipped with SeedDMS. By the way, this is one reason for using Linux as your server’s operating system: there are simply so many easily installable programs for converting files.

SeedDMS always indexes only the content of the latest document version; older versions are not considered.

Creating and updating the index

There are various ways to create or update the full text index. If you have a reasonably low number of documents (several thousand or fewer) you could just use the web UI, but sooner or later you will look for ways to do this automatically. You may wonder why updating is required at all. Couldn’t this be done in the background whenever a document is changed, added, updated, etc.? The answer is yes, but this has limits. Changing the title or comment of a document will immediately update the index, but moving folders with subfolders and documents will not, because this could take an unforeseeably long time and slow down the actual operation. Impatient users quickly get the impression that the system has crashed. Hence, some automatic updating of the index is required which indexes all those documents that need it.
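How an update run might pick the documents that need indexing can be sketched as follows. The article does not describe how SeedDMS decides this internally, so the comparison of timestamps below is an assumption, and the function and parameter names are made up for illustration.

```python
def documents_to_index(modified_at, indexed_at):
    """Decide which index entries are missing, outdated or left over.

    modified_at: dict mapping document id -> time of the last change
    indexed_at:  dict mapping document id -> time the index entry was written
    Returns (ids to index again, ids to remove from the index).
    """
    todo = []
    for doc_id, modified in modified_at.items():
        last_indexed = indexed_at.get(doc_id)
        # Never indexed, or changed after it was indexed the last time.
        if last_indexed is None or last_indexed < modified:
            todo.append(doc_id)
    # Entries in the index whose document no longer exists.
    stale = [doc_id for doc_id in indexed_at if doc_id not in modified_at]
    return sorted(todo), sorted(stale)


todo, stale = documents_to_index(
    {1: 100, 2: 200, 3: 300},  # documents in the database
    {1: 100, 2: 150, 4: 50},   # entries in the index
)
print(todo, stale)
```

In this example document 2 was changed after it was indexed, document 3 was never indexed, and entry 4 belongs to a document that no longer exists.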

Web UI

The index can be created and updated from within the admin tools. This will first scan all documents and build a folder tree which is then processed. It starts at the bottom of that page and issues an error message for each document which could not be indexed. That page manages a queue of at most 5 entries to be indexed. Keep that page open and visible, because newer browsers will stop any JavaScript if the browser window is minimized or the tab is no longer active. This will stop the indexing.

Task

Instead of reindexing the documents manually, you can let a task do it in SeedDMS 6.x. Just activate it in the scheduler and make sure your scheduler runs frequently (see Set up the scheduler). The task can either recreate or update the index. A good plan is to update at most hourly and recreate at most daily.

Old cron job

For those still running SeedDMS 5.1.x there is a shell script named seeddms-indexer located in the directory utils. Just run it from a cron job or on the command line. seeddms-indexer -h will output a list of command line options.

Searching

Once a full text index has been created, it can be used for searching. There are two tabs on the search page titled Full text search and Full text search (facets). The second one was added in 6.0.24 and implements a different approach based on facets, while the first one is closer to the database search. Both search within the content, title, comment and keywords, and can filter the result by other meta data.

Pitfalls

Because of the way full text search is implemented in SeedDMS (though this isn’t much different from other systems), it has some pitfalls you need to be aware of. In contrast to database search, a full text index needs to be updated when a document or folder has changed. SeedDMS tries to do that as well as possible, but some operations are so far-reaching that it is impossible (or inefficient) to update every affected document in the index. Recreating the index once in a while is inevitable.

Adding a new custom attribute

Of course, custom attributes are indexed as well. Unfortunately, the schema of the SqliteFTS index is created when rebuilding the index. A custom attribute added afterwards will not be part of the index. Hence, when adding new custom attributes you should rebuild the index afterwards.

Changing access rights or adding a new user

The index not only contains the meta data and the content of the document, it also stores the read access rights on documents. That makes searching even faster, because the result set can easily be reduced to the documents accessible by the logged in user. The downside is obvious: any change of access rights, especially one affecting many documents due to inheritance, may require updating the affected documents in the index. This is not done automatically, so rebuilding the index is inevitable.

Don’t worry, this will not result in listing documents a particular user has no read access to, but it may lead to incomplete result sets or wrong hit counts.

A similar problem occurs when a new user is added. This user will not be able to find any documents until the index is rebuilt.
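Why a stale index hides documents from a new user can be shown with a small sketch, again using Python's sqlite3 module with FTS5. The layout below, with a column holding the logins that had read access at indexing time, is an assumption made for this illustration and not the actual schema of the SqliteFTS index; all names and sample data are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical index layout: 'users' lists the logins with read access
# at the time the document was indexed.
con.execute("CREATE VIRTUAL TABLE docs_fts USING fts5(title, content, users)")

# Stand-in for the database: texts and current access rights per document id.
texts = {1: ("Budget", "travel costs for 2024"), 2: ("Minutes", "travel policy discussed")}
readers = {1: {"alice", "bob"}, 2: {"alice"}}


def rebuild_index():
    """Throw away the index and fill it from the current state of the database."""
    con.execute("DELETE FROM docs_fts")
    for doc_id, (title, content) in texts.items():
        con.execute(
            "INSERT INTO docs_fts (rowid, title, content, users) VALUES (?, ?, ?, ?)",
            (doc_id, title, content, " ".join(sorted(readers[doc_id]))),
        )


def quote(value):
    """Quote a value as an FTS5 string so it is not parsed as query syntax."""
    return '"' + value.replace('"', '""') + '"'


def search(term, login):
    """Find documents containing term which login could read at indexing time."""
    query = f"content : {quote(term)} AND users : {quote(login)}"
    rows = con.execute(
        "SELECT rowid FROM docs_fts WHERE docs_fts MATCH ? ORDER BY rowid", (query,)
    )
    return [row[0] for row in rows]


rebuild_index()
readers[1].add("carol")             # carol gets read access after indexing
before = search("travel", "carol")  # the index does not know carol yet
rebuild_index()
after = search("travel", "carol")   # now document 1 can be found
print(before, after)
```

The search never returns a document the user was not allowed to read when it was indexed, but it misses documents the user has gained access to since, which matches the incomplete result sets described above.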

What we did not consider

This document is far from complete. Many aspects are not covered because they are out of scope for this article.

  • Converters to text/plain may spit out lots of phrases and terms you may not want to index.
  • You will also have to ensure that only UTF-8 encoded data is indexed.
  • In some cases it may not be reasonable to index the whole document but only parts of it.
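The point about UTF-8 can be illustrated with a short sketch that normalizes the output of a converter before it is handed to the index. This is not part of SeedDMS; the function name and the choice of Latin-1 as the fallback encoding are assumptions made for this example.

```python
def to_utf8_text(raw, fallback_encoding="latin-1"):
    """Turn converter output (bytes) into clean text that is safe to index."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume a legacy single byte encoding instead.
        text = raw.decode(fallback_encoding, errors="replace")
    # Keep newlines and tabs, drop all other non-printable characters.
    return "".join(ch for ch in text if ch in "\n\t" or ch.isprintable())


print(to_utf8_text("Gr\u00fc\u00dfe".encode("latin-1")))
```

Guessing the encoding this way is only a heuristic. If you know what encoding a converter produces, it is more reliable to convert explicitly.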

Conclusion

Document management without full text search is not document management. Especially when the number of documents grows, it is a quick way to find what you are looking for. The downside is the extra configuration work for initially setting it up and for updating the index afterwards. Since in SeedDMS both database and full text search may be used interchangeably, there is no reason to turn full text search off.