This post has been written by Aquaforest Guest Blogger, Agnes Molnar, Founder and Managing Consultant, Search Explained.
Enterprise Search is intended to help users find relevant and valuable content to get their jobs done.
However, when we focus on the key expression of this intent, “relevant and valuable content”, we immediately face the first challenge: how can “content value” be defined?
Content value
Basically, there are three factors that make content valuable:
- Timeliness – In most cases, users want the latest version of a document (unless they explicitly specify otherwise). Usually, the older a document is, the less value it has.
Some examples where timeliness is critical: policies, procedure descriptions, document templates, manuals, etc. In other cases, finding old content has business value: for example, legal documents, invoices, archives, etc.
- Accuracy – Content is accurate if it is carefully prepared, precise, exact, and consistent with the company’s standards and rules. We can improve accuracy not only by improving the content itself but also by adding more metadata that is itself accurate.
- Completeness – A document is complete when it has all the needed information and is not missing any relevant data. A complete document has integrity.
From a findability perspective, the more of the content is available to read, crawl and index, the more complete the document can be. Clearly, we should make the whole content readable to support better findability. Readable content also makes better accuracy possible, which increases the benefits even more.
Improving content value for better findability
After defining these characteristics, the next question is how to enhance content value to support better findability.
To make it easier to find the latest, most timely content, we have to make sure outdated, old, legacy and archive content is either removed from the search index (if it is absolutely unnecessary), or filtered out of the default user interface and only available on a separate page.
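As a rough illustration of the second option, here is a minimal Python sketch that splits search results into a default view and a separate legacy list. The `Document` fields, the `is_archive` flag and the two-year `max_age_days` cutoff are assumptions made for the example; they are not settings of any particular search product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    title: str
    last_modified: date
    is_archive: bool = False  # hypothetical flag set by the content owner


def split_by_timeliness(docs, today, max_age_days=730):
    """Return (current, legacy); legacy items are kept, but for a separate page."""
    cutoff = today - timedelta(days=max_age_days)
    current, legacy = [], []
    for doc in docs:
        if doc.is_archive or doc.last_modified < cutoff:
            legacy.append(doc)
        else:
            current.append(doc)
    # Newest first, so the latest version surfaces at the top of the default view.
    current.sort(key=lambda d: d.last_modified, reverse=True)
    return current, legacy


docs = [
    Document("Travel policy v3", date(2024, 5, 1)),
    Document("Travel policy v1", date(2019, 2, 10)),
    Document("Invoice 2018-044", date(2018, 7, 3), is_archive=True),
]
current, legacy = split_by_timeliness(docs, today=date(2025, 1, 1))
print([d.title for d in current])
print([d.title for d in legacy])
```

Nothing is deleted here: the legacy list can still feed an archive search page for the cases where old content has business value.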
To make the content more accurate, there are three things to do. First of all, we have to make sure it is of good quality and valuable to (human) subject matter experts. Second, we have to make the content complete. Last but not least, we have to add as much relevant metadata as possible.
In many cases, completeness is the weakest point of content quality, especially when the document is a scanned, non-OCR’d picture or PDF file. In these cases, the content is there, readable and consumable for humans, but it doesn’t contribute to the document’s value because the content processing engine cannot read it. To improve completeness, the first thing to do is to make sure the content of the document is machine-readable.
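One simple way to spot such files is to look at how much text the crawler managed to extract from them. The sketch below is only a heuristic: the `needs_ocr` helper, the `min_chars` threshold and the `crawled` dictionary are hypothetical and stand in for whatever your own content processing pipeline provides.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess whether a file lacks a usable text layer.

    extracted_text is whatever a (hypothetical) crawler pulled out of the file;
    a scanned image or image-only PDF typically yields nothing, or almost nothing.
    """
    meaningful = [ch for ch in (extracted_text or "") if ch.isalnum()]
    return len(meaningful) < min_chars


# Hypothetical crawl output: file name -> extracted text (None if nothing came back).
crawled = {
    "handbook.docx": "Employee handbook. Section 1: working hours and leave policy.",
    "contract_scan.pdf": "",
    "receipt.tiff": None,
}
ocr_queue = sorted(name for name, text in crawled.items() if needs_ocr(text))
print(ocr_queue)
```

The threshold is deliberately crude; a short but genuine document would also be flagged, so treat the result as a review queue, not a verdict.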
The key is: OCR
Making the content machine-readable is the key to better content completeness as well as accuracy. This can be done by processing the files with OCR (Optical Character Recognition) technology to create a text version of their contents. This enables the files to be searched and found.
Besides the obvious and immediate benefits of OCR-ing these documents, there is a common side effect, too: once the content is machine-readable, more metadata can be extracted and created automatically. Therefore, the accuracy of the document will be boosted, too.
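To make this side effect concrete, here is a small Python sketch that pulls two metadata fields out of OCR output with regular expressions. The field names, the patterns and the sample text are assumptions chosen for illustration; real documents will need patterns tuned to their own layouts.

```python
import re

# Illustrative patterns only: an ISO-style date and an "Invoice No." label.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
INVOICE_RE = re.compile(r"\bInvoice\s+(?:No\.?|#)\s*([A-Z0-9-]+)", re.IGNORECASE)


def extract_metadata(ocr_text):
    """Return whichever metadata fields could be found in the OCR'd text."""
    metadata = {}
    date_match = DATE_RE.search(ocr_text)
    if date_match:
        metadata["document_date"] = date_match.group(0)
    invoice_match = INVOICE_RE.search(ocr_text)
    if invoice_match:
        metadata["invoice_number"] = invoice_match.group(1)
    return metadata


# Made-up OCR output for a scanned invoice.
sample = "ACME Ltd.\nInvoice No. INV-2041\nDate: 2018-07-03\nTotal due: 1,250.00"
print(extract_metadata(sample))
```

Before OCR, none of these fields would exist for a scanned file, because there would be no text to match against.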
Don’t forget: search engines can only read text. They cannot understand images or the content creator’s intent. Therefore, creating and generating as much textual information as possible is essential for getting this content processed and making it findable. OCR technologies are here to help.
[If you want to learn more about the related technologies and available tools, please check Aquaforest SearchLight.]