Spend 15 min to understand exactly how Content Server' Full Text search is working!This article takes you behind the scenes and shows you exactly how the full text indexing works in Oracle Content Server. If you ever tried to troubleshoot your search, indexer, batch loader or a performance issue – without a full understanding of how the things really work under the hood – I don’t have to tell you how much time this article can really save you. So without any further due – here it comes.
Before you try to follow these steps, make sure that you have SearchIndexerEngineName=DATABASE.FULLTEXT set in your config.cfg. Oracle Text Search will work in a very similar fashion, even though there’re some minor differences. Now for the 10g and 11g systems - the process is the same and the steps below will apply to both.
Now that you checked that you have your full text search enabled, let’s begin. First, enable the "systemdatabase" and "indexer" verbose logging. Go to your System Audit Information and make sure you have the options selected like this:
Ready? Now, let’s set the stage by checking in a PDF document. I’ve created a small PDF for this demonstration:
Check it in and then go back and refresh your server output. Search for the content id of the PDF document you’ve just checked in. You'll see that the indexing process has been initiated (See screenshot below)
You'll see the application/pdf - MIME type being accepted as an indexable format. This means that the document was either successfully checked in or successfully converted by the inbound refinery.
Now let’s take a look at the next couple of lines. The Content Server is then spawning a new process - the textexport tool, that comes with Oracle Outside In functionality and comes with the ContentAccess component - part of the standard Content Server distribution.
The textexport program reads the file, extracts the text and saves it as a text file in the active search collection folder - i.e.
If you continue to follow your server output, you'll also see the name and the path to the temporary text file. I'm showing the file and its content here:
(See screenshot below)
Before we go on, let me tell you something else, it's important. For performance and storage efficiency reasons only the latest revision of the document is stored in the IdcColx tables. This means that the full text searches are only done on the latest released revision.
The first step in updating the idcColx table is then to remove any reference to that content ID, if it already exists - to make sure that only one (latest) revision is present. Since we did a new checkin, the attempted delete operation returns "0 Rows Affected." (See screenshot below)
We're now ready to insert metadata and full text into the database table. The text that was extracted out of the document is inserted into the dDocFullText field, which is a BLOB field. Below is the actual SQL statement that indexer is using to do just that:
And here's how the end result looks like if you run a select statement on the database:
Ok, we've just traced the process of creating an index. And now let's look at...
How the index is used
(You can also use the quick search box on the top right.)
Got the page loaded? Gut. Now clear your server output before running your search so you get to see exactly what happens in the database (See screenshot below)
The full text extracted from the document is tokenized per word, so the search is actually looking for word hits. So searching for "test" will return our item and searching for "te" will not. Actually, this doesn’t have much to do with the Content Server. Let me show you what the actual SQL statements will return:
Searching for "test" will happily return our record :