Tuesday, September 4, 2012

Secrets of the Full Text Search

Spend 15 min to understand exactly how Content Server' Full Text search is working!

This article takes you behind the scenes and shows you exactly how the full text indexing works in Oracle Content Server. If you ever tried to troubleshoot your search, indexer, batch loader or a performance issue – without a full understanding of how the things really work under the hood – I don’t have to tell you how much time this article can really save you. So without any further due – here it comes.

Before you try to follow these steps, make sure that you have SearchIndexerEngineName=DATABASE.FULLTEXT set in your config.cfg. Oracle Text Search will work in a very similar fashion, even though there’re some minor differences. Now for the 10g and 11g systems - the process is the same and the steps below will apply to both.

Now that you checked that you have your full text search enabled, let’s begin. First, enable the "systemdatabase" and "indexer" verbose logging. Go to your System Audit Information and make sure you have the options selected like this:

Ready? Now, let’s set the stage by checking in a PDF document. I’ve created a small PDF for this demonstration:

Check it in and then go back and refresh your server output. Search for the content id of the PDF document you’ve just checked in. You'll see that the indexing process has been initiated (See screenshot below)

You'll see the application/pdf - MIME type being accepted as an indexable format. This means that the document was either successfully checked in or successfully converted by the inbound refinery.

Now let’s take a look at the next couple of lines. The Content Server is then spawning a new process - the textexport tool, that comes with Oracle Outside In functionality and comes with the ContentAccess component - part of the standard Content Server distribution.

The textexport program reads the file, extracts the text and saves it as a text file in the active search collection folder - i.e. /IdcCol2/bulkload/~export

If you continue to follow your server output, you'll also see the name and the path to the temporary text file. I'm showing the file and its content here:

(See screenshot below) 

Now that the text is extracted - it is then placed into your database table (IdcCol1 or IdcCol2) - along with the metadata.

Before we go on, let me tell you something else, it's important. For performance and storage efficiency reasons only the latest revision of the document is stored in the IdcColx tables. This means that the full text searches are only done on the latest released revision.

The first step in updating the idcColx table is then to remove any reference to that content ID, if it already exists - to make sure that only one (latest) revision is present. Since we did a new checkin, the attempted delete operation returns "0 Rows Affected." (See screenshot below) 

Now if we're dealing with a delete, the newly deleted revision is then removed from an IdcColx table, the text from the new latest revision is extracted and is inserted as the new full text search record - along with its metadata.

We're now ready to insert metadata and full text into the database table. The text that was extracted out of the document is inserted into the dDocFullText field, which is a BLOB field. Below is the actual SQL statement that indexer is using to do just that: 

Note that the actual text file path is used on the SQL statement! Then also note how in the last line of output above the revision state is changed to "I" to mark it as indexed.

And here's how the end result looks like if you run a select statement on the database:

While at the database, let’s check how full text index is built on the IdcColx table:

Note that the index is auto updated on commit, so as soon as the row is inserted into the table, it is ready to be returned in searches. Also note that the only full text indexed field is the dDocFullText one.

Ok, we've just traced the process of creating an index. And now let's look at...

How the index is used

Bring up your Content Server home page or a query builder page, so you see a full text search box (See screenshot below) 

(You can also use the quick search box on the top right.)

Got the page loaded? Gut. Now clear your server output before running your search so you get to see exactly what happens in the database (See screenshot below)

So that was the SQL statement. And now let me give you one last important insight.

The full text extracted from the document is tokenized per word, so the search is actually looking for word hits. So searching for "test" will return our item and searching for "te" will not. Actually, this doesn’t have much to do with the Content Server. Let me show you what the actual SQL statements will return:

Searching for "test" will happily return our record :

... and searching for "te" comes back empty-handed

I hope I've given you some useful insight and you feel clear about how things actually work with the full text indexing. Now it might be a good time to check out other free insider tips and popular Web Center technical articles, and also find out what specific techniques allow us to achieve killer results for our clients - even with the smallest budgets. Check out at http://bit.ly/TdKLA7

No comments:

Post a Comment