Monday, August 17, 2009

Content Migration - How we got a project over the finish line 16.9 times faster

Last year we migrated 38 web sites and major site sections (10,710 pages in total), each in just over a week. A reputable nationwide vendor had estimated 3 to 6 months for each of them. How did we do it?

The short answer is this: we used Batch Loader. Easy enough. Am I simply comparing a manual import to the use of a tool? Nope. I'm not about to waste your time. After all, that vendor was also using Batch Loader.

Batch Loader Turbo Charged

When it comes to a mass check-in, Batch Loader is a nice and useful tool. But when you're loading tens of thousands of files from dozens of locations, and each file has unique derived values in its metadata, Batch Loader on its own won't be of much help.

I guess that almost any enterprise-scale content migration will have you fall flat on your face if you're simply relying on Batch Loader to "magically" load your content into ECM.

So what is the quickest way to automate such a migration?

It's simple. The answer becomes obvious when you look at HOW Batch Loader works. It processes a typical HDA file one record at a time. It picks up a file from the location you specify in the primaryFile field, sets the metadata values to the ones you tell it to use and calls a check-in service. Again, it reads the batch loader script one record at a time and checks in the files one by one.
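
To make that record-by-record loop concrete, here is a rough Python sketch of how such a file could be read. It is only an illustration of the format, not Batch Loader's actual code: parse_batch_file is a made-up helper, the second sample record is invented, and the real tool would call the check-in service where this sketch just prints.

```python
def parse_batch_file(text):
    """Split batch loader text into a list of {field: value} records."""
    records, current = [], {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "<<EOD>>":
            # End of one record: store it and start collecting the next one.
            if current:
                records.append(current)
            current = {}
        elif "=" in line:
            # Split on the first "=" only, so values may contain "=" themselves.
            name, value = line.split("=", 1)
            current[name.strip()] = value.strip()
    return records


# Sample input; the second record is invented for illustration.
sample = """dDocName=A2561405
dDocTitle=Migration Project Plan
primaryFile=C:/Migration/Project Plan v.3.4.doc
<<EOD>>
dDocName=A2561406
dDocTitle=Migration Budget
primaryFile=C:/Migration/Budget.xls
<<EOD>>
"""

for record in parse_batch_file(sample):
    # Batch Loader would check the file in here; this sketch only prints.
    print(record["dDocName"], "->", record["primaryFile"])
```

Seeing the file as a flat list of independent records is what makes generating it so easy: anything that can print name=value lines can write a batch file.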

What if we could create some cool batch loader script that will import all the files we want imported? All at once! Sure, that would be nice, but how do we go about creating one?

The Batch Builder utility that comes with Batch Loader is very limited. It builds a very simple file based on the contents of a single directory and lets you use file system data as metadata. It won't let you pick up files from multiple locations or create complex metadata values.

So, here's the biggie - to turbo-charge your content migration effort, you may want to consider GENERATING your own batch loader scripts.

How to use Code Generation effectively

How do you go about generating it? For a simple migration you can get away with using your editor's search and replace function on a comma-separated list of files.

Let's say your Excel file has the following columns:

Content Id, Title, Author, Security Group, Account, Doc Type, Date, File Path

After you save it in a comma-separated (CSV) format, you'll end up with something like this:

A2561405,Migration Project Plan,Bill,Public,,abstract,8/12/09 4:20 PM,C:/Migration/Project Plan v.3.4.doc

Now, you could use a RegEx like this to produce a batch loader script out of your CSV file:

Replace this:

^([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)$

( If you're new to regular expressions, this says:

- Begin at the start of a line
- Capture every character up to the next comma - repeated 8 times, once per column
- You must be at the end of the line )

With this replacement string:

dDocName=$1\ndDocTitle=$2\ndDocAuthor=$3\ndSecurityGroup=$4\ndDocAccount=$5\ndDocType=$6\ndInDate=$7\nprimaryFile=$8\n<<EOD>>\n

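You may need to test the RegEx in your own editor, as each one has a slightly different syntax (some use \1 instead of $1 for captured groups, for example). Note that this simple pattern assumes that no field value contains a comma.

After you run it, your comma-separated line will turn into an HDA entry that looks like this: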

dDocName=A2561405
dDocTitle=Migration Project Plan
dDocAuthor=Bill
dSecurityGroup=Public
dDocAccount=
dDocType=abstract
dInDate=8/12/09 4:20 PM
primaryFile=C:/Migration/Project Plan v.3.4.doc
<<EOD>>

I hope you get the idea.
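
If your editor's RegEx dialect gives you trouble, the same transformation can be sketched in a few lines of Python (my pick for illustration; Perl or Ruby would do just as well). Treat it as a starting point, not a finished tool: csv_to_batch is a made-up helper name, and the order of FIELDS is assumed to match the column order of the spreadsheet above.

```python
import csv
import io

# Assumed to be in the same order as the spreadsheet columns.
FIELDS = ["dDocName", "dDocTitle", "dDocAuthor", "dSecurityGroup",
          "dDocAccount", "dDocType", "dInDate", "primaryFile"]


def csv_to_batch(csv_text):
    """Turn CSV rows into batch loader records, one <<EOD>> block per row."""
    out = []
    # skipinitialspace tolerates a space after each comma.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} columns, got {len(row)}: {row}")
        for name, value in zip(FIELDS, row):
            out.append(f"{name}={value.strip()}")
        out.append("<<EOD>>")
    return "".join(line + "\n" for line in out)


line = ("A2561405,Migration Project Plan,Bill,Public,,abstract,"
        "8/12/09 4:20 PM,C:/Migration/Project Plan v.3.4.doc")
print(csv_to_batch(line))
```

A nice side effect of using a real CSV parser is that quoted values containing commas survive intact, which the RegEx above cannot handle. A row with the wrong number of columns raises an error instead of silently producing a broken record.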

How to scale it up

You can easily adapt this technique to any complexity. Just use Perl, Ruby or another scripting language of your choice to generate metadata values and the file names and locations.

Be sure to use subroutines, structure your code well and store it in a source-control system. A code generation script can get quite complex quite quickly.
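
As a rough idea of what a scaled-up generator might look like, here is a Python sketch that walks several source folders and derives metadata from each file's location. Everything specific in it is an assumption made for illustration: the folder paths and security groups in SOURCES, the naming rule in make_doc_name and the title rule in build_record. The field set is also trimmed; a real load needs the rest of your required metadata.

```python
import os

# Hypothetical source roots, each mapped to the security group for its files.
SOURCES = {
    "C:/Migration/Intranet": "Secure",
    "C:/Migration/PublicSite": "Public",
}


def make_doc_name(rel_path):
    """Derive a content ID from a relative path: letters and digits only."""
    stem = os.path.splitext(rel_path)[0]
    return "".join(ch if ch.isalnum() else "_" for ch in stem).upper()


def build_record(root, path, security_group):
    """Return the metadata for one file as an ordered list of (name, value)."""
    rel_path = os.path.relpath(path, root).replace(os.sep, "/")
    title = os.path.splitext(os.path.basename(path))[0].replace("_", " ")
    return [
        ("dDocName", make_doc_name(rel_path)),
        ("dDocTitle", title),
        ("dDocAuthor", "batch_loader"),  # marker value, easy to search for
        ("dSecurityGroup", security_group),
        ("primaryFile", path.replace(os.sep, "/")),
    ]


def generate_batch(sources):
    """Walk every source root and yield batch loader lines for each file."""
    for root, security_group in sources.items():
        for folder, dirs, files in os.walk(root):
            dirs.sort()  # keep the output order stable between runs
            for name in sorted(files):
                path = os.path.join(folder, name)
                for field, value in build_record(root, path, security_group):
                    yield f"{field}={value}"
                yield "<<EOD>>"


# Print the batch file; redirect the output to a file to feed Batch Loader.
for line in generate_batch(SOURCES):
    print(line)
```

Keeping the derivation rules in small functions like make_doc_name means you can test each rule on its own before you commit to an hours-long load.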

Important last-minute tips


Today, there will be three:

- You'll very likely need to debug your code generation script and run your batch load file more than once, so be sure to:
  • Add a custom metadata field or a special value like "batch_loader" for dDocAuthor so you can find your files quickly and delete them when it's time to start fresh
  • Test on a small subset (under 200 items) so you don't have to wait 8 hours for those 500 GB to import
- Be sure to CLEAR the "Clean up files after successful check in" box. If you leave it checked, your source files WILL be deleted and you won't find them in the Recycle Bin!
- Be sure to MARK the "Enable error file" box. This will create a detailed log file and a smaller batch loader script file for the files that didn't load. Absolutely essential!
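
The first tip is easy to automate. Here is a small Python sketch that cuts a trial batch out of a full batch file and stamps every record with the marker author. The helper name make_test_batch and its defaults are my own invention, and it assumes you're happy to overwrite the real dDocAuthor for the trial run only.

```python
def make_test_batch(batch_text, limit=200, author="batch_loader"):
    """Keep the first `limit` records and stamp each with a marker author."""
    out, record, kept = [], [], 0
    for raw_line in batch_text.splitlines():
        line = raw_line.strip()
        if line == "<<EOD>>":
            if record:
                # Drop the original author and add the marker value instead.
                record = [field_line for field_line in record
                          if not field_line.startswith("dDocAuthor=")]
                record.append(f"dDocAuthor={author}")
                out.extend(record + ["<<EOD>>"])
                kept += 1
                record = []
            if kept >= limit:
                break
        elif line:
            record.append(line)
    return "".join(field_line + "\n" for field_line in out)


# Five invented records, trimmed down to a two-record trial batch.
full = "".join(
    f"dDocName=DOC{i}\ndDocAuthor=Bill\n"
    f"primaryFile=C:/Migration/f{i}.doc\n<<EOD>>\n"
    for i in range(5)
)
print(make_test_batch(full, limit=2))
```

Once the trial load looks right, search for the marker author, delete those items, and run the full batch with the real values.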





That's all for now.

Happy Migrating!