Tuesday, December 12, 2006

COM and IDL and C++ and IFilter, oh my!

It's a blast from the past! On the DevelopMentor Advanced .NET mailing list recently, there was a bit of discussion about doing text extraction from Word documents. This is something I did in a past life, and let me tell you, it wasn't any fun.

First we tried using WinWord as an automation server. We would open the file, then save it as text. This leaked resources horribly and was slow. Eventually it would just lock up and the DCOM would go to sleep... not good for a product (at that time) based on an MTS/COM+ server running RDO components under IIS 5.

Next we tried about twenty or thirty "conversion" programs that were supposed to allow batch or individual conversion of files. This was even worse than using WinWord, as the fidelity of the conversion was horrible at best.

Then we tried a couple of commercial libraries for data conversion. They were better at getting the text, but honestly less stable than WinWord.

Then it hit me (in the shower, as all good ideas do): doesn't Microsoft's Index Server do a ripping good job of sucking the text out of documents to index them? Obviously that can't be unstable, or Index Server's crawler would die all the time. So I delved a bit into how Index Server works. It turns out that it loads a COM component and uses the IFilter interface to ask a pluggable filter to extract the text. All I had to do was find an example of an IFilter consumer. Trouble was, there weren't any... sure, you could find a ton of IFilter providers to install so that Index Server could crawl more file types, but good luck finding an example of calling one.

Eventually I worked it all out and wrote a COM component of my own in C++ that loads the right IFilter driver for the file, then sucks all the text out. It was COM because the client program was a VB6 service.
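For the curious, here is a rough, portable sketch of the shape of that consumer loop. It is not the code from the component. On Windows the real thing goes through COM (LoadIFilter, then IFilter::GetChunk and IFilter::GetText until they report FILTER_E_END_OF_CHUNKS and FILTER_E_NO_MORE_TEXT). Everything named below (TextFilter, FakeFilter, Status, ChunkKind, ExtractText) is a hypothetical stand-in invented for illustration so the loop can be compiled and run anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical status codes. They play the roles that S_OK,
// FILTER_E_END_OF_CHUNKS and FILTER_E_NO_MORE_TEXT play in the real API.
enum class Status { Ok, EndOfChunks, NoMoreText };

// Hypothetical chunk kinds: text to extract vs. property values to skip.
enum class ChunkKind { Text, Value };

// Simplified stand-in for the COM IFilter interface (an assumption,
// not the real declaration).
struct TextFilter {
    virtual ~TextFilter() = default;
    // Advance to the next chunk and report what kind it is.
    virtual Status GetChunk(ChunkKind& kind) = 0;
    // Copy up to maxChars of the current chunk's remaining text into buffer.
    virtual Status GetText(std::string& buffer, std::size_t maxChars) = 0;
};

using Chunks = std::vector<std::pair<ChunkKind, std::string>>;

// In-memory fake filter so the loop can be exercised without Windows.
class FakeFilter : public TextFilter {
public:
    explicit FakeFilter(Chunks chunks) : chunks_(std::move(chunks)) {}

    Status GetChunk(ChunkKind& kind) override {
        if (next_ >= chunks_.size()) return Status::EndOfChunks;
        current_ = next_++;
        offset_ = 0;
        hasCurrent_ = true;
        kind = chunks_[current_].first;
        return Status::Ok;
    }

    Status GetText(std::string& buffer, std::size_t maxChars) override {
        if (!hasCurrent_) return Status::NoMoreText;
        const std::string& text = chunks_[current_].second;
        if (offset_ >= text.size()) return Status::NoMoreText;
        buffer = text.substr(offset_, maxChars);
        offset_ += buffer.size();
        return Status::Ok;
    }

private:
    Chunks chunks_;
    std::size_t next_ = 0;
    std::size_t current_ = 0;
    std::size_t offset_ = 0;
    bool hasCurrent_ = false;
};

// The consumer loop: walk every chunk, skip non-text chunks, and drain
// each text chunk through a small buffer until the filter has no more text.
std::string ExtractText(TextFilter& filter, std::size_t bufferChars = 8) {
    std::string result;
    ChunkKind kind = ChunkKind::Text;
    while (filter.GetChunk(kind) == Status::Ok) {
        if (kind != ChunkKind::Text) continue;  // ignore property values
        std::string buffer;
        while (filter.GetText(buffer, bufferChars) == Status::Ok) {
            result += buffer;
        }
        result += '\n';  // keep chunks from running together
    }
    return result;
}
```

The point of the two nested loops is that a filter hands back a document as a series of chunks, and each chunk's text may need several reads to drain. A real consumer would also have to deal with COM initialization, wide-character buffers, and filters that fail to load, none of which this sketch attempts.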

This really rocked because all I had to do was find the right IFilter drivers for a document format and away it went... no configuration, just joy. And lest you think this is all old news and worthless, Microsoft's Desktop Search still uses IFilter magic. Google Desktop Search can also use IFilters with the appropriate plug-in. These days you can find plug-ins for everything from AutoCAD drawings to MP3 files (not to mention all the big boys, like Word, WordPerfect, OpenOffice, PDF, etc.).

So with the recent discussion on the ADVANCED-DOTNET list, I offered, and was asked, to dust off the component and publish it for your use. I had to do a tiny bit of cleanup and conversion to make it usable with Visual Studio, but it's up for your pleasure. It can be used from .NET programs, of course, though I might eventually get around to rewriting the tiny amount of real code as a .NET library. There is also something along these lines on CodeProject, though I haven't looked into its quality.

Download here.

5 comments:

IDisposable said...

I've hosted this project on CodePlex. See this.

Marc Brooks said...

It might be possible, but I haven't played with this stuff in a long time...

wei min said...

Hi Marc,

I've tried your IFilter implementation for C++ and it works great for most cases.

However, I've encountered a problem with the latest PDF 10.1 IFilter. It seems to fail at loadIFilter. The strange thing is that this IFilter works fine for Windows Search. Btw, I tested it on Win7, not sure if this would affect anything.

Any ideas?

Anonymous said...

Adobe Reader 10 (X) does not come with an IFilter anymore. Only the 9.5 version does.