Friday, December 29, 2006

IFilter Text Extractor now hosted on CodePlex

My IFilter Text Extractor that I mentioned previously has now been uploaded to its new CodePlex home. If you have any enhancement requests or problems, please use the appropriate forums on CodePlex.

In VB.Net, the code to extract up to 64K of text from the file C:\Test.doc is as simple as

Dim te As ExtractTextLib.TextExtractor = New ExtractTextLib.TextExtractor
Dim sText As String
sText = te.ExtractText("C:\Test.doc", 64 * 1024)

ASP.Net RSS Toolkit published on CodePlex

In February of 2006, Dmitry Robsman released a very cool RSS Toolkit that eases both consuming and publishing RSS feeds in an ASP.Net application.

The RSS toolkit includes support for consuming as well as publishing RSS feeds in ASP.NET applications. Features include:

  • RSS Data Source control to consume feeds in ASP.NET applications
    • Works with ASP.NET data bound controls
    • Implements schema to generate columns at design time
    • Supports auto-generation of columns at runtime (via ICustomTypeDescriptor implementation)
  • Caching of downloaded feeds both in-memory and on-disk (persisted across process restarts)
  • Generation of strongly typed classes for RSS feeds (including strongly typed channel, items, image, handler) based on an RSS URL (the toolkit recognizes RSS and RDF feeds) or a file containing an RSS definition. This allows you to programmatically download (and create) RSS channels using strongly typed classes. The toolkit includes:
    • Stand-alone command line RSS compiler
    • Build provider for .rssdl file (containing the list of feed URLs)
    • Build provider for .rss file (containing RSS XML)
  • Support for generation of RSS feeds in ASP.NET applications including:
    • RSS HTTP handler (strongly typed HTTP handlers are generated automatically by the build providers) to generate the feed.
    • RSS Hyper Link control (that can point to RSS HTTP handler) to create RSS links.
    • Optional secure encoding of user name into query string to allow generation of personalized feeds
  • Set of classes for programmatic consumption and generation of RSS feed in a late-bound way, without using strongly typed generated classes

The toolkit is packaged as an assembly (DLL) that can be either placed in GAC or in ‘bin’ directory of a web application. It is also usable from client (including WinForms) applications.

RSS Toolkit works in Medium Trust (RssToolkit.dll Assembly either in GAC or in ‘bin’) with the following caveats:

  • If the ASP.NET application consumes RSS feeds, the trust level must be configured to allow outbound HTTP requests.
  • To take advantage of disk caching, there must be a directory (configurable via AppSettings["rssTempDir"]) where the trust level policy would allow write access. However, disk caching is optional.

In March, he updated it to release 1.0.0.1, adding a couple of much-needed features:

  • Added MaxItems property to RssDataSource to limit the number of items returned.
  • Added automatic generation of <link> tags from RssHyperLink control, to light up the RSS toolbar icon in IE7. For more information please see http://blogs.msdn.com/rssteam/articles/PublishersGuide.aspx
  • Added protected Context property (of type HttpContext) to RssHttpHandlerBase class, to allow access to the HTTP request while generating a feed.
  • Added generation of LoadChannel(string url) method in RssCodeGenerator so that one strongly typed channel class can be used to consume different channels.
  • Fixed problem expanding app relative (~/…) links containing query string when generating RSS feeds.

Then in November, he decided that the project needed a proper home, chose CodePlex, and solicited someone to shepherd the project. I got the nod, and have re-released his 1.0.0.1 release and posted the source in the new CodePlex project ASP.NET RSS Toolkit. I want to make absolutely clear that up to and including this release, I haven't done anything to what Dmitry released. All the good work is his! In the upcoming couple of weeks, I'll be putting his tools to good use and growing the project. If you have any ideas for what you want added, please use the project forums out on CodePlex. First on my list of things to support is full ETag and Last-Modified header support to enable bandwidth reduction. I also plan on adding event handlers to deal with redirects for feeds that move to a new home.
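For anyone wondering how ETag and Last-Modified save bandwidth, here is a minimal sketch of the conditional GET idea. It is written in Python so it can be run standalone, and it is not toolkit code: CachedFeed, conditional_headers and apply_response are hypothetical names invented for illustration. Only the header names (ETag, Last-Modified, If-None-Match, If-Modified-Since) and the 304 status come from HTTP itself.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedFeed:
    """Hypothetical cache entry: the feed body plus the validators the server sent."""
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: Optional[CachedFeed]) -> Dict[str, str]:
    """Build request headers that let the server answer 304 instead of resending the feed."""
    headers: Dict[str, str] = {}
    if cached is None:
        return headers  # nothing cached yet, so do a plain GET
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def apply_response(cached: Optional[CachedFeed], status: int,
                   response_headers: Dict[str, str], body: bytes) -> CachedFeed:
    """Return the cache entry to keep after a response arrives.

    Assumes response_headers is a plain dict keyed with the exact
    header spellings used below.
    """
    if status == 304:
        if cached is None:
            raise ValueError("got 304 without a cached copy")
        return cached  # feed unchanged: no body was transferred, reuse the cache
    if status == 200:
        return CachedFeed(body=body,
                          etag=response_headers.get("ETag"),
                          last_modified=response_headers.get("Last-Modified"))
    raise ValueError(f"unexpected HTTP status {status}")
```

The saving comes from the 304 branch: when the feed hasn't changed, the server sends headers only and the reader keeps using the copy it already has.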

Thursday, December 28, 2006

The .Net Framework 2.0 language pack install fails on Vista

I've discovered that you can't install the .Net Framework Language Packs on Vista. They all claim to have already been installed, yet the localized resources directories are not present.

Looks like the language pack checks whether the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v2.0.50727\LCID exists (where LCID is that of the pack). If the key exists, the pack declares that it has already been installed (it hasn't), so none of the localized resources get installed.

I have found an ugly workaround: note the values, delete the appropriate key, install the language pack, restore any missing values, and set any that are now wrong. I've reported this on Microsoft's Connect site, where it has been validated and the workaround entered; feel free to vote here.

Also, the Arabic language pack is built wrong. The internal INI file doesn't even provide an LCID in the key-to-check, so it balks outright. I fixed that by extracting the pack manually, fixing the INI and following the above workaround. This issue is reported on Connect here.

The language packs I'm talking about install the localized resources for the various provided languages for all the user-display strings in the .Net Framework. This includes localized versions of exception messages, the messages shown by validators, ASP.Net wizards, etc. You install these to support end-users running in their own language, especially for ASP.Net applications.

Let the whining commence

Don't get me started about how stupid it is that:

  1. Picking what language pack you want to download changes the language of the page itself.
  2. The file you download is served up as langpack.exe (not named uniquely by the locale).
  3. The language of the installer for the language pack doesn't respect the current language of the person running the installer, so you'd better watch carefully to figure out where the buttons move to when doing an RTL language, or what the error messages mean if you aren't a native speaker.
  4. There isn't a single file that you can use to install all the language packs.

Can you imagine being a vertical-market developer who needs to show his customers how to do this hackery?

Sunday, December 17, 2006

Extending the available range of DateTime in SQL Server 2005

Since I get so many hits for my various SQL DateTime articles, I thought I would point out an awesome contribution by Mladen Prajdic on building a UDT for SQL Server 2005 in CLR that gives you all that standard SQL Server DateTime behavior over the full range of the .Net DateTime class (01-01-0001 - 12-31-9999).

Nice work, Mladen!

Tuesday, December 12, 2006

COM and IDL and C++ and IFilter, oh my!

It's a blast from the past! On the Developmentor Advanced .Net mailing list recently, there was a bit of discussion about doing text extraction from Word documents. This is something I did in a past life and let me tell you it wasn't any fun.

First we tried using WinWord as an automation server. We would open the file, then save as text. This leaked resources horribly and was slow. It eventually would just lock up and the DCOM would go to sleep... not good for a product (at that time) based on an MTS/COM+ server running RDO components under IIS 5.

Next we tried about twenty or thirty "conversion" programs that were supposed to allow batch or individual conversion of files. This was even worse than using WinWord, as the fidelity of the conversion was horrible at best.

Then we tried a couple of commercial libraries for data conversion. They were better at getting the text, but honestly less stable than WinWord.

Then it hit me, in the shower as all good ideas do: doesn't Microsoft's Index Server do a ripping good job sucking out the text of documents to index them? Obviously that can't be unstable or Index Server's crawler would die all the time. So I delved a bit into how Index Server works. Turns out that it loads a COM component and uses the IFilter interface to ask a pluggable filter to extract the text. All I had to do was find an example of an IFilter consumer. Trouble was, there weren't any... sure, you could find a ton of IFilter providers to install and enable Index Server to crawl the files, but good luck finding an example of calling one.

Eventually I worked it all out and wrote a COM component of my own in C++ that loads the right IFilter driver for the file, then sucks all the text out. It was COM because the client program was a VB6 service.

This really rocked because all I had to do was find the right IFilter drivers for a document format and away it went... no configuration, just joy. And lest you think this is all old news and worthless, Microsoft's Desktop Search still uses IFilter magic. Google Desktop Search can also use IFilters with the appropriate plug-in. These days you can find plug-ins for everything from AutoCAD drawings to MP3 files (not to mention all the big boys, like Word, WordPerfect, OpenOffice, PDF, etc).

So with the recent discussion on the ADVANCED-DOTNET list, I offered and was asked to dust off the component and publish it for your use. I had to do a tiny bit of cleanup and conversion to be usable with Visual Studio, but it's up for your pleasure. It can be used from .Net programs, of course, though I might eventually get around to rewriting the tiny amount of real code as a .Net library. I haven't looked into the quality of this, but there is something on CodeProject.

Download here.

Monday, December 11, 2006

What's wrong with this code...

private void EnsurePropsOwned()
{
    if (!this.propsOwned)
    {
        this.propsOwned = true;
        if (this.properties != null)
        {
            PropertyDescriptor[] descriptorArray1 = new PropertyDescriptor[this.Count];
            Array.Copy(this.properties, 0, descriptorArray1, 0, this.Count);
            this.properties = descriptorArray1;
        }
    }
   
    if (this.needSort)
    {
        this.needSort = false;
        this.InternalSort(this.namedSort);
    }
}
  1. If you're going to take ownership of something, wouldn't it be nice to grab a copy BEFORE you claim to own it?
  2. If you are going to sort something, don't you think you should actually sort it before clearing the needs-sort flag?
The scary part is that this is code from System.ComponentModel.PropertyDescriptorCollection which is something deep inside the framework and completely impossible to debug for threading issues.
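Here is a minimal sketch of the ordering I would have expected, written in Python rather than C# so it can be run standalone. OwnedCollection and its members are hypothetical names, not the framework's, and wrapping the whole thing in a lock is my assumption about one way to make the check-then-act sequence safe, not a description of what the framework does or should do.

```python
import threading
from typing import List, Optional


class OwnedCollection:
    """Hypothetical stand-in for a collection that lazily copies and sorts a borrowed list."""

    def __init__(self, items: Optional[List[str]], need_sort: bool = True) -> None:
        self._items = items          # borrowed from the caller until we copy it
        self._owned = False
        self._need_sort = need_sort
        self._lock = threading.Lock()

    def ensure_owned(self) -> None:
        # The lock makes each check-then-act sequence atomic across threads.
        with self._lock:
            if not self._owned:
                if self._items is not None:
                    self._items = list(self._items)  # take the private copy first...
                self._owned = True                   # ...and only then claim ownership
            if self._need_sort:
                if self._items is not None:
                    self._items.sort()               # do the sort first...
                self._need_sort = False              # ...and only then clear the flag

    def items(self) -> List[str]:
        self.ensure_owned()
        return list(self._items or [])
```

With the flags set last, a second thread that sees them set can rely on the copy and the sort having already happened. In the original, a thread arriving between the flag write and the work sees "done" while the work is still pending.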

I would just report this as a bug to Microsoft, but since they have a history of just ignoring bugs (or worse yet, just closing them without reason), I'm not going to waste my time. Not that I could today anyway, since Microsoft Connect doesn't (irony!).

Friday, December 08, 2006

Today in blinding science...

Teen girls who are frequent scale-steppers often have more weight problems.

Really? Who paid for this? Can we get the money back? They couldn't even get the order of the causality right, much less isolate it from other dependencies. Sheesh.