Extending the FAST Search Processing Pipeline

One of the major benefits of FAST Search Server 2010 for SharePoint (FS4SP) is the ability to extend the item processing pipeline and programmatically modify existing crawled properties, or populate new ones, for each document. This may sound complicated at first, but in practice it’s not hard. In this blog post I’m going to show how to integrate a C# console application into the processing pipeline and use custom logic to populate an additional crawled property for each item in the search index.

Let’s say we have a number of SharePoint project sites, each containing information about a different digital camera model. We’d like to tag every document located within a project site with the project name (the camera model) in the search index, without adding any extra metadata to the SharePoint items.

To accomplish that, we are going to populate a custom crawled property called Project by extracting the project name from site URLs that match a specific pattern:

  • http://intranet/sites/sp2010pillars/Projects/M300/
  • http://intranet/sites/sp2010pillars/Projects/M400/
  • http://intranet/sites/sp2010pillars/Projects/M500/
  • http://intranet/sites/sp2010pillars/Projects/X200/
  • http://intranet/sites/sp2010pillars/Projects/X250/
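The extraction rule implied by these URLs can be sketched in a few lines of C#. This is only an illustration, not the code deployed later in the post: the class and method names are hypothetical, and the base address is assumed to be exactly the one shown in the list above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProjectUrlParser
{
    // Assumption: all project sites live under this base address.
    private const string ProjectsRoot = "http://intranet/sites/sp2010pillars/Projects/";

    // Captures the first path segment that follows the Projects root.
    private static readonly Regex ProjectPattern = new Regex(
        "^" + Regex.Escape(ProjectsRoot) + "(?<project>[^/]+)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the project name, or null when the URL is not under a project site.
    public static string ExtractProjectName(string url)
    {
        if (string.IsNullOrEmpty(url))
        {
            return null;
        }

        Match match = ProjectPattern.Match(url);
        return match.Success ? match.Groups["project"].Value : null;
    }

    public static void Main()
    {
        string[] samples =
        {
            "http://intranet/sites/sp2010pillars/Projects/M300/",
            "http://intranet/sites/sp2010pillars/Projects/X250/Shared Documents/spec.docx",
            "http://intranet/sites/sp2010pillars/Pages/Default.aspx"
        };

        foreach (string url in samples)
        {
            Console.WriteLine("{0} -> {1}", url, ExtractProjectName(url) ?? "(no match)");
        }
    }
}
```

Because only the first segment after Projects/ is captured, a document stored deeper inside a project site would still be tagged with that site’s project name.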

First of all, we need to create the crawled property to be populated. It is good practice to create a new crawled property category so that custom crawled properties don’t get mixed up with SharePoint or any other properties in the search index schema. Since crawled property categories are uniquely identified by a GUID, we need to generate a new one. One option is to use Visual Studio 2010 for that: Tools -> Create GUID.

Next we’ll use PowerShell to create the new category, called Custom, and add the new Project crawled property to it. In the next blog post I’m planning to show how to add a new refiner to the FAST Search Center based on the values of the Project crawled property, so let’s also create a managed property with refinement enabled and map the crawled property to it.
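If Visual Studio is not at hand, any GUID generator will do. As a minimal sketch, the C# snippet below prints a new GUID in the braced, upper-case form used in the PowerShell script that follows; the class name is hypothetical.

```csharp
using System;

public static class PropSetGuid
{
    // Returns a new random GUID formatted as {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}.
    public static string Create()
    {
        return Guid.NewGuid().ToString("B").ToUpperInvariant();
    }

    public static void Main()
    {
        Console.WriteLine(Create());
    }
}
```

Generate the value once and reuse it everywhere the category is referenced; a fresh GUID on each run would identify a different category.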

Add-PSSnapin Microsoft.FASTSearch.Powershell -ErrorAction SilentlyContinue

$guid = "{21FDF551-3231-49C3-A04C-A258052C4B68}"
New-FASTSearchMetadataCategory -Name Custom -Propset $guid

$crawledproperty = New-FASTSearchMetadataCrawledProperty -Name Project -Propset $guid -Varianttype 31
$managedproperty = New-FASTSearchMetadataManagedProperty -Name Project -type 1 -description "Project name extracted from the SharePoint site url"

Set-FASTSearchMetadataManagedProperty -ManagedProperty $managedproperty -Refinement 1
New-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $managedproperty -CrawledProperty $crawledproperty

Now we are ready to create the console application that contains our custom logic.

The following code reads the url input crawled property value, checks whether it matches our project site URL pattern and, if it does, extracts the project name from the URL.

using System;
using System.Linq;
using System.Xml.Linq;
using System.Text.RegularExpressions;

namespace Contoso.ProjectNameExtractor
{
    class Program
    {
        // special property set GUID that contains the url crawled property
        public static readonly Guid PROPERTYSET_SPECIAL = new Guid("11280615-f653-448f-8ed8-2915008789f2");

        // Custom crawled property category GUID that contains the Project crawled property
        public static readonly Guid PROPERTYSET_CUSTOM = new Guid("21FDF551-3231-49C3-A04C-A258052C4B68");

        // crawled property name to be populated
        public const string PROPERTYNAME_PROJECT = "Project";

        // args[0] = path of the input XML file, args[1] = path of the output XML file
        static void Main(string[] args)
        {
            XDocument inputDoc = XDocument.Load(args[0]);

            // retrieve the url input property value (null if the item has no url property)
            string url = (from cp in inputDoc.Descendants("CrawledProperty")
                          where new Guid(cp.Attribute("propertySet").Value).Equals(PROPERTYSET_SPECIAL) &&
                          cp.Attribute("propertyName").Value == "url" &&
                          cp.Attribute("varType").Value == "31"
                          select cp.Value).FirstOrDefault();

            XElement outputElement = new XElement("Document");

            // project site url regex: matches the first path segment after /Projects/.
            // The host must be the one the crawler sees (http://intranet/... in this example).
            Match urlMatch = Regex.Match(url ?? string.Empty, "(?<=^http://intranet/sites/sp2010pillars/Projects/)[^/]+", RegexOptions.IgnoreCase);
            if (urlMatch.Success)
            {
                // populate the custom Project crawled property
                outputElement.Add(
                    new XElement("CrawledProperty",
                        new XAttribute("propertySet", PROPERTYSET_CUSTOM),
                        new XAttribute("propertyName", PROPERTYNAME_PROJECT),
                        new XAttribute("varType", 31),
                        urlMatch.Value)
                        );
            }

            outputElement.Save(args[1]);
        }
    }
}
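Before deploying, it can help to exercise the input/output contract locally. The sketch below builds a sample input document in the shape the code above reads (a Document root with CrawledProperty children) and applies the same kind of transformation in memory. The class and method names, the sample URL and the simplified /Projects/ pattern are all illustrative assumptions; the files FAST actually passes to the executable may contain more than is shown here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class PipelineContractSketch
{
    private const string SpecialPropSet = "11280615-f653-448f-8ed8-2915008789f2";
    private const string CustomPropSet = "21FDF551-3231-49C3-A04C-A258052C4B68";

    // Builds a sample input document carrying a single url crawled property.
    public static XDocument BuildSampleInput(string url)
    {
        return new XDocument(
            new XElement("Document",
                new XElement("CrawledProperty",
                    new XAttribute("propertySet", SpecialPropSet),
                    new XAttribute("propertyName", "url"),
                    new XAttribute("varType", 31),
                    url)));
    }

    // Reads the url property and returns an output document that contains
    // a Project crawled property only when the URL is under /Projects/.
    public static XElement Transform(XDocument input)
    {
        string url = input.Descendants("CrawledProperty")
            .Where(cp => (string)cp.Attribute("propertyName") == "url")
            .Select(cp => cp.Value)
            .FirstOrDefault();

        XElement output = new XElement("Document");
        if (url == null)
        {
            return output;
        }

        // Simplified pattern for illustration: first path segment after /Projects/.
        Match match = Regex.Match(url, "/Projects/([^/]+)", RegexOptions.IgnoreCase);
        if (match.Success)
        {
            output.Add(
                new XElement("CrawledProperty",
                    new XAttribute("propertySet", CustomPropSet),
                    new XAttribute("propertyName", "Project"),
                    new XAttribute("varType", 31),
                    match.Groups[1].Value));
        }

        return output;
    }

    public static void Main()
    {
        XDocument input = BuildSampleInput(
            "http://intranet/sites/sp2010pillars/Projects/M300/Shared Documents/spec.docx");

        Console.WriteLine(input);
        Console.WriteLine(Transform(input));
    }
}
```

Saving the sample input to a file and passing it to the real executable, along with an output path, gives a quick end-to-end check without waiting for a crawl.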

At this point we are ready to deploy the application to the FAST Search servers. In order to do that we need to copy the executable to each FAST server running document processors and modify the pipelineextensibility.xml file located in the FASTSearch\etc folder on each of those servers. Keep in mind that the pipelineextensibility.xml file can get overwritten if you install a FAST Search Server 2010 for SharePoint update or service pack. Below is the file content assuming that the executable is located in the FASTSearch\bin folder:

<PipelineExtensibility>
	<Run command="Contoso.ProjectNameExtractor.exe %(input)s %(output)s">
		<Input>
			<CrawledProperty propertySet="11280615-f653-448f-8ed8-2915008789f2" varType="31" propertyName="url"/>
		</Input>
		<Output>
			<CrawledProperty propertySet="21FDF551-3231-49C3-A04C-A258052C4B68" varType="31" propertyName="Project"/>
		</Output>
	</Run>
</PipelineExtensibility>

Once all of the above is in place, run the psctrl reset command in the Microsoft FAST Search Server 2010 for SharePoint shell so that the document processors pick up the new configuration, then start a full crawl of the SharePoint content source. When the full crawl is complete, run a search query for “digital camera” and check the Project property value in the results.

The managed property is populated with the expected values, that is, the project names taken from the site URLs. In the next post I’ll show how to use this new property as a custom refiner in the FAST Search Center.
