Thursday, September 26, 2013

SharePoint 2010 Enterprise Search File Types Inclusion - not Exclusion

I recently had a client who wanted to move their internet-facing search infrastructure from the expensive Google Search Appliance (GSA) to SharePoint 2010 Enterprise Search. After creating and configuring the content sources, I launched a full crawl. Once the crawl completed, I noticed that my index contained only ~66% of the content the GSA had indexed.

As I dug into the GSA configuration, I came across a frustrating scenario. GSA has a list of file extensions to EXCLUDE from crawls, and it was a very short list (jpg, gif, mov, avi, mp3, etc.). My problem was that SharePoint 2010 Enterprise Search's file type list is an Inclusion list, as opposed to an Exclusion list. Out of the box (OOTB), the pre-populated list contains ~20 Office and other common document extensions. This list helps with indexing collaboration documents and content without indexing lots of files which collaboration users would never require.

While that's a time saver for those implementing standard Enterprise Search in an intranet scenario, it is not a very good model for indexing external/public-facing content (internet sites, line-of-business applications/databases) for internet users, as that content is often spread across a wide range of file types.
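
To make the difference concrete, here is a minimal sketch in plain PowerShell (no SharePoint required). The function name Test-Crawled, the sample file names, and the extension lists are all made up for illustration; this only models the include/exclude logic and is not how the crawler itself is implemented.

```powershell
# Hypothetical helper: would a file be crawled under the given list mode?
function Test-Crawled {
    param(
        [string]   $FileName,
        [string[]] $Extensions,     # the configured file type list
        [bool]     $IsIncludeList   # $true = inclusion list, $false = exclusion list
    )
    # Extension without the leading dot, lower-cased
    $ext = [System.IO.Path]::GetExtension($FileName).TrimStart('.').ToLowerInvariant()
    $onList = $Extensions -contains $ext
    if ($IsIncludeList) { return $onList }
    return (-not $onList)
}

$files = 'report.docx', 'spec.xyz', 'logo.jpg'

# Inclusion list: only listed types are crawled, so the unlisted .xyz is skipped.
$files | Where-Object { Test-Crawled $_ @('docx', 'pdf') $true }

# Exclusion list: everything is crawled except the listed types.
$files | Where-Object { Test-Crawled $_ @('jpg', 'gif') $false }
```

With an inclusion list, every unusual file type on a public site has to be anticipated in advance. With an exclusion list, you only have to name the few types you know you don't want.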

For those of you out there who are screaming "my 2010 SharePoint Farm contains an exclusion list for crawled file types", I bet you are running FAST Search as opposed to Enterprise Search. FAST for SharePoint (F4SP) does indeed have an exclusion list rather than an inclusion list.

After a bit of digging, I was able to find a post which detailed a solution for replacing Enterprise Search's Inclusion list with an Exclusion list (flipping the scenario upside-down and providing the same configuration as GSA).

Thanks to Allen Wang's and Venkat's posts: SharePoint 2010 Search File Type Include or Exclude
Thanks to Venkat's post for the PowerShell Solution: SharePoint 2010 Enterprise search to maintain Exclusion List for Crawled file Types Instead of Inclusion List

<Excerpted from above blog>

To flip the current Search Service Application so that it maintains an Exclusion File Types list instead of an Inclusion list, run the commands below in the SharePoint PowerShell console:

  • Get your Search Service Application by its Application Class ID:
$sa = Get-SPServiceApplication | where { $_.ApplicationClassId -eq "52547a3d-66ed-468e-b00a-8c4a3ec7d404" }


  • Set the Search Service Application to maintain Excluded File Types: (Run the below command in SharePoint PowerShell Console:)

$sa.SetIsExtensionIncludeList($sa.GetVersion(),0);

  • Stop and Start Search: (Run the below command in SharePoint PowerShell Console:)

net stop OSearch14
net start OSearch14

  • Remove the existing File Types: (Run the below command in SharePoint PowerShell Console:)
*Replace "SSA" with the name of your Search Service Application*

$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "SSA"
$content = New-Object Microsoft.Office.Server.Search.Administration.Content($ssa)
$extList = $content.ExtensionList
# Copy the extensions into a separate list first, so that deleting
# them does not modify the collection while it is being enumerated
$list = New-Object System.Collections.ArrayList
foreach ($ext in $extList)
{
    $list.Add($ext) | Out-Null   # suppress the index that Add() returns
}
for ($i = 0; $i -lt $list.Count; $i++)
{
    $ext = $list[$i]
    $ext.FileExtension   # print the extension being removed
    $ext.Delete()
}

  • Run a full crawl on the content source. You should now see all pages being crawled except the file types in the exclusion list.
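
The steps above leave the file type list empty, so as far as I can tell nothing is excluded until you add entries to it. Below is a rough sketch of populating the (now) exclusion list, ideally before the full crawl. It assumes the standard New-SPEnterpriseSearchCrawlExtension and Get-SPEnterpriseSearchCrawlExtension cmdlets in the SharePoint 2010 Management Shell. "SSA" is the same placeholder name used above, and the extensions are just the sample ones from my GSA list, so adjust both for your environment.

```powershell
# Example exclusion list (an assumption; use your own)
$excluded = @('jpg', 'gif', 'mov', 'avi', 'mp3')

# "SSA" is a placeholder for the name of your Search Service Application
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "SSA"

# Extensions already on the list, so re-running does not try to add duplicates
$existing = @(Get-SPEnterpriseSearchCrawlExtension -SearchApplication $ssa |
    ForEach-Object { $_.FileExtension })

foreach ($name in $excluded)
{
    if ($existing -notcontains $name)
    {
        New-SPEnterpriseSearchCrawlExtension -SearchApplication $ssa -Name $name | Out-Null
    }
}

# Review the resulting list
Get-SPEnterpriseSearchCrawlExtension -SearchApplication $ssa | Select-Object FileExtension
```

I have only sketched this against the documented cmdlets, so try it on a non-production farm first.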
