Finding Duplicate Items and The Duplicates Keyword

I have had a few questions around de-duping files within a SharePoint environment recently so I set off to do some research to identify a good solution.  Based on past experiences I knew that SharePoint identifies duplicates while performing an index of the content so I expected this would be part of the solution.

Upon starting my journey, I found a couple of threads on various forums where the question has been asked in the past.  The first one was “Good De-Dup tools for SharePoint” which had a link to a blog post by Gary Lapointe that offered a PowerShell script that can list every library item in a farm.  At first glance this seemed to be neat, but not helpful here.

Next I found a blog post with another handy PowerShell script.  This blog post was titled Finding Duplicate Documents in SharePoint using PowerShell.  I found this script interesting, albeit dangerous.  It will iterate through all of your site collections, sites, and libraries, hash each document, and compare the hashes to find duplicates.  However, it only identifies duplicate documents within the same location.  The overhead of running this script is going to be pretty high, and it gets a little risky when you have larger content stores.  I would be worried about running this against an environment that has hundreds of sites, or large numbers of documents.
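To make the hash-and-compare idea concrete, here is a minimal Python sketch.  It is not the PowerShell script from that post; the function name, the sample paths, and the choice of SHA-256 are all my own assumptions for illustration.  It simply groups whatever documents you hand it by a hash of their bytes, so the scope of the comparison is up to the caller.

```python
import hashlib
from collections import defaultdict


def group_by_content_hash(documents):
    """Group document names by the SHA-256 of their bytes.

    `documents` maps a name (here a made-up path) to the file content as bytes.
    Only groups with two or more members are returned.
    """
    groups = defaultdict(list)
    for name, content in documents.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical sample data: two byte-identical files and one unique file.
docs = {
    "/sites/a/Docs/plan.docx": b"quarterly plan",
    "/sites/b/Docs/plan-copy.docx": b"quarterly plan",
    "/sites/a/Docs/notes.docx": b"meeting notes",
}

for group in group_by_content_hash(docs):
    print(group)
```

Note that every document has to be read in full to be hashed, which is where the overhead concern comes from.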

Next I found an old MSDN thread named Find duplicate files which had two interesting answers.  The first was to query the database (very bad idea) and the second was a response by Paul Galvin that pointed to the duplicates keyword property, and a suggestion to execute a series of alpha wildcard searches with the duplicates keyword.  While I have used the duplicates keyword before I had never thought to use it in this context so I set out to give it a try.

As I mentioned at the beginning, SharePoint Search does identify duplicate documents.  It does this by generating a hash of the document.  Unlike the option above where the PowerShell script generates a hash, the search hash seems to separate out the metadata, so even items with unique locations, metadata, and document names can still be identified as identical documents.

When doing some tests, though, I quickly discovered that the duplicates property requires the full document URL.  This means that you would have to search in two passes.  First you would have to get a list of items to work with, and then you would need to iterate through each of those items and execute the duplicates search with a query such as duplicates:"[full document url]".
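As a rough sketch of that second pass, the Python below only builds the query text for each item.  The function names and the sample URLs are hypothetical, and actually submitting the queries to SharePoint Search is left out.

```python
def duplicates_query(document_url):
    """Return the keyword query text for one document, e.g. duplicates:"<url>"."""
    return f'duplicates:"{document_url}"'


def build_duplicate_queries(document_urls):
    """Build one duplicates query per item from the first pass."""
    return [duplicates_query(url) for url in document_urls]


# Hypothetical URLs standing in for the list gathered in the first pass.
urls = [
    "http://intranet/sites/a/Docs/plan.docx",
    "http://intranet/sites/b/Docs/notes.docx",
]

for query in build_duplicate_queries(urls):
    print(query)
```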

Conceptually there are two paths forward at this point.  The first is to try to obtain a list of all items from SharePoint Search.  Unfortunately you cannot get a full list of everything.  The best you can do is the loose title search that Paul had suggested: something like title:"a*", which returns items that have a title word starting with a.  You would then have to repeat that for every letter and number.  One extra challenge is that you will be repeatedly processing the same items, since one title can match several of those queries, unless you are using FAST Query Language and have access to the starts-with operator and can do something like title:starts-with("a").  In addition, since we are only looking for documents, it's a very good idea to also add isdocument:true to your query to ensure that only documents are returned.  Overall this is a very inefficient process.
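Here is a small Python sketch of that first path.  It generates one title query per letter and digit and then merges the results so that each item is only processed once.  The function names are hypothetical, and I am assuming that the two restrictions can simply be separated by a space in the same query; running the queries is left out.

```python
import string


def title_prefix_queries():
    """Return one query per letter and digit, restricted to documents."""
    prefixes = string.ascii_lowercase + string.digits
    return [f'title:"{prefix}*" isdocument:true' for prefix in prefixes]


def merge_unique(result_sets):
    """Flatten several lists of result URLs, keeping the first occurrence of each."""
    seen = set()
    merged = []
    for results in result_sets:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


queries = title_prefix_queries()
print(len(queries), "queries, starting with:", queries[0])

# Hypothetical results: "Annual Budget.xlsx" matches both the a* and b* queries.
results_a = ["http://intranet/Docs/Annual Budget.xlsx"]
results_b = ["http://intranet/Docs/Annual Budget.xlsx", "http://intranet/Docs/Backlog.docx"]
print(merge_unique([results_a, results_b]))
```

Even with the merge step, you still pay for every overlapping result coming back from search, which is part of why this path is so inefficient.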

An alternative would be to revisit and extend Gary’s original script to execute the duplicates search for each item.  The advantage here is that you would guarantee that you are only executing the duplicates search once for each item which would reduce the total processing and extra output information to be parsed.  The other change to Gary’s script would be to change what is written out to the log file since you would only write out the information for items that are identified as duplicates. 
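The control flow for that alternative might look something like the Python sketch below.  This is not Gary's script; the function name and the search_duplicates callable are hypothetical stand-ins for the item enumeration and the duplicates search.  I am also assuming the search may return the item itself, so it is filtered out, and I go one step further than described above by skipping items that already showed up in an earlier group.

```python
def find_duplicate_groups(item_urls, search_duplicates):
    """Run the duplicates search at most once per item and keep only real groups.

    `search_duplicates` is a caller-supplied function that takes a document URL
    and returns the URLs that search reports as duplicates of it.
    """
    reported = set()
    groups = []
    for url in item_urls:
        if url in reported:
            continue  # already listed as a duplicate of an earlier item
        matches = [match for match in search_duplicates(url) if match != url]
        if not matches:
            continue  # nothing to log for unique items
        group = [url] + matches
        reported.update(group)
        groups.append(group)
    return groups


# Hypothetical in-memory stand-in for the search index.
fake_index = {
    "http://intranet/a/plan.docx": ["http://intranet/a/plan.docx", "http://intranet/b/plan.docx"],
    "http://intranet/b/plan.docx": ["http://intranet/b/plan.docx", "http://intranet/a/plan.docx"],
    "http://intranet/a/notes.docx": ["http://intranet/a/notes.docx"],
}

for group in find_duplicate_groups(list(fake_index), lambda url: fake_index[url]):
    print(" | ".join(group))
```

Only the groups that come back from this function would be written to the log file.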

Access to Content Anywhere with SkyDrive

Like many people, my overall tech habits have changed quite a bit over the past few years.  I used to work primarily off of only one or two computers, with good separation between work and personal stuff, but the lines have gotten a bit blurry.  Microsoft has quietly been amping up the SkyDrive offering and has built it into a really powerful tool.  SkyDrive now gives me the flexibility to easily make my content accessible no matter where I am or what device I am using.

At this point I stay pretty busy between my consulting work, SharePoint community involvement, and attending various tech events.  I find myself all over the place geographically, but also using a slew of devices.  Right now I have a work laptop, a computer for home, a Surface tablet, and my phone.  Using SkyDrive I have easy access to my content on each of those devices.  There is now great support for integrating SkyDrive into both the Windows (desktop, phone) and non-Windows (iOS, Android) experiences.

When I find myself at an event I tend to rely on a tablet device for taking notes with OneNote (the best MS Office tool ever!), and storing those notebooks in SkyDrive makes the content accessible while also making sure it is properly backed up.

I also find that it is helpful for my writing tasks, whether it is this blog or the book I wrote for Packt previously.  I had easy access to my notes, both from my regular devices and from within the VMs used to support the material.  I’ve found the whole sync process pretty rock solid, and I’m now to the point that I really never use the My Documents folder on any of my Windows PCs since all of the content is stored in either SkyDrive or SharePoint (work and project related content).

Anyone else leveraging the tool?  Has it had a positive impact on your work?

SQL Saturday Charlotte

SQL Saturday #174 in Charlotte has been announced for Saturday October 27th and is shaping up to be a great event.  The overall focus will be on Microsoft BI and with the strong BI focus there will also be a number of SharePoint related sessions available.  I will be presenting a session on SharePoint’s BCS in order to demonstrate additional ways to connect your SharePoint environment to your line of business (LOB) systems.

Event and registration details can be found on the site:  http://www.sqlsaturday.com/174/eventhome.aspx

Hope to see many of you there!

SharePoint Saturday New York–Wrap-up

The SharePoint Saturday NY event was a huge success.  Great to see so many people show up and engaged in discussions.  The organizers and volunteers did a great job this year to make sure the event went off without a hitch.  Very well orchestrated!

Thank you to everyone that attended my session.  I hope you enjoyed it and found it useful in building your development skills around search and FAST Query Language. 

The source code is available here.

Congratulations to the winners of the giveaway for copies of my book, SharePoint 2010 Business Application Blueprints.

Upcoming Speaking Events

I’m happy to announce that I’ll be presenting at a couple of upcoming events.

I will be presenting my session Developing FAST Queries at SharePoint Saturday NY on Saturday July 28th at the Microsoft Offices in Manhattan.

I will also be presenting a pair of new sessions at SharePoint Saturday The Conference August 22nd-24th at the Fairview Marriott in Falls Church, VA.  My first session will be Enhanced Site Findability which covers techniques for making sites easier to find using the search features, and the second session will be Enhanced Content Personalization and Targeting which covers techniques for mastering the highly personalized and targeted content pages that perform well for all users.   This marks the third year in a row I have been selected to speak at this epic event.

I am definitely looking forward to both events and they are sure to be highlights of the summer conference events.

SharePoint 2010 Business Application Blueprints Now Available

My book, SharePoint 2010 Business Application Blueprints, has been officially released and is now available both in print and as an electronic download.

What you will learn from this book

You will see how to build the following SharePoint projects:

  • An Effective Intranet Site for your organization that maximizes the site’s ability to aggregate content and is highly effective at communicating important messages
  • A Workflow Out of Office Solution that allows users to manage their out of office dates and automate task assignments to a delegated resource
  • A Company Forms Site with the definition of form content types and organizing the forms into a usable interface 
  • An Engaging Community Site including custom features that can be used to enhance collaboration and provide an information sharing system
  • A Site Request and Provisioning System to help governance and compliance activities
  • A Project Site Template to support project initiatives and track Issues, Tasks, and Contacts
  • A Project Management Main Site that can aggregate the key metrics and status information from the project management sites previously created
  • A Task Rollup solution that can aggregate tasks from the specified sites
  • A dynamic Site Directory solution that leverages Search

Approach

The hands-on example solutions in this book are based on fictitious business development briefs, and they illustrate practical ways of using SharePoint in various business scenarios.

A chapter is dedicated to each example SharePoint solution covering step-by-step instructions for building the SharePoint solutions, aided by the extensive use of screenshots.

While it is late to the party for this product cycle, I think there are some great examples that will serve developers for years to come. 
