Tag: PowerShell

Finding Duplicate Items and The Duplicates Keyword

I have had a few questions recently about de-duping files within a SharePoint environment, so I set off to do some research to identify a good solution. Based on past experience, I knew that SharePoint identifies duplicates while indexing content, so I expected this would be part of the solution.

Upon starting my journey, I found a couple of threads on various forums where the question has been asked in the past.  The first one was “Good De-Dup tools for SharePoint” which had a link to a blog post by Gary Lapointe that offered a PowerShell script that can list every library item in a farm.  At first glance this seemed to be neat, but not helpful here.

Next I found a blog post with another handy PowerShell script. This blog post was titled Finding Duplicate Documents in SharePoint using PowerShell. I found this script interesting, albeit dangerous. It iterates through all of your site collections, sites, and libraries, hashes each document, and compares the hashes to find duplicates. However, it only identifies duplicate documents within the same location. The overhead of running this script is going to be pretty high, and it gets risky with larger content stores. I would be worried about running it against an environment that has hundreds of sites or large numbers of documents.
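To make the hash-and-compare idea concrete, here is a minimal sketch that works on in-memory byte arrays rather than real SharePoint files. It is not the script from that blog post: the function names Get-ContentHash and Find-DuplicateGroup, the choice of SHA-256, and the sample URLs are all assumptions made for illustration.

```powershell
# Sketch only: group documents by a hash of their content.
# Get-ContentHash and Find-DuplicateGroup are hypothetical helper names.
function Get-ContentHash {
    param([byte[]]$Bytes)
    $sha = [System.Security.Cryptography.SHA256]::Create()
    try {
        # Format the digest as a hex string with the dashes removed.
        return ([System.BitConverter]::ToString($sha.ComputeHash($Bytes))).Replace('-', '')
    }
    finally {
        $sha.Dispose()
    }
}

function Find-DuplicateGroup {
    param([hashtable]$Documents)   # key: document URL, value: byte[] content
    $Documents.GetEnumerator() |
        ForEach-Object {
            New-Object PSObject -Property @{
                Url  = $_.Key
                Hash = Get-ContentHash -Bytes $_.Value
            }
        } |
        Group-Object -Property Hash |
        Where-Object { $_.Count -gt 1 }   # keep only hashes shared by 2+ documents
}

# Made-up sample data: two documents share content, one does not.
$docs = @{
    'http://server/sites/a/report.docx'    = [System.Text.Encoding]::UTF8.GetBytes('same content')
    'http://server/sites/b/report-v2.docx' = [System.Text.Encoding]::UTF8.GetBytes('same content')
    'http://server/sites/a/notes.docx'     = [System.Text.Encoding]::UTF8.GetBytes('different content')
}

Find-DuplicateGroup -Documents $docs |
    ForEach-Object { $_.Group | ForEach-Object { $_.Url } }
```

Even in this toy form the cost is visible: every byte of every document has to be read and hashed, which is why running something like this across a large farm is a concern.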

Next I found an old MSDN thread named Find duplicate files which had two interesting answers.  The first was to query the database (very bad idea) and the second was a response by Paul Galvin that pointed to the duplicates keyword property, and a suggestion to execute a series of alpha wildcard searches with the duplicates keyword.  While I have used the duplicates keyword before I had never thought to use it in this context so I set out to give it a try.

As I mentioned at the beginning, SharePoint Search does identify duplicate documents. It does this by generating a hash of the document. Unlike the option above, where the PowerShell script generates the hash, the search hash seems to exclude the metadata, so even items with unique locations, metadata, and document names can still be identified as identical documents.

When doing some tests, though, I quickly discovered that the duplicates property requires the full document URL. This means that you would have to execute a recursive search. First you would have to get a list of items to work with, and then you would need to iterate through each of those items and execute the duplicates search with a query such as duplicates:"[full document url]".

Conceptually there are two paths forward at this point. The first is to try to obtain a list of all items from SharePoint Search. Unfortunately you cannot get a full list of everything. The best you can do is the loose title search that Paul had suggested: something like title:"a*", which would return all items with a title word beginning with a. You would then have to do that for all letters and numbers. One extra challenge is that you will be repeatedly processing the same items, since a multi-word title can match several prefixes, unless you are using FAST Query Language and have access to the starts-with operator and can do something like title:starts-with("a"). In addition, since we are only looking for documents, it's an extremely good idea to also add isdocument:true to your query to ensure that only documents are returned. Overall this is a very inefficient process.
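The wildcard sweep can be sketched as follows. This only builds the query strings and shows one way to skip items that come back under more than one prefix; it does not call SharePoint Search. The helper name Select-UnseenUrl and the sample URLs are assumptions for illustration.

```powershell
# Sketch only: one prefix query per letter and digit, restricted to documents.
$prefixes = [char[]]'abcdefghijklmnopqrstuvwxyz0123456789'
$queries = foreach ($p in $prefixes) {
    'title:"{0}*" isdocument:true' -f $p
}

# Hypothetical helper: emit only the URLs that have not been seen before.
# PowerShell hashtable keys are case-insensitive, so URL casing does not matter.
function Select-UnseenUrl {
    param([string[]]$Urls, [hashtable]$Seen)
    foreach ($u in $Urls) {
        if (-not $Seen.ContainsKey($u)) {
            $Seen[$u] = $true
            $u
        }
    }
}

# Made-up result batches: "Annual Budget.xlsx" matches both the a* and b* queries.
$seen   = @{}
$batchA = @('http://server/docs/Annual Budget.xlsx', 'http://server/docs/Agenda.docx')
$batchB = @('http://server/docs/Annual Budget.xlsx', 'http://server/docs/Backlog.docx')

$newFromA = @(Select-UnseenUrl -Urls $batchA -Seen $seen)
$newFromB = @(Select-UnseenUrl -Urls $batchB -Seen $seen)
```

That is 36 queries before any duplicates search has run, and the de-duplication step only hides the repeated results; the search service still has to return them.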

An alternative would be to revisit and extend Gary’s original script to execute the duplicates search for each item.  The advantage here is that you would guarantee that you are only executing the duplicates search once for each item which would reduce the total processing and extra output information to be parsed.  The other change to Gary’s script would be to change what is written out to the log file since you would only write out the information for items that are identified as duplicates. 
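A rough sketch of that loop is below. It is not Gary's script: the list of document URLs stands in for whatever his enumeration produces, and the search call is passed in as a script block so the example runs without a farm. The function name Find-SearchDuplicate, the -Search parameter, and the fake index are all hypothetical. Skipping URLs that were already reported as someone else's duplicate is an extra assumption on my part, added to keep the log short.

```powershell
# Sketch only: run one duplicates query per document and emit only real hits.
function Find-SearchDuplicate {
    param(
        [string[]]$DocumentUrls,
        [scriptblock]$Search    # hypothetical: takes a query string, returns matching URLs
    )
    $reported = @{}
    foreach ($url in $DocumentUrls) {
        # Assumption: a document already listed in a duplicate set is not queried again.
        if ($reported.ContainsKey($url)) { continue }

        $query  = 'duplicates:"{0}"' -f $url
        $others = @(& $Search $query | Where-Object { $_ -and ($_ -ne $url) })

        if ($others.Count -gt 0) {
            $reported[$url] = $true
            foreach ($o in $others) { $reported[$o] = $true }
            # Only items with duplicates are written out.
            New-Object PSObject -Property @{ Url = $url; Duplicates = $others }
        }
    }
}

# Stand-in for the search service, keyed by query text (made-up data).
$fakeIndex = @{
    'duplicates:"http://server/a.docx"' = @('http://server/b.docx')
    'duplicates:"http://server/b.docx"' = @('http://server/a.docx')
    'duplicates:"http://server/c.docx"' = @()
}
$stubSearch = { param($q) $fakeIndex[$q] }

$found = @(Find-SearchDuplicate `
    -DocumentUrls 'http://server/a.docx', 'http://server/b.docx', 'http://server/c.docx' `
    -Search $stubSearch)
```

In a real environment the script block would wrap a call into SharePoint Search, and the emitted objects could be piped to Export-Csv in place of the original log output.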

Bulk Updates of User Profile Properties

This past week fellow SharePoint MVP Yaroslav Pentsarskyy posted an excellent PowerShell script for doing bulk updates of User Profile properties. The Bulk Update SharePoint 2010 User Profile Properties post offers a great script that makes it extremely easy to populate any new fields that are not set to synchronize.

My team has been doing a lot of client work promoting the use of User Profiles within customizations or to drive business processes. For a quick overview, check out my blog post User Profiles – Driving Business Process or sit in on my Developing Reusable Workflow Features presentation at SharePoint Saturday NY on July 30th or SharePoint Saturday The Conference 2011, August 11-13th.

This is another great example of the value that PowerShell can bring to building and maintaining a high-functioning SharePoint environment.

THE SharePoint PowerShell Module (SPoshMod)

The first release of THE SharePoint PowerShell Module (SPoshMod) is now available on CodePlex.

As I mentioned in a previous post, PowerShell is a great tool for SP Admins and Developers to learn. It can really make administration and content deployment tasks easier and repeatable.

Take a look and provide feedback to the team!

PowerShell and SharePoint

I come from a development background, but over the past few years I have been spending a lot more time on system administration with applications like SharePoint. While there are some good command line tools, I am not a great command line scripter. Getting proper code flow when writing batch files leaves a lot to be desired when you have the developer’s mindset. Until recently the only alternative was to write .NET code, which seems like overkill for most operations. I think PowerShell really fills this gap well, providing access to managed code APIs and assemblies while also offering the nimbleness of scripting and interactive sessions.

The past month or two I have been working on scripts that can do everything from creating a new web application to deploying solutions and content types. These scripts can make deployment a whole lot easier and more effective.

After I get a chance to fully test my scripts I’ll be sure to post them here or on CodePlex. If you have not yet dived in, now is a good time.

Resources:
CodePlex PowerShell Resources
The PowerShell Guy
Windows PowerShell Blog
SharePoint Dev Wiki – Scripting
