Categories
duplicates

C#: IEnumerable get Duplicates (Select Non-Distinct with LINQ)

Saw this page: Raucous Random Ramblings: Select Non-Distinct with Linq!

[Updated 2012-06-12]

Not too shabby.. FWIW: the GetDuplicates() from the raucous random ramblings post returns all items it considers are duplicates.

This is different than my version, which only returns items after they are found in the list. I consider my version better because: if I existed and someone made a clone of me (i.e. ‘duplicated’), then I would not consider myself contained in the set of the duplicates. 🙂 But I can understand someone using both functions for different purposes.)

Here is my version for getting duplicates from an enumerable list.
Better? You try both and then let me know what your results are..

.csharpcode, .csharpcode pre { font-size: small; color: black; font-family: Consolas, “Courier New”, Courier, Monospace; background-color: #ffffff; /*white-space: pre;*/ } .csharpcode pre { margin: 0em; } .csharpcode .rem { color: #008000; } .csharpcode .kwrd { color: #0000ff; } .csharpcode .str { color: #006080; } .csharpcode .op { color: #0000c0; } .csharpcode .preproc { color: #cc6633; } .csharpcode .asp { background-color: #ffff00; } .csharpcode .html { color: #800000; } .csharpcode .attr { color: #ff0000; } .csharpcode .alt { background-color: #f4f4f4; width: 100%; margin: 0em; } .csharpcode .lnum { color: #606060; }

/// 
///   Returns duplicate items found in the .
/// 
public static IEnumerable Duplicates( this IEnumerable sequence ) {
if ( null == sequence ) {
throw new ArgumentNullException( "sequence" );
}
var set = new HashSet();
return sequence.Where( item => !set.Add( item: item ) );
}

In the testing I slimmmed the two foreach down to “return results.SelectMany( @group => @group )” per Resharper’s suggestion.

Examples for the set of

    {1,2,3,4,5,6,7,8,9,6,4,2,2,4,6}
  • Duplicates() returns {6,4,2,4,4,6}
  • GetDuplicates() returns {2,2,2,4,4,4,6,6,6}

Speed tests!
I threw together a function timer with a massive UInt64[1048576] to check for duplicates.
Letting the tests run for a few minutes (over the same set of data)…

  • Duplicates() processes at 3338 units every millisecond.
  • GetDuplicates() processes at 1060 units every millisecond.

Remember: the test results are going to vary under different conditions, of course.
Ignoring the duplicate duplicates that both functions return with a .Distinct() did not make a noticeable timing impact.
Increasing the count of duplicates in the test data increased the speed of both function.
I do like the result set my function returns, and it does perform 3X faster than the raucous post..

One reply on “C#: IEnumerable get Duplicates (Select Non-Distinct with LINQ)”

I believe this results in a worst case of O(n^2) and a best case of O(nlog(n)) which is worse than the previous post which was O(2n) (or O(n) as constants are irrelevant) which is much faster

Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s