user807566
user807566

Reputation: 2866

How to Quickly Remove Items From a List

I am looking for a way to quickly remove items from a C# List<T>. The documentation states that the List.Remove() and List.RemoveAt() operations are both O(n)

This is severely affecting my application.

I wrote a few different remove methods and tested them all on a List<String> with 500,000 items. The test cases are shown below...


Overview

I wrote a method that would generate a list of strings that simply contains string representations of each number ("1", "2", "3", ...). I then attempted to remove every 5th item in the list. Here is the method used to generate the list:

private List<String> GetList(int size)
{
    List<String> myList = new List<String>();
    for (int i = 0; i < size; i++)
        myList.Add(i.ToString());
    return myList;
}

Test 1: RemoveAt()

Here is the test I used to test the RemoveAt() method.

private void RemoveTest1(ref List<String> list)
{
     for (int i = 0; i < list.Count; i++)
         if (i % 5 == 0)
             list.RemoveAt(i);
}

Test 2: Remove()

Here is the test I used to test the Remove() method.

private void RemoveTest2(ref List<String> list)
{
     List<int> itemsToRemove = new List<int>();
     for (int i = 0; i < list.Count; i++)
        if (i % 5 == 0)
             list.Remove(list[i]);
}

Test 3: Set to null, sort, then RemoveRange

In this test, I looped through the list one time and set the to-be-removed items to null. Then, I sorted the list (so null would be at the top), and removed all the items at the top that were set to null. NOTE: This reordered my list, so I may have to go put it back in the correct order.

private void RemoveTest3(ref List<String> list)
{
    int numToRemove = 0;
    for (int i = 0; i < list.Count; i++)
    {
        if (i % 5 == 0)
        {
            list[i] = null;
            numToRemove++;
        }
    }
    list.Sort();
    list.RemoveRange(0, numToRemove);
    // Now they're out of order...
}

Test 4: Create a new list, and add all of the "good" values to the new list

In this test, I created a new list, and added all of my keep-items to the new list. Then, I put all of these items into the original list.

private void RemoveTest4(ref List<String> list)
{
   List<String> newList = new List<String>();
   for (int i = 0; i < list.Count; i++)
   {
      if (i % 5 == 0)
         continue;
      else
         newList.Add(list[i]);
   }

   list.RemoveRange(0, list.Count);
   list.AddRange(newList);
}

Test 5: Set to null and then FindAll()

In this test, I set all the to-be-deleted items to null, then used the FindAll() feature to find all the items that are not null

private void RemoveTest5(ref List<String> list)
{
    for (int i = 0; i < list.Count; i++)
       if (i % 5 == 0)
           list[i] = null;
    list = list.FindAll(x => x != null);
}

Test 6: Set to null and then RemoveAll()

In this test, I set all the to-be-deleted items to null, then used the RemoveAll() feature to remove all the items that are not null

private void RemoveTest6(ref List<String> list)
{
    for (int i = 0; i < list.Count; i++)
        if (i % 5 == 0)
            list[i] = null;
    list.RemoveAll(x => x == null);
}

Client Application and Outputs

int numItems = 500000;
Stopwatch watch = new Stopwatch();

// List 1...
watch.Start();
List<String> list1 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());

watch.Reset(); watch.Start();
RemoveTest1(ref list1);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();

// List 2...
watch.Start();
List<String> list2 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());

watch.Reset(); watch.Start();
RemoveTest2(ref list2);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();

// List 3...
watch.Reset(); watch.Start();
List<String> list3 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());

watch.Reset(); watch.Start();
RemoveTest3(ref list3);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();

// List 4...
watch.Reset(); watch.Start();
List<String> list4 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());

watch.Reset(); watch.Start();
RemoveTest4(ref list4);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();

// List 5...
watch.Reset(); watch.Start();
List<String> list5 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());

watch.Reset(); watch.Start();
RemoveTest5(ref list5);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();

// List 6...
watch.Reset(); watch.Start();
List<String> list6 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());

watch.Reset(); watch.Start();
RemoveTest6(ref list6);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();

Results

00:00:00.1433089   // Create list
00:00:32.8031420   // RemoveAt()

00:00:32.9612512   // Forgot to reset stopwatch :(
00:04:40.3633045   // Remove()

00:00:00.2405003   // Create list
00:00:01.1054731   // Null, Sort(), RemoveRange()

00:00:00.1796988   // Create list
00:00:00.0166984   // Add good values to new list

00:00:00.2115022   // Create list
00:00:00.0194616   // FindAll()

00:00:00.3064646   // Create list
00:00:00.0167236   // RemoveAll()

Notes And Comments

My question is: What is the best way to quickly remove many items from a List<T>? Most of the approaches I've tried look really ugly, and potentially dangerous, to me. Is a List the wrong data structure?

Right now, I am leaning towards creating a new list and adding the good items to the new list, but it seems like there should be a better way.

Upvotes: 82

Views: 104618

Answers (11)

Momchil Marinov
Momchil Marinov

Reputation: 99

If you still want to use a List as an underlying structure, you can use the following extension method, which does the heavy lifting for you.

using System.Collections.Generic;
using System.Linq;

namespace Library.Extensions
{
    public static class ListExtensions
    {
        public static IEnumerable<T> RemoveRange<T>(this List<T> list, IEnumerable<T> range)
        {
            var removed = list.Intersect(range).ToArray();
            if (!removed.Any())
            {
                return Enumerable.Empty<T>();
            }

            var remaining = list.Except(removed).ToArray();
            list.Clear();
            list.AddRange(remaining);

            return removed;
        }
    }
}

A simple stopwatch test gives results in about 200ms for removal. Keep in mind this is not a real benchmark usage.

public class Program
    {
        static void Main(string[] args)
        {
            var list = Enumerable
                .Range(0, 500_000)
                .Select(x => x.ToString())
                .ToList();

            var allFifthItems = list.Where((_, index) => index % 5 == 0).ToArray();

            var sw = Stopwatch.StartNew();
            list.RemoveRange(allFifthItems);
            sw.Stop();

            var message = $"{allFifthItems.Length} elements removed in {sw.Elapsed}";
            Console.WriteLine(message);
        }
    }

Output:

100000 elements removed in 00:00:00.2291337

Upvotes: -1

Patrik Kimmeswenger
Patrik Kimmeswenger

Reputation: 21

Lists are faster than LinkedLists until n gets realy big. The reason for this is because so called cache misses occur quite more frequently using LinkedLists than Lists. Memory look ups are quite expensive. As a list is implemented as an array the CPU can load a bunch of data at once because it knows the required data is stored next to each other. However a linked list does not give the CPU any hint which data is required next which forces the CPU to do quite more memory look ups. By the way. With term memory I mean RAM.

For further details take a look at: https://jackmott.github.io/programming/2016/08/20/when-bigo-foolsya.html

Upvotes: 2

Qwertie
Qwertie

Reputation: 17176

The other answers (and the question itself) offer various ways of dealing with this "slug" (slowness bug) using the built-in .NET Framework classes.

But if you're willing to switch to a third-party library, you can get better performance simply by changing the data structure, and leaving your code unchanged except for the list type.

The Loyc Core libraries include two types that work the same way as List<T> but can remove items faster:

  • DList<T> is a simple data structure that gives you a 2x speedup over List<T> when removing items from random locations
  • AList<T> is a sophisticated data structure that gives you a large speedup over List<T> when your lists are very long (but may be slower when the list is short).

Upvotes: 1

Yosef O
Yosef O

Reputation: 367

If the order does not matter then there is a simple O(1) List.Remove method.

public static class ListExt
{
    // O(1) 
    public static void RemoveBySwap<T>(this List<T> list, int index)
    {
        list[index] = list[list.Count - 1];
        list.RemoveAt(list.Count - 1);
    }

    // O(n)
    public static void RemoveBySwap<T>(this List<T> list, T item)
    {
        int index = list.IndexOf(item);
        RemoveBySwap(list, index);
    }

    // O(n)
    public static void RemoveBySwap<T>(this List<T> list, Predicate<T> predicate)
    {
        int index = list.FindIndex(predicate);
        RemoveBySwap(list, index);
    }
}

This solution is friendly for memory traversal, so even if you need to find the index first it will be very fast.

Notes:

  • Finding the index of an item must be O(n) since the list must be unsorted.
  • Linked lists are slow on traversal, especially for large collections with long life spans.

Upvotes: 24

NeilPearson
NeilPearson

Reputation: 126

Or you could do this:

List<int> listA;
List<int> listB;

...

List<int> resultingList = listA.Except(listB);

Upvotes: 3

NeilPearson
NeilPearson

Reputation: 126

I've found when dealing with large lists, this is often faster. The speed of the Remove and finding the right item in the dictionary to remove, more than makes up for creating the dictionary. A couple things though, the original list has to have unique values, and I don't think the order is guaranteed once you are done.

List<long> hundredThousandItemsInOrignalList;
List<long> fiftyThousandItemsToRemove;

// populate lists...

Dictionary<long, long> originalItems = hundredThousandItemsInOrignalList.ToDictionary(i => i);

foreach (long i in fiftyThousandItemsToRemove)
{
    originalItems.Remove(i);
}

List<long> newList = originalItems.Select(i => i.Key).ToList();

Upvotes: 3

arviman
arviman

Reputation: 5255

You could always remove the items from the end of the list. List removal is O(1) when performed on the last element since all it does is decrement count. There is no shifting of next elements involved. (which is the reason why list removal is O(n) generally)

for (int i = list.Count - 1; i >= 0; --i)
  list.RemoveAt(i);

Upvotes: 4

Alex L
Alex L

Reputation: 390

Ok try RemoveAll used like this

static void Main(string[] args)
{
    Stopwatch watch = new Stopwatch();
    watch.Start();
    List<Int32> test = GetList(500000);
    watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
    watch.Reset(); watch.Start();
    test.RemoveAll( t=> t % 5 == 0);
    List<String> test2 = test.ConvertAll(delegate(int i) { return i.ToString(); });
    watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());

    Console.WriteLine((500000 - test.Count).ToString());
    Console.ReadLine();

}

static private List<Int32> GetList(int size)
{
    List<Int32> test = new List<Int32>();
    for (int i = 0; i < 500000; i++)
        test.Add(i);
    return test;
}

this only loops twice and removes eactly 100,000 items

My output for this code:

00:00:00.0099495 
00:00:00.1945987 
1000000

Updated to try a HashSet

static void Main(string[] args)
    {
        Stopwatch watch = new Stopwatch();
        do
        {
            // Test with list
            watch.Reset(); watch.Start();
            List<Int32> test = GetList(500000);
            watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
            watch.Reset(); watch.Start();
            List<String> myList = RemoveTest(test);
            watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
            Console.WriteLine((500000 - test.Count).ToString());
            Console.WriteLine();

            // Test with HashSet
            watch.Reset(); watch.Start();
            HashSet<String> test2 = GetStringList(500000);
            watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
            watch.Reset(); watch.Start();
            HashSet<String> myList2 = RemoveTest(test2);
            watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
            Console.WriteLine((500000 - test.Count).ToString());
            Console.WriteLine();
        } while (Console.ReadKey().Key != ConsoleKey.Escape);

    }

    static private List<Int32> GetList(int size)
    {
        List<Int32> test = new List<Int32>();
        for (int i = 0; i < 500000; i++)
            test.Add(i);
        return test;
    }

    static private HashSet<String> GetStringList(int size)
    {
        HashSet<String> test = new HashSet<String>();
        for (int i = 0; i < 500000; i++)
            test.Add(i.ToString());
        return test;
    }

    static private List<String> RemoveTest(List<Int32> list)
    {
        list.RemoveAll(t => t % 5 == 0);
        return list.ConvertAll(delegate(int i) { return i.ToString(); });
    }

    static private HashSet<String> RemoveTest(HashSet<String> list)
    {
        list.RemoveWhere(t => Convert.ToInt32(t) % 5 == 0);
        return list;
    }

This gives me:

00:00:00.0131586
00:00:00.1454723
100000

00:00:00.3459420
00:00:00.2122574
100000

Upvotes: 3

Jon Skeet
Jon Skeet

Reputation: 1500055

If you're happy creating a new list, you don't have to go through setting items to null. For example:

// This overload of Where provides the index as well as the value. Unless
// you need the index, use the simpler overload which just provides the value.
List<string> newList = oldList.Where((value, index) => index % 5 != 0)
                              .ToList();

However, you might want to look at alternative data structures, such as LinkedList<T> or HashSet<T>. It really depends on what features you need from your data structure.

Upvotes: 20

Steve Morgan
Steve Morgan

Reputation: 13091

List isn't an efficient data structure when it comes to removal. You would do better to use a double linked list (LinkedList) as removal simply requires reference updates in the adjacent entries.

Upvotes: 38

Daniel A. White
Daniel A. White

Reputation: 190907

I feel a HashSet, LinkedList or Dictionary will do you much better.

Upvotes: 15

Related Questions