问题描述:

SHORT VERSION:

I have a List object that contains a number of duplicate values (double's) that exist in runs of duplicate values interspersed with runs of changing values. I want to reduce the space in memory that this List object takes up WITHOUT compromising the association between the indices and the values. I would also like to maintain as close to O(1) algorithmic look-up time as possible, using the index as the look-up. For example, if you have a list with elements {0, 0.1, 0.1, 0.1, 0.2}, then the new object/entity would always return 0.1 if given the index 1,2, or 3. I would expect I would need to create my own object (perhaps implementing IList), or use an existing object out there. I have an idea on how to implement this that would make the algorithm O(log(m)), where, m is the number of runs of identical values (in my example, there would only be 1 run). However, I would rather not roll my own if possible.

Does such an object exist for C#, or do I need to roll my own?

MOTIVATION/LONG VERSION:

I have a desktop application that is doing some heavy scientific calculations. The calculations generate a large amount of data, and that data is organized based on time. That is, for time 50, there is a value of variable x, y, and z. For time 51, there is another value of variable x, y, and z. I have a List that contains all the times that the calculation was run at. Each variable has a List, whose indexes are the same as those for the times List. That is, if you look at the index 234 of the time array, you might get the time 46 (seconds). The calculation of each variable at time 46 (seconds) would be then found at index 234 of the List for that variable.

There are around 100,000 such variables (and thus 100,000 Lists), but only one time List. I also expect to add many more variables. This is obviously a bit of a memory issue. (at least around 200 MB of raw space at present time :-) ). This should also explain why I want to use the index as the method of finding the value of a certain variable at a certain time.

It is fairly typical for a variable to have only 0's in the first x number of slots. Or after index y, the variable holds constant until the end. I would say that a worst case for the number of periods where the values are constant might be around 30 in a single list, but more typically between 2 and 5. The number of total values in each array might be typically around 250.

EDIT:

Note that I am expecting to add even more variables than the 100,000, so this is bigger issue than just 200 MB. To explain more of the motivation for this, my app was running at around 1+ GB at present, and I saw the 200 MB as the low-hanging fruit for reduction of memory usage.

EDIT2:

I realized a very important edit to my explanation- I have editted it above and also explain it here. The Lists may have runs in them, but they also have sections where the values are changing from index to index. So a better example of a list I might have is the following:

0 0 0 0 0 0 ....(50 duplicate 0's )...0.1 0.2 0.4 0.5 0.6 ... (50 more changing values) ... 200.45 200.45 200.45 200.55 ... (50 more duplicate values) .... etc.

网友答案:

I'm assuming that your O(log(m)) idea is to basically create a binary search tree, using the index range to order the results.

I would absolutely go with that solution. If you only have up to about 30 runs per list, you really don't need to worry about the way it scales with m, as m is never particularly large... you may well find that any constant-time solution is actually worse in any real-world case than your search tree approach.

In fact, I'd probably initially go for a simple list of runs (where each run is an index range and a value) and an O(m) lookup... if your typical size is 2-5, then it won't be particularly bad, and it will be simpler to implement. Once you've got a simple approach working, then you can optimize.

In fact, I'd start off without even doing this "run" version at all to start with. Unless you need to run this on particularly limited mobile phones, 200MB or so is really not too large a data set. What machines will the application actually be running on? Do you have reason to believe they can't afford, say, half a gigabyte for your application?

It's also worth bearing in mind that the overhead of the binary search tree or the list of runs may well mean you don't save as much as you'd expect to anyway.

Basically, I'd implement in this order:

  • Arrays
  • List of runs
  • Binary search tree

Benchmark the performance (time and space) at each step, and make sure you have concrete goals about what's good enough.

EDIT: With the edited version, you might want to have some sort of interface IPortion with:

int MinIndexInclusive { get; }
int MaxIndexExclusive { get; }
double FindValue(int index);

with two implementations: ArrayPortion and TreePortion. Each node of the TreePortion would have a left side and a right side, each of which was another IPortion - that could let you have an ArrayPortion embedded within a TreePortion, for example.

Or somewhat simpler, you could just keep it flat, and have a List<IPortion> where each IPortion was either an ArrayPortion or a RunPortion where the RunPortion only knew about a single value and its index bounds. You could then do a binary search on the list to find the right portion, and ask it for the value at the index.

网友答案:

Seems to me that you could do this with a List<T> and a binary search. You don't need to store a list of runs. All you really need to store is the index and value when the time changes.

So, have a simple struct:

struct ValueChange
{
    public int TimeIndex;  // or whatever type you use for the index
    public double Value;
    // Add constructor here
}

(Yes, I know that mutable values in structs are bad. I coded it this way for brevity. In real code, those would be readonly properties with private backing fields.)

Then you have a List<ValueChange>. Whenever the value changes, you append one of those to the list. You can tell if the value changed easily enough:

if (currentValue != theList[theList.Count-1].Value)
{
    theList.Add(new ValueChange(timeIndex, currentValue));
}

And when you want to find what the value was at a particular time index, you do a binary search for the time index. If the index you're looking for isn't there, the return value for List.BinarySearch will tell you the index of the item that contains the value you're looking for.

The drawback to any kind of run-length compression, of course, is that short runs turn it into a data expander rather than a compressor. In this particular case, you need an overall run-length average of 2 in order to break even. That is, if you want to represent values for N time periods, you can't have more than N/2 value changes, because the ValueChange structure is twice the size of your double.

相关阅读:
Top