Reputation: 7356
I have an application that analyzes a system with a large number of interactions, and I need to make certain choices based on how frequently unique items occur in that system. For example, if you had this list of letters:
A, B, F, G, A, T, S, B, S, B, S, Q, Z, B, Q, S
Here is a list showing how often each letter occurs (occurrences):
A - 2
B - 4
F - 1
G - 1
Q - 2
T - 1
S - 4
Z - 1
So the frequencies of those occurrence counts are as follows (occurrence occurrences):
4 - 2
2 - 2
1 - 4
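For reference, the two tallies above can be reproduced with a short Python sketch. This is only an illustration of the counting step, not the code the application actually uses; the variable names are made up for the example.

```python
from collections import Counter

letters = ["A", "B", "F", "G", "A", "T", "S", "B",
           "S", "B", "S", "Q", "Z", "B", "Q", "S"]

# How often each letter occurs (occurrences).
occurrences = Counter(letters)

# How many letters share each occurrence count (occurrence occurrences).
occurrence_frequency = Counter(occurrences.values())

# Sort descending on the occurrence count, like the list above.
pairs = sorted(occurrence_frequency.items(), reverse=True)
print(pairs)  # [(4, 2), (2, 2), (1, 4)]
```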
The above is a tiny example, but I've attached an image which is a simple line graph of a larger system.
In this graph the numbers along the bottom aren't really important. They are just marking the number of unique frequencies. And the Y-axis marks the value of that frequency.
What I'm looking for is a mathematical/programmatic way to find the point where that line begins to break upwards. My searches haven't yielded what I'm looking for as I'm not really sure what the proper terminology is, or the name of the concept.
Right now, we have to manually choose that point based on a human looking at the numbers and saying "here". But I want to, at the very least, already have a "recommended" value chosen, and at the most, be able to remove the human component completely.
For clarification, my current algorithm produces a list of number pairs, each mapping an occurrence count to its occurrence frequency. My use of the word "frequency" in no way relates to electromagnetic signals, but rather to how often an occurrence occurs. I thought that saying "occurrence occurrences" everywhere would be more confusing!
In this system, the general trend is that a few entities will show up in a large number of interactions, more entities will show up in a medium number of interactions, but the greatest number of entities will show up in just a few, or even no, interactions. It would be tough to imagine a scenario where it was different than that... worst case would probably be a plateau. But there could definitely be a dip after a jump at any point from the beginning to the end. The illustration above just doesn't show that. We cannot assume that there will be a point where it will begin to rise with no drops afterwards.
Here is my data. (The simple graph above was produced with the Occurrence Frequency column data only):
This list, as you can see, is sorted in descending order on the occurrence column. This is from a small system with 904 unique entities. Those entities have 38 unique occurrence rates. If you started at the top of this list, you could say:
"2 entities occur 309 times"
"1 entity occurs 130 times"
etc.
Ultimately what I'm trying to determine is the importance of an entity based on how often it occurs in the system. I need to be able to flag certain items as "important", but all items can't be important. And the method/algorithm I'm looking for would help to identify at what point in that list do I stop considering items important.
If you look at the list, you can see where the lower occurrences start becoming more frequent. I don't think that I can sort on the right column because the left column is really the key data. Greater occurrences = more importance.
But I still need to figure out how to determine that.
Upvotes: 2
Views: 284
Reputation: 11251
Based on your statement:
Ultimately what I'm trying to determine is the importance of an entity based on how often it occurs in the system. I need to be able to flag certain items as "important", but all items can't be important. And the method/algorithm I'm looking for would help to identify at what point in that list do I stop considering items important.
I would examine the problem in terms of probability and statistics instead of a function graph. Using your sample data, the probability of a certain letter x occurring is simply the number of x's in the data divided by the total count of letters.
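As a rough sketch of that calculation in Python (the function name letter_probabilities is just something I made up for the example), using the letters from the question:

```python
from collections import Counter

def letter_probabilities(items):
    """Map each distinct item to (its count) / (total number of items)."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}

letters = "A B F G A T S B S B S Q Z B Q S".split()
probs = letter_probabilities(letters)
print(probs["B"])  # 0.25, since B appears 4 times out of 16
```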
Some simple possibilities to try out:
You could use much more advanced probability theory, but one of these might be good enough.
Upvotes: 0
Reputation: 35600
Is there any reason the larger example isn't sorted? If you sort it by increasing Y values, then you can take the slope of each consecutive pair, and call the breakpoint where the slope changes significantly.
You can tweak the rules for "changes significantly" to meet your exact needs. It might be as simple as "the slope that increases most compared to the previous one", or "the first slope that varies more than X% from the running average slope". Or maybe the largest rss of the differences between the slope at the test point and the one before and the one after.
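Here is a minimal sketch of the first rule only ("the slope that increases most compared to the previous one"), assuming the points are evenly spaced so that each slope is just the difference between neighbouring Y values. The function name and the sample numbers are invented for illustration; they are not your data.

```python
def largest_slope_increase(ys):
    """Return the index (into the sorted values) where the slope
    increases most compared to the previous slope."""
    ys = sorted(ys)
    if len(ys) < 3:
        raise ValueError("need at least three values to compare two slopes")
    # Slope between each consecutive pair (unit spacing assumed).
    slopes = [b - a for a, b in zip(ys, ys[1:])]
    # changes[i] is the change in slope at the point ys[i + 1].
    changes = [s2 - s1 for s1, s2 in zip(slopes, slopes[1:])]
    best = max(range(len(changes)), key=changes.__getitem__)
    return best + 1

values = [1, 2, 3, 4, 5, 20, 35]  # made-up sample
idx = largest_slope_increase(values)
print(idx, sorted(values)[idx])  # 4 5
```

On a tie, max picks the earliest candidate, so the first equally sharp bend wins. The other rules would only change how the breakpoint is picked from the slopes.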
After the edit, I think it may be as simple as taking a percentage. Multiply each X and Y, and take the sum over all entries. That's the total number of events observed. Now start from the bottom of your table, and start subtracting each row's product from the total until you get to less than X% of the original total. What you are left with is the "significant" events that contributed most to the total.
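A sketch of that percentage idea, under one reading of "until you get to less than X%": rows are dropped from the bottom until what remains falls below the threshold. The function name significant_rows and the 80% figure are assumptions for the example; the rows are the small letter example from the question, as (occurrences, number of entities) pairs sorted descending.

```python
def significant_rows(rows, percent):
    """rows: (occurrences, entity_count) pairs, sorted descending on occurrences.
    Returns the rows kept, the events they account for, and the original total."""
    total = sum(occ * count for occ, count in rows)
    threshold = total * percent / 100
    remaining = total
    kept = list(rows)
    # Drop rows from the bottom until what is left falls below the threshold.
    while kept and remaining >= threshold:
        occ, count = kept.pop()
        remaining -= occ * count
    return kept, remaining, total

rows = [(4, 2), (2, 2), (1, 4)]
kept, remaining, total = significant_rows(rows, 80)
print(kept, remaining, total)  # [(4, 2), (2, 2)] 12 16
```

If the threshold is set too low, every row is dropped, so the percentage needs some tuning against real data.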
I have a feeling this is a common problem in statistics, but I don't have enough background to say what the proper terminology is, although standard deviations come to mind.
Upvotes: 2
Reputation: 43494
All you need to do is refine what "find the point where that line begins to break upwards" means. Based on what you say, I can assume as a precondition that the line always breaks upwards and that from that point on it never goes down (not even a single step). This means that in your example it'll return 33, not 32.
It also assumes you'll have at least 2 values... if you have one there is nothing to compare it to, right? :)
So, the algorithm to solve this would be something like:
$currentYValue = get the highest Y value
$currentXValue = get the X value corresponding to $currentYValue
repeat
    $previousYValue = $currentYValue
    $previousXValue = $currentXValue
    $currentXValue = $previousXValue - 1
    $currentYValue = get the Y value for $currentXValue
until ($currentYValue > $previousYValue)
print "The line breaks upwards at point: $previousXValue with value $previousYValue"
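For what it's worth, here is how I would sketch the same backward walk in Python. It assumes the Y values are in a plain list indexed by X (starting at 0), and it adds a stop at the start of the list in case the line never dips. The function name and the sample values are made up.

```python
def break_point(ys):
    """Walk back from the highest value while the line keeps falling
    (or stays level); return the index where that walk stops."""
    if len(ys) < 2:
        raise ValueError("need at least two values to compare")
    i = ys.index(max(ys))  # start at the highest Y value
    # Step backwards until the earlier value is larger than the later one.
    while i > 0 and ys[i - 1] <= ys[i]:
        i -= 1
    return i

ys = [1, 1, 2, 1, 3, 7, 20]  # made-up sample with a dip at index 3
x = break_point(ys)
print(f"The line breaks upwards at point: {x} with value {ys[x]}")
# The line breaks upwards at point: 3 with value 1
```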
Hope that helps
Upvotes: 1