# Averaging Numbers in an Array Based on References

• August 6th, 2010, 08:38 AM
aussiemcgr
Averaging Numbers in an Array Based on References
Ok, so this is sort of hard to explain, so stay with me. I have an array of Objects that each contain a Name and a Number. The Names are dates (months and years). The Numbers need to be averaged, but based on conditions. I need to average two numbers together for each month (January through December). However, not every month of every year is there, not all months have data, the two most recent years must be averaged, and I can't average two months together if that will give a misleading average (unless those are my only options).

My array objects will look something like this:
Apr-06 2,476
May-06 2,201
Jun-06 1,783
Jul-06 2,048
Aug-06 1,557
Sep-06 1,533
Oct-06 2,614
Nov-06 2,804
Dec-06 2,951
Jan-07 3,644
Feb-07 3,250
Mar-07 3,279
Apr-07 3,007
May-07 3,273
Jun-07 2,340
Jul-07 2,276
Aug-07 1,819
Sep-07 1,519
Oct-07 1,921
Nov-07 1,983
Dec-07 2,200
Jan-08 2,398
Feb-08 2,604
Mar-08 2,664
Apr-08 1,930
May-08 1,316
Jun-08 1,105
Jul-08 1,090
Aug-08 593
Sep-08
Oct-08
Nov-08
Dec-08
Jan-09
Feb-09
Mar-09
Apr-09
May-09
Jun-09
Jul-09
Aug-09
Sep-09
Oct-09
Nov-09
Dec-09 827
Jan-10 1,539
Feb-10 1,607
Mar-10 1,823

So for this array, I would want to do the following averages:
Jan: Jan-10 and Jan-08
Feb: Feb-10 and Feb-08
Mar: Mar-10 and Mar-08
Apr: Apr-08 and Apr-07
May: May-08 and May-07
Jun: Jun-08 and Jun-07
Jul: Jul-08 and Jul-07
Aug: Aug-07 and Aug-06
Sep: Sep-07 and Sep-06
Oct: Oct-07 and Oct-06
Nov: Nov-07 and Nov-06
Dec: Dec-07 and Dec-06

Notice how I don't want to average in Aug-08 and Dec-09 because they will make the averages misleading (as they are well below the trends of the rest of the data). Also, I don't want to average in Sep-08 through Nov-09 because those months don't have data for them.
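As a rough sketch of the easy part described above (grouping by month and skipping entries with no data), the Python below picks the two most recent years with data for each month. The function names and the assumption that two-digit years mean 20xx are illustrative, not from the original post. It deliberately does no outlier check, so it still picks Aug-08, which is exactly the problem being asked about.

```python
from collections import defaultdict

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def parse_entry(line):
    """Turn 'Apr-06 2,476' into ('Apr', 2006, 2476) and 'Sep-08' into ('Sep', 2008, None).

    Assumption: two-digit years are in the 2000s.
    """
    parts = line.split()
    month, yy = parts[0].split("-")
    value = int(parts[1].replace(",", "")) if len(parts) > 1 else None
    return month, 2000 + int(yy), value

def two_most_recent_by_month(lines):
    """Group entries by month name and keep the two newest years that have data."""
    by_month = defaultdict(list)
    for line in lines:
        month, year, value = parse_entry(line)
        if value is not None:          # skip months that are present but empty
            by_month[month].append((year, value))
    # Sorting (year, value) tuples in reverse puts the newest year first.
    return {month: sorted(by_month[month], reverse=True)[:2] for month in MONTHS}

sample = ["Aug-06 1,557", "Aug-07 1,819", "Aug-08 593", "Aug-09"]
picks = two_most_recent_by_month(sample)
print(picks["Aug"])  # [(2008, 593), (2007, 1819)]
```

Months that never appear in the input simply come back with an empty list.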

Now, leaving out the months that aren't there and the months without data is not too much of a hassle. But I can't think of a way to determine whether a data point is too small to include in the averaging. This is because each data series can have medians below 1000, above 7000, or possibly higher or lower. Also, just because a data point is below a certain level for the series doesn't mean that it will make the average misleading. For instance, looking at the data, January's data is usually much higher than September's data. We don't want January's high data making it so September's data doesn't get counted because it is considered too low.

Can anyone help me think of a process to determine if a data point is inconsistent with the corresponding data for its month?
• August 6th, 2010, 08:54 AM
helloworld922
Re: Averaging Numbers in an Array Based on References
The easiest way to see if a number is "way off" is to simply discount the highest and lowest value for that month (assuming you've got at least 4-5 samples/month).
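A minimal sketch of this "drop the highest and lowest" idea, assuming the values for one month have already been collected into a list (the function name `trimmed_mean` is made up for illustration):

```python
def trimmed_mean(values):
    """Average a non-empty list after discarding one highest and one lowest value.

    With fewer than three values there is nothing sensible to discard,
    so the plain average is returned instead.
    """
    if len(values) < 3:
        return sum(values) / len(values)
    trimmed = sorted(values)[1:-1]     # drop the smallest and the largest
    return sum(trimmed) / len(trimmed)

aug = [1557, 1819, 593]                # Aug-06, Aug-07, Aug-08 from the first post
print(trimmed_mean(aug))               # 1557.0
```

With only three samples a single value survives, and the good high value (1,819) is thrown away along with the suspicious low one, which is why this approach wants at least 4-5 samples per month.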

If you want to get fancier, you can compute the standard deviation for each month. If it's outside some range (the smaller the standard deviation, the more consistent your data is), remove the values that deviate from the month's mean by more than the standard deviation. Note that this method will likely require more data points than the first method in order to be effective.

There are other methods you can use to determine if your data is good if you feel these two methods aren't good enough. Look inside a statistics book (note that many of these methods will require significantly more data points per month compared to the two methods described above).
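One way this could look in Python, using the standard `statistics` module. The function name and the `k` parameter (how many standard deviations to tolerate) are assumptions added for the sketch, and it uses the sample standard deviation, which is only one possible choice:

```python
import statistics

def within_std(values, k=1.0):
    """Keep only the values within k sample standard deviations of the mean.

    Needs at least two values; with so few points the result is a rough guide at best.
    """
    mean = statistics.mean(values)
    spread = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) <= k * spread]

aug = [1557, 1819, 593]                # Aug-06, Aug-07, Aug-08 from the first post
kept = within_std(aug)
print(kept)                            # [1557, 1819]
```

Here the mean is 1,323 and the standard deviation is roughly 646, so only 593 falls outside the band. With three points the band is very sensitive to the outlier itself, so this should be treated as a sketch rather than a reliable test.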
• August 6th, 2010, 09:26 AM
aussiemcgr
Re: Averaging Numbers in an Array Based on References
Well, there is the problem I'm facing. I'm usually not going to get 4-5 samples each time. For the program to run at all, I need at least 2, but that might be all I get (in which case I would average the two regardless). The problem of finding out if a data point is out of the trend will occur when I get at least 3 data points (which is more than likely going to happen as long as the data is good). However, as in the example above, I may only get 3 data points.

Quote:

The easiest way to see if a number is "way off" is to simply discount the highest and lowest value for that month (assuming you've got at least 4-5 samples/month).
The problem with this is the condition that if the most recent data is good, it must be used in the average. The most recent data could be the highest or the lowest, but as long as it is not too low, it will be used.

A problem will never occur where a point is too high to be included in the average. The only reason data may be too low is that, either just before or just after that point, there was a period of no data or an outside force that manipulated that month's data. If it wasn't for the latter, finding that low point wouldn't be an issue, but my problem isn't that clean and simple.
• August 6th, 2010, 02:08 PM
helloworld922
Re: Averaging Numbers in an Array Based on References
Perhaps if you gave us the problem you are trying to solve on a high level?

In general these methods work well for a noisy data set, but there may be some special reason why they can't be adapted to fit your application, or there may be a much better way to determine what is bad data.
• August 6th, 2010, 03:53 PM
aussiemcgr
Re: Averaging Numbers in an Array Based on References
Quote:

Originally Posted by helloworld922
Perhaps if you gave us the problem you are trying to solve on a high level?

In general these methods work well for a noisy data set, but there may be some special reason why they can't be adapted to fit your application, or there may be a much better way to determine what is bad data.

What do you mean by "on a high level"?
• August 6th, 2010, 04:46 PM
helloworld922
Re: Averaging Numbers in an Array Based on References
What is the scope of the whole application? For example, does it count the number of people who visit a specific website each month and provide different tools for analyzing the data?
• August 6th, 2010, 06:39 PM
aussiemcgr
Re: Averaging Numbers in an Array Based on References
The numbers indicate the number of people on board an airplane per month for a specific market. Months or years without data indicate periods of time where there were no flights for the current market. So the closest month, or several closest months, next to these empty periods show the numbers when leaving the market or when reentering the market. In some instances, there will be months missing from the data set entirely (not just no data, but not in the array at all). The average for each month is needed to build a seasonal trend for demand.
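Given that description, one heuristic that might fit (it is not something proposed in the thread, only a sketch under assumptions) is to distrust any month that sits directly next to a period with no flights, since that is where the market exit or reentry numbers show up. The code below assumes a one-month window around each gap, treats months missing from the array the same as empty ones, does not treat the start or end of the series as a gap, and falls back to the two most recent values when fewer than two trusted ones exist. It does not address the "outside force" case mentioned earlier.

```python
from collections import defaultdict

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def seasonal_averages(entries):
    """entries: (year, month 1-12, value or None) tuples; absent months may be omitted."""
    data = {y * 12 + (m - 1): v for y, m, v in entries if v is not None}
    first, last = min(data), max(data)

    def beside_gap(idx):
        # A neighbouring month inside the series' span with no data marks an edge.
        return any(first <= n <= last and n not in data for n in (idx - 1, idx + 1))

    by_month = defaultdict(list)
    for idx in sorted(data, reverse=True):          # newest first
        by_month[idx % 12].append(idx)

    averages = {}
    for m, indices in by_month.items():
        trusted = [i for i in indices if not beside_gap(i)]
        # Fall back to the raw two most recent if that is all there is.
        chosen = trusted[:2] if len(trusted) >= 2 else indices[:2]
        averages[MONTHS[m]] = sum(data[i] for i in chosen) / len(chosen)
    return averages

# The series from the first post, Apr-06 through Mar-10 (None = no data).
values = ([2476, 2201, 1783, 2048, 1557, 1533, 2614, 2804, 2951,
           3644, 3250, 3279, 3007, 3273, 2340, 2276, 1819, 1519, 1921, 1983, 2200,
           2398, 2604, 2664, 1930, 1316, 1105, 1090, 593]
          + [None] * 15                               # Sep-08 through Nov-09
          + [827, 1539, 1607, 1823])
start = 2006 * 12 + 3                                 # running index of Apr-06
entries = [((start + i) // 12, (start + i) % 12 + 1, v) for i, v in enumerate(values)]

averages = seasonal_averages(entries)
print(averages["Aug"])  # 1688.0  (Aug-07 and Aug-06; Aug-08 is beside the gap)
print(averages["Dec"])  # 2575.5  (Dec-07 and Dec-06; Dec-09 is beside the gap)
print(averages["Jan"])  # 1968.5  (Jan-10 and Jan-08)
```

On this particular series the pairs chosen match the list in the first post, but a one-month window may be too narrow for markets where the ramp-up or wind-down lasts several months.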