Last updated at Mon, 06 Nov 2017 21:16:16 GMT

Here at Logentries we are constantly adding to the options for analysing log-generated data. The query language "LEQL" has a number of statistical functions, and a recent addition is the new standard deviation calculation.

LEQL Query Example
where(image=debian) groupby(location) calculate(standarddeviation:usage)
An interesting point about this new function is that at the heart of its implementation is the calculation of the "variance" of the data using a specific type of algorithm, namely an "online" algorithm. Once the variance is found, the standard deviation is the square root of that value. The challenge lies in finding the variance of a series of data points without knowing in advance how many data points there will be.

Online Algorithm

An "online" algorithm is one that is designed to handle input data that arrives as a sequence rather than as a complete set. Log events are a good example: this type of data arrives one event after another. An online algorithm processes each new piece of data or log event as it arrives in order to produce a final result. Note also that such algorithms are designed without any assumptions about the data that may arrive in the future, such as how many events there will be or when they will occur. These are unknowns that the algorithm designer must take into account.
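
To make the idea concrete, here is a minimal sketch of an online computation: a running mean that is updated as each value arrives and never stores the values themselves. The class name `RunningMean` and its method names are hypothetical, chosen only for illustration.

```java
/** Minimal sketch of an "online" computation: a running mean (hypothetical class). */
class RunningMean {
    private long count = 0;     // number of values seen so far
    private double mean = 0.0;  // mean of the values seen so far

    /** Process one value as it arrives; the value itself is not stored. */
    void accept(double value) {
        count++;
        mean += (value - mean) / count;  // incremental update of the mean
    }

    long getCount() { return count; }

    double getMean() { return mean; }

    public static void main(String[] args) {
        RunningMean rm = new RunningMean();
        double[] arriving = {4.0, 8.0, 6.0};  // stands in for a stream of log values
        for (double v : arriving) {
            rm.accept(v);
            System.out.println("after " + rm.getCount() + " value(s): mean = " + rm.getMean());
        }
    }
}
```

A result is available after every event, which is the property that matters when the length of the stream is unknown.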

The opposite is an 'offline' algorithm, where the complete data set of interest is provided to the algorithm at the same time. An offline algorithm therefore starts and finishes with a known, fixed number of data points.
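
For contrast, an offline computation can make several passes over the data because the whole set is already in memory. The following is a minimal sketch of a two-pass population variance; the class and method names are hypothetical.

```java
/** Minimal sketch of an "offline" computation: two-pass variance (hypothetical class). */
class OfflineVariance {

    /** Population variance of a complete data set; NaN if the set is empty. */
    static double populationVariance(double[] data) {
        if (data.length == 0) {
            return Double.NaN;
        }
        // Pass 1: the mean needs every value, so the full set must be available.
        double sum = 0.0;
        for (double v : data) {
            sum += v;
        }
        double mean = sum / data.length;

        // Pass 2: sum the squared differences from that mean.
        double squaredDiffs = 0.0;
        for (double v : data) {
            double d = v - mean;
            squaredDiffs += d * d;
        }
        return squaredDiffs / data.length;
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("population variance = " + populationVariance(data));
    }
}
```

The second pass is exactly what a stream of log events does not allow, since earlier events are no longer available once they have been processed.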

When trying to establish the performance of either type, the task is easier for 'off-line' algorithms, as "for each sequence of requests, such an algorithm selects that sequence of actions which minimise the cost". The difficulty with 'on-line' algorithms is that "whatever actions an on-line algorithm takes in response to an initial sequence of requests, there will be a sequence of further requests that makes the algorithm look foolish". For further reading on this topic, see (Karp 1992).

As noted in (Karp 1992), one measurement of the cost incurred by an 'on-line' algorithm is its 'competitive ratio', being "the maximum, over all possible input sequences, of the ratio between the cost incurred by the on-line algorithm and the cost incurred by an optimal off-line algorithm. An optimal on-line algorithm is then one whose competitive ratio is least". Of course, finding this ratio is easier said than done, and note also that not all "online" algorithms have matching 'offline' equivalents.
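
The quoted definition can be written compactly. The symbols below are assumptions chosen for illustration and are not taken from Karp's paper: $A$ is the on-line algorithm, $\mathrm{OPT}$ an optimal off-line algorithm, and $\sigma$ ranges over all possible input sequences.

```latex
c(A) \;=\; \max_{\sigma} \frac{\mathrm{cost}_{A}(\sigma)}{\mathrm{cost}_{\mathrm{OPT}}(\sigma)},
\qquad
A^{*} \;=\; \arg\min_{A} \, c(A)
```

Here $c(A)$ is the competitive ratio of $A$, and $A^{*}$ is an optimal on-line algorithm in the sense of the quotation, that is, one whose competitive ratio is least.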

B. P. Welford

The study of the problems that gave rise to such algorithms started long before the Internet came into being. It is fascinating to note the dates of the work that produced algorithms so critical to today's computing, telecommunications and Internet. The "online" variance algorithm arose from the work of the scientist B. P. Welford, who published a paper in 1962, "Note on a Method for Calculating Corrected Sums of Squares and Products", offering a solution in which values are used only once and need not be stored.

Standard Deviation

Standard deviation is a measurement of the dispersion of a set of data points. A low value indicates that the data points tend to be close to the mean (the expected value). In simple terms, the standard deviation of a set of data points is the square root of the variance. This raises the question of how to find the variance of a sequence of data where the number of data points and their values are unknown. This is where the "online" variance work done by B. P. Welford is relevant. There are two ways of finding the variance, depending on whether you are estimating the standard deviation of a population from a 'sample' data set or calculating it from the complete 'population' data set. The code example below shows methods for both approaches.
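
For reference, the update rule usually attributed to Welford can be written as follows. The symbols are assumptions chosen for illustration: $x_k$ is the $k$-th value to arrive, $M_k$ the running mean and $S_k$ the running sum of squared differences from the mean.

```latex
M_1 = x_1, \qquad S_1 = 0

M_k = M_{k-1} + \frac{x_k - M_{k-1}}{k}, \qquad
S_k = S_{k-1} + (x_k - M_{k-1})(x_k - M_k) \qquad (k \ge 2)

s^2 = \frac{S_n}{n-1} \quad \text{(sample variance)}, \qquad
\sigma^2 = \frac{S_n}{n} \quad \text{(population variance)}
```

Each new value updates $M_k$ and $S_k$ using only the previous pair, so nothing needs to be stored. The only difference between the 'sample' and 'population' results is the divisor, $n-1$ versus $n$, and the standard deviation is the square root of the chosen variance.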

The following code example is a Java implementation of a class called 'StandardDeviation', in which an "online" algorithm is used as the basis for finding the standard deviation of a sequence of individual data points. It is based on an implementation example by Sedgewick and Wayne. There is also a JUnit test class with two test methods, each of which uses the same sequence of numbers as data: one tests the result for the 'sample' standard deviation, the other for the 'population' standard deviation.

Java Class with Standard Deviation (Online Algorithm Implementation)
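
A minimal sketch of such a class might look like the following. The method names setValues, getVarianceSample and getVariancePopulation follow the description in this post; the field names and the two getStandardDeviation methods are assumptions added for illustration. This is not the original Logentries code, nor the Sedgewick and Wayne code it was based on.

```java
/**
 * Sketch of an online (Welford-style) standard deviation accumulator.
 * Field names and the getStandardDeviation* methods are assumptions.
 */
class StandardDeviation {
    private long count = 0;                // number of values seen so far
    private double mean = 0.0;             // running mean
    private double sumSquaredDiffs = 0.0;  // running sum of squared differences from the mean

    /** Accumulate one value from the sequence; the value is used once and not stored. */
    void setValues(double value) {
        count++;
        double delta = value - mean;                // distance from the old mean
        mean += delta / count;                      // update the running mean
        sumSquaredDiffs += delta * (value - mean);  // uses both the old and the new mean
    }

    /** Variance estimated from a sample (divides by n - 1); NaN for fewer than two values. */
    double getVarianceSample() {
        return count > 1 ? sumSquaredDiffs / (count - 1) : Double.NaN;
    }

    /** Variance of a complete population (divides by n); NaN if no values were seen. */
    double getVariancePopulation() {
        return count > 0 ? sumSquaredDiffs / count : Double.NaN;
    }

    double getStandardDeviationSample() {
        return Math.sqrt(getVarianceSample());
    }

    double getStandardDeviationPopulation() {
        return Math.sqrt(getVariancePopulation());
    }
}
```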

In the class above, the setValues method accumulates the data as it is fed each value from a sequence, one at a time. The simple yet significant difference in calculation between 'sample' and 'population' can be seen in the two methods getVarianceSample() and getVariancePopulation(). Once the variance is found, the standard deviation is the square root of this variance.

JUnit Class with Standard Deviation Test Methods

The JUnit test class above contains two almost identical test methods that exercise the standard deviation class. Steps [1] to [3] are the same in both methods: [1] a new object is created, [2] a list of numbers is generated, with identical values for both test methods, and [3] the list (sequence) of numbers is fed one number at a time to the 'sd.setvalue()' method. The difference between the methods is in the [4] assertEquals: in one method we get the standard deviation using the 'sample' variance calculation, while in the other we use the 'population' calculation. As can be seen, the resulting values are quite different.
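
The four steps can also be sketched in plain Java, without the JUnit dependency. Everything here is an assumption made for illustration: the class names, the compact accumulator (repeated so that the snippet runs on its own), the hand-rolled assertClose helper standing in for assertEquals, and the data set, which is not the one used in the original test.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of the test steps [1] to [4]; all names and data are hypothetical. */
class StandardDeviationCheck {

    /** Compact Welford-style accumulator, included so the snippet is self-contained. */
    static final class Accumulator {
        private long n;
        private double mean;
        private double m2;  // running sum of squared differences from the mean

        void setValues(double x) {
            n++;
            double d = x - mean;
            mean += d / n;
            m2 += d * (x - mean);
        }

        double sampleSd() { return Math.sqrt(m2 / (n - 1)); }

        double populationSd() { return Math.sqrt(m2 / n); }
    }

    /** Stand-in for JUnit's assertEquals with a tolerance. */
    static void assertClose(double expected, double actual, double tolerance) {
        if (Math.abs(expected - actual) > tolerance) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }

    public static void main(String[] args) {
        // [1] create a new object
        Accumulator sd = new Accumulator();
        // [2] generate a list of numbers (sum of squared differences from the mean is 32)
        List<Double> numbers = Arrays.asList(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0);
        // [3] feed the sequence one number at a time
        for (double x : numbers) {
            sd.setValues(x);
        }
        // [4] check both results: sample divides by n - 1, population by n
        assertClose(Math.sqrt(32.0 / 7.0), sd.sampleSd(), 1e-9);
        assertClose(2.0, sd.populationSd(), 1e-9);
        System.out.println("sample sd = " + sd.sampleSd()
                + ", population sd = " + sd.populationSd());
    }
}
```

With only eight values the two results differ noticeably, because dividing by 7 rather than 8 changes the variance by roughly 14 percent; the gap shrinks as the number of data points grows.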

References

Note that Logentries does not take any responsibility for the content of external websites.

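  1. Video: Logentries Query Language (LEQL)
  2. Wikipedia: Standard deviation
  3. Wikipedia: Variance
  4. Wikipedia: Online algorithm
  5. JUnit
  6. University website: Algorithms, 4th Edition, by Sedgewick and Wayne
  7. Citation: Welford, B.P., 1962. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3), pp. 419-420.
  8. Citation: Karp, R.M., 1992, August. On-line algorithms versus off-line algorithms: How much is it worth to know the future? In IFIP Congress (1) (Vol. 12, pp. 416-429).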

Use the LEQL analytic function standard deviation to examine the distribution of data points in your machine data. Get started today with a free 30-day trial.