By contrast, external models attempt to paint a picture of (or extract a set of rules that govern) normal behavior by observing input and output statistics over time. What is normal? That's a difficult question, worthy of PhD research. Suppose that I look at the number of incoming HTTP requests versus disk reads on a web server during a given time period. I might notice a strong positive correlation during this period: as the requests increase, so does the number of reads. If I were quick to define this relationship as normal, I would be ignoring several important facts:
- correlation is not causation; while it may be true that incoming requests are the primary driver of disk reads, they are unlikely to be the only one. There is also the classic statistical caveat that a third variable, or any number of other known or hidden drivers, may be influencing both.
- my sample is too small; it could be that my observations happen to fall within a period of strong correlation, and that further sampling would reveal a much weaker correlation, or a cyclic one.
- caching kicks in; depending on the nature of the requests during the observation period, caching may cause disk reads to drop precipitously once the cache warms up (the sketch after this list illustrates this effect, along with the sampling problem).
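To make the last two points concrete, here is a minimal sketch using hypothetical synthetic data; the workload model, the linear cache warm-up, and all variable names are assumptions for illustration, not measurements from a real server. It shows how a correlation that looks strong over a short window can weaken dramatically over a longer one as caching absorbs repeated reads:

```python
import numpy as np

rng = np.random.default_rng(42)

# One synthetic sample per minute over 8 hours (assumed workload).
minutes = 480
requests = rng.poisson(lam=200, size=minutes).astype(float)

# Early on, disk reads track requests closely; later, a warming cache
# absorbs most of them (hit rate climbs from 0% to 90%).
hit_rate = np.linspace(0.0, 0.9, minutes)
disk_reads = requests * (1 - hit_rate) + rng.normal(0, 5, minutes)

short = np.corrcoef(requests[:60], disk_reads[:60])[0, 1]
full = np.corrcoef(requests, disk_reads)[0, 1]
print(f"first hour: r = {short:.2f}")  # strong positive correlation
print(f"full 8 hrs: r = {full:.2f}")   # much weaker over the longer sample
```

The first hour alone would tempt me to declare "requests drive reads" normal; the full trace tells a different story.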
Sadly, there is usually a big difference between the ideal and reality. While inputs and outputs may be correlated in some way, they may not be related by any simple function, and even if they are, that function may not be easy to extract.
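As a sketch of that difficulty (again with hypothetical synthetic data), suppose I "extract" the request-to-reads relationship as a linear fit over the first hour of observation. Applied seven hours later, once the cache has warmed, the extracted function badly over-predicts:

```python
import numpy as np

# Same assumed workload as the previous sketch.
rng = np.random.default_rng(42)
minutes = 480
requests = rng.poisson(lam=200, size=minutes).astype(float)
hit_rate = np.linspace(0.0, 0.9, minutes)  # cache warming up
disk_reads = requests * (1 - hit_rate) + rng.normal(0, 5, minutes)

# "Extract" the function as a linear fit on the first (cold-cache) hour...
slope, intercept = np.polyfit(requests[:60], disk_reads[:60], deg=1)

# ...then apply it to the last (warm-cache) hour.
predicted = slope * requests[-60:] + intercept
overshoot = (predicted - disk_reads[-60:]).mean()
print(f"mean over-prediction in the last hour: {overshoot:.0f} reads/min")
```

The function was real enough while it held, but the system drifted out from under it, which is exactly the trap an external model must avoid.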