Bayesian Time Series Techniques vs. Traditional Westinghouse Rules

Jul 22, 2014

At Analytical Flavor Systems, our job is to monitor our clients’ products for flaws, contaminations, and batch variations in real time. Clients review their products regularly, assigning each a score from 0 to 5 along 24 universal flavor dimensions. Additionally, every reviewer assigns the product an overall Perceived Quality (PQ) score between one and seven.

Our algorithms run 24 hours a day, 365 days a year, protecting our clients from shipping a bad batch and hurting their brand. Our clients are alerted to any significant drop in quality through email and phone alerts. This blog post will explain how we updated our models from the Western Electric rules to a more precise and accurate Adaptive Bayesian model.

The image above is a representation of the Western Electric rules we used to use for data analysis. Each data point represents the weighted average (or \(\bar x\)) of a week’s worth of Perceived Quality values. The grey line represents the mean of all Perceived Quality values we have for this product. The other horizontal lines represent distance from the mean (in standard deviations).

The original model we used was the Western Electric rules, which operate on the assumption that \(\bar x\) values (or sequences of \(\bar x\) values) outside a certain number of standard deviations from the mean have a very low probability of being generated by random processes, and could thus be a symptom of meaningful variation (such as a beer flaw like dimethyl sulfide).
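To make the idea concrete, here is a minimal sketch of the four commonly cited Western Electric zone rules applied to a series of weekly \(\bar x\) values. The function name, the sample data, and the choice of center and sigma are illustrative assumptions, not our production code.

```python
def western_electric_flags(xbars, center, sigma):
    """Return (index, rule) pairs for every point at which a rule fires."""
    z = [(x - center) / sigma for x in xbars]
    # (rule name, window length, minimum hits, threshold in sigmas)
    rules = [
        ("1 point beyond 3 sigma", 1, 1, 3.0),
        ("2 of 3 beyond 2 sigma", 3, 2, 2.0),
        ("4 of 5 beyond 1 sigma", 5, 4, 1.0),
        ("8 in a row on one side", 8, 8, 0.0),
    ]
    flags = []
    for i in range(len(z)):
        for name, window, hits, limit in rules:
            if i + 1 < window:
                continue  # not enough history yet for this rule
            recent = z[i + 1 - window : i + 1]
            # Check the upper side (+1) and the lower side (-1) separately.
            for side in (1, -1):
                if sum(side * v > limit for v in recent) >= hits:
                    flags.append((i, name))
    return flags


# Hypothetical weekly PQ averages; center and sigma are assumed known.
weekly = [4.0, 4.1, 3.9, 4.0, 3.4, 3.3, 3.4, 3.2, 3.4]
flags = western_electric_flags(weekly, center=4.0, sigma=0.5)
print(flags)
```

In this made-up series, the sustained dip in the last five weeks trips the "4 of 5 beyond 1 sigma" rule, even though no single week is extreme on its own.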

For example, say that over the last few weeks, three overall Perceived Quality values fell more than one standard deviation below the mean. A Frequentist statistician would posit that the probability of such an event occurring (if the mean is where we expect it to be and the variations are generated by random chance) is less than 1%. They would then conclude that the variation is likely caused by a contamination and alert the producer. Though this approach is statistically sound, it has serious limitations in its accuracy and its ability to flag flaws as they occur.
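The "less than 1%" figure can be checked with a quick calculation. The sketch below assumes the example refers to three independent, normally distributed weekly values, each falling more than one standard deviation below the mean. That reading is our assumption for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


# Probability that a single value lands more than 1 sigma below the mean.
p_one = normal_cdf(-1.0)

# Probability that three independent values all do so.
p_three = p_one ** 3

print(f"one value:    {p_one:.4f}")
print(f"three values: {p_three:.4f}")
```

Under these assumptions a single low value is unremarkable (roughly a 16% chance), while three of them together fall well under the 1% mark.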

Data Scientists and Mathematicians will note that the Western Electric rules rely heavily on Frequentist assumptions and a Normal distribution, which does not always fit our data very well. Since our PQ scores have upper and lower limits, the Normal distribution is a poor fit for scores close to 1 or 7. Behind the scenes of the Bayes model, we assume that the data follows a Poisson rather than a Normal distribution, because its variance changes with how high or low the expected PQ values are.

Furthermore, there are several confounding factors in the data that the model needs to be robust enough to account for. For example, a person's mood has an unconscious impact on how much they enjoy a taste; this could mean that reviews done on Monday have a lower-than-normal Perceived Quality score on average. Similarly, there could be long term seasonal effects such as a preference for stouts in the winter and India Pale Ales in the summer—these are perception shifts in the underlying population of reviewers and consumers of our clients' products, not true batch variations.

In the figure above, we see the same data, analyzed with Bayesian rules instead of Frequentist ones. The yellow line represents \(PQ_{normal}\), the value we would expect the Perceived Quality to have if there were no contamination. The blue line represents \(PQ_{low}\), a drop in Perceived Quality significant enough that we want to alert the brewer. The yellow points deviate significantly from the \(PQ_{low}\) hypothesis, and the blue points deviate significantly from the \(PQ_{normal}\) hypothesis.

We updated our model to the Bayesian Poisson statistic because it gives us more useful information. After we generate two hypotheses (\(PQ_{normal}\) and \(PQ_{low}\)) from past data, the model tells us how likely it is that the Perceived Quality is at a normal value versus how likely it is that the Perceived Quality has dropped significantly (to \(PQ_{low}\)), based on the data we have. This is more useful than the Western Electric statistic, which only gives us the probability that the data was generated by a normal product.

When deciding how much the data supports either hypothesis, the Bayesian statistic takes two factors into account.

\(P(PQ_{low})\) : Based on past information, how likely is it that this hypothesis is true (i.e. what is the probability of a contamination in beer X?).
\(P(data | PQ_{low})\): If \(PQ_{low}\) were true, how likely is it that we would see this data?
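These two factors are the prior and the likelihood in Bayes' theorem. As a sketch, assuming \(PQ_{low}\) and \(PQ_{normal}\) are the only two hypotheses under consideration, they combine as:

\[
P(PQ_{low} \mid data) = \frac{P(data \mid PQ_{low})\, P(PQ_{low})}{P(data \mid PQ_{low})\, P(PQ_{low}) + P(data \mid PQ_{normal})\, P(PQ_{normal})}
\]

The denominator simply normalizes the result so that the probabilities of the two hypotheses, given the data, add up to one.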

For example, if you heard a banging, rattling noise in your attic, you might guess that the noises are made by either a ghoul playing the snare drum or a raccoon riding a skateboard. For these hypotheses, \(P(Ghoul)\) would obviously be very low while \(P(Raccoon)\) would be higher (assuming you don’t live in Tahiti, where there are no raccoons but lots of ghouls).

Continuing the example, knowing the behavioral patterns of ghouls and raccoons, \(P(banging | Ghoul)\) would be barely believable while \(P(banging | Raccoon)\) would be very high. Both factors are important when considering any theory, and the Bayesian statistic does a good job of balancing both, assuming you start with the right assumptions and enough background data to determine the underlying probabilities of your intended inference.

We calculate \(P(data | PQ_{low})\) and \(P(data | PQ_{normal})\) using our database of over 10,000 coffee reviews, 6,000 beer reviews, and thousands of reviews across our clients' other products. We assume a Poisson distribution, which looks similar to a Normal distribution. The difference is that, instead of considering the distance of each point from the mean in standard deviations, it takes in only the number of points that deviate significantly from the mean, then compares it to the number of points that we expect to deviate significantly (given normal random variations).

We find \(\lambda\), the expected number of deviations, from the probability of generating a bad review (\(P_{badrev}\)). Now that the initial model is built, we have to take into account the other confounding factors and expand the model to an \(\mathbb{R}^{24}\) joint probability distribution.
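As a rough illustration of this step, the sketch below assumes \(\lambda\) is simply the number of reviews multiplied by \(P_{badrev}\), and evaluates the Poisson probability of seeing a given count of bad reviews. The function names and the numbers are hypothetical. The exact way we derive \(\lambda\) is not spelled out here.

```python
import math


def expected_bad_reviews(n_reviews, p_badrev):
    """Assumed form of lambda: the expected count of bad reviews."""
    return n_reviews * p_badrev


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when lam are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Hypothetical week: 40 reviews, 5% chance that any one review is bad.
lam = expected_bad_reviews(40, 0.05)

for k in (0, 2, 6):
    print(k, round(poisson_pmf(k, lam), 4))
```

With about two bad reviews expected, seeing two is quite likely, while seeing six is rare enough to suggest that something other than random variation is at work.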

This functions as a conceptual map for the entire Bayesian Western Electric classifier. We use Neural Nets and feature extraction to generate hypotheses, adjusting for confounding factors that have influenced PQ scores in the past (like day of the week). We then use the Poisson distribution and the Bayesian statistic to calculate how much the data supports either hypothesis, and if the posterior probability \(P(PQ_{low} | data)\) ends up being higher than \(P(PQ_{normal} | data)\), we alert the brewer.
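The final comparison can be sketched in a few lines. This is a simplified illustration, not our production classifier: it assumes only two hypotheses, a Poisson likelihood on the count of bad reviews, and hypothetical values for the prior and for the expected counts under each hypothesis.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when lam are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def posterior_low(k_bad, lam_normal, lam_low, prior_low):
    """Posterior probability of the PQ_low hypothesis given k_bad bad reviews."""
    weight_low = poisson_pmf(k_bad, lam_low) * prior_low
    weight_normal = poisson_pmf(k_bad, lam_normal) * (1.0 - prior_low)
    return weight_low / (weight_low + weight_normal)


def should_alert(k_bad, lam_normal=2.0, lam_low=6.0, prior_low=0.05):
    """Alert when PQ_low is more probable than PQ_normal given the data."""
    return posterior_low(k_bad, lam_normal, lam_low, prior_low) > 0.5


for k in (2, 6, 8):
    print(k, round(posterior_low(k, 2.0, 6.0, 0.05), 3), should_alert(k))
```

Note how the low prior holds the alert back: with these made-up numbers, six bad reviews is the most likely outcome under \(PQ_{low}\), yet it is still not enough to outweigh how rare contaminations are assumed to be.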

We have also expanded the same classifier to monitor the other 24 dimensions of flavor, which lets us check for certain kinds of contamination. Because several of the flavor variables are correlated with each other, each type of contamination tends to raise or lower several of the variables (usually around 50%) to irregular levels. If more than 20% of the variables are flagged for being higher or lower than they should be, we assume there is some contamination in the product (as the probability of 20% of the variables being flagged by accident is extremely low).
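To get a feel for why accidental flags on more than 20% of the variables are so unlikely, here is a back-of-the-envelope binomial calculation. It assumes each of the 24 variables is falsely flagged independently with a hypothetical probability of 1%. Since the flavor variables are in fact correlated, treat this as a rough illustration rather than the calculation our model performs.

```python
import math


def prob_at_least(k, n, p):
    """Binomial tail: probability of k or more successes in n trials."""
    return sum(
        math.comb(n, j) * p ** j * (1 - p) ** (n - j)
        for j in range(k, n + 1)
    )


n_dims = 24
# "More than 20%" of 24 variables means at least 5 flagged variables.
min_flags = math.floor(0.20 * n_dims) + 1
p_false_flag = 0.01  # hypothetical per-variable false-flag rate

p_accident = prob_at_least(min_flags, n_dims, p_false_flag)
print(min_flags, p_accident)
```

Under these assumptions the chance of crossing the threshold by accident is on the order of a few in a million.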

This, along with the initial truth values for the Bayesian model, will be found using not only the means and standard deviations of the data points, but also a Neural Network and Principal Component Analysis to isolate key variables. This again is because our perception of taste changes depending on a large variety of factors. All these factors must be accounted for by this model. In contrast to the standard Western Electric Frequentist Analysis, which must operate under strict, simplistic normal assumptions, our model is able to dynamically infer what a normal and contaminated batch would look like given conditions we have encountered in the past.

We are pleased that our new Bayesian model will find flaws with more accuracy than the Frequentist model. However, what we are really excited about is the fact that our models give us more usable information. For example, using feature extraction to dynamically predict the value of the normal and low Perceived Quality values, we can determine which correlating factors have the biggest impact on Perceived Quality.

This information might be useful in letting producers know how they can optimize their brewing or reviewing process. Thus, instead of just flagging an unexplained variation, our model will eventually return a full diagnosis, so that producers know what they need to fix.

If you’re a producer of beer, coffee, or spirits who wants to benefit from quality control, sign up for our 31-day free trial here. If you’re a data scientist passionate about craft products and applying data science to human sensory data, check out our openings here.

Sidharth Dhawan

Data-Science & Web Design Intern

Sidharth is a web design/data science intern at Analytical Flavor Systems. He studies computer science and math at Princeton University. When he isn't busy with his duties as a student and intern, he enjoys making artwork, swimming, and curling up with a good book.