Thursday, November 05, 2009
Data Mining for Satisfying the Finicky
So we have cats. Three, currently - Caesar, a rescue cat; Lenora, a shelter cat; and Gabby, a stray - out of a lifetime population of five, including Nero, the rescue cat's brother, who disappeared (probably eaten by coyotes), and Graycat, another stray, pictured, whom we unfortunately had to have executed by the state (because only I could handle him, using gloves, and we were afraid he was going to come knife us in our sleep).
So the three remaining cats are somewhat finicky. There are foods they will love, foods they will grudgingly eat, foods they will eat but then puke up, and foods they will (quite literally) try to bury as if they were crap. So I've been meaning for a long time to keep a diary of the food choices and the cats' reactions, to find out what we can actually feed them.
Data mining researchers claim that getting high-quality input data is the hardest part of a machine learning problem, so I started off with some exploratory data collection in Excel. After letting (thoroughly washed!) cans pile up for a week in two bins, I entered these into a spreadsheet and started to figure out how the data should be represented. I ended up with these columns:
- Brand: Fancy Feast, Nutro, etc.
- Type: Regular, Max Cat Gourmet Classics, etc.
- Flavor: Savory Salmon Feast, White Meat Chicken Florentine with Garden Greens, etc.
- Consistency: Flaked, Pate, Grilled, etc.
- Target: Adult or Kitten
- Package: Can, Tray or Packet
- Ratings: +1 or -1
After collecting this data, I started to analyze it. First I sorted the data; then I eliminated duplicates and added Servings, AggregateRating and Average columns, summing the Ratings into the AggregateRating so that, for example, a flavor with two +1 ratings and one -1 rating would get 3 Servings and an AggregateRating of +1. From these I computed the Average, which I used to re-sort the table to see which brands worked best.
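For the curious, here's roughly what that aggregation amounts to, as a minimal Python sketch; the flavors and ratings below are made up for illustration:

    from collections import defaultdict

    # Hypothetical raw data: one (flavor, rating) pair per serving.
    servings = [
        ("Savory Salmon Feast", +1),
        ("Savory Salmon Feast", +1),
        ("Savory Salmon Feast", -1),
        ("Chicken Florentine", -1),
    ]

    by_flavor = defaultdict(list)
    for flavor, rating in servings:
        by_flavor[flavor].append(rating)

    # Servings = number of ratings, AggregateRating = their sum,
    # Average = AggregateRating / Servings.
    for flavor, ratings in sorted(by_flavor.items()):
        print(flavor, len(ratings), sum(ratings), sum(ratings) / len(ratings))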
The problem is, this Average wasn't that meaningful. One vote for a flavor isn't as meaningful as three, because the cats aren't consistent. This is the flip side of the Law of Large Numbers: in the presence of noise, you need many ratings before an average settles down to anything meaningful.
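A quick simulation makes the point; the 70% preference here is an invented number, not a measurement:

    import random

    random.seed(42)

    # Pretend the cats like a flavor 70% of the time: each serving is
    # +1 with probability 0.7, otherwise -1, so the true average is +0.4.
    def average_of(n):
        return sum(1 if random.random() < 0.7 else -1 for _ in range(n)) / n

    # With 1 rating the average is always +1 or -1; with 100 it settles near +0.4.
    for n in (1, 3, 10, 100):
        print(n, [round(average_of(n), 2) for _ in range(5)])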
I decided to set the number of ratings I cared about at three, based on an anecdotal comment from Roger Schank, my thesis advisor's thesis advisor, who reportedly said you need to visit a restaurant three times to give it a fair rating: any restaurant can have one off day (or one great day), so you need at least three visits to get an idea of its consistency.
At first I decided to track this using a smoothed average, AggregateRating / (Servings + 3), but this depressed the all-positive and all-negative scores more than I liked: that kind of smoothing works well only when the number of ratings is large compared to the smoothing constant. So I chose a simpler max-based approach, AggregateRating / Max(Servings, 3), so that a single serving earns at most a 33% positive or negative rating, but three or more consistent servings can max it out at 100%.
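To make the difference concrete, here's a minimal sketch of the two scoring functions, in Python rather than Excel:

    # Score a flavor from its list of +1/-1 ratings.
    def smoothed_score(ratings, k=3):
        # Laplace-style smoothing: dividing by Servings + k means even a
        # unanimous flavor can never reach 100% on only a few ratings.
        return sum(ratings) / (len(ratings) + k)

    def max_based_score(ratings, floor=3):
        # Divide by at least `floor` servings: a single +1 scores 1/3,
        # but three consistent +1s score 3/3 = 100%.
        return sum(ratings) / max(len(ratings), floor)

    for ratings in ([+1], [+1, +1, +1], [+1, +1, -1]):
        print(ratings, smoothed_score(ratings), max_based_score(ratings))

Note how the smoothed score caps a perfect three-for-three flavor at 50%, while the max-based score lets it earn the full 100%.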
That enabled me to make some findings, but then I realized I'm an idiot. I'd picked up the smoothed average idea from Empirical Methods for Artificial Intelligence, a book any serious computer scientist should read. And I'd edited my data in the spreadsheet so I could compute that average. But what I should have been thinking about was The Pragmatic Programmer, specifically the tips Keep Knowledge In Plain Text and Use Source Control.
Why Keep Knowledge In Plain Text? The cats aren't just finicky; their tastes change, especially if you overfeed them one thing. So the date at which a cat turns on a food is important. By entering the data into Excel, I first had to have a computer on hand, which encouraged me to let the cans pile up; so I lost both the date information and some of the rating information, recording only a coarse-grained +1/-1 rather than "Ate Instantly"/"Ate Completely"/"Left Unfinished"/"Refused or Puked Up"/"Tried to Bury". A superior strategy would have been a pen-and-paper notebook in which I recorded each can a few hours after it was served. This could be entered into a text file a few days later, and if it were tab- or comma-separated, Excel could easily import it. Then, with that data, I could even have applied other techniques from Empirical Methods for Artificial Intelligence, like using a sliding time-series window to make sure I'm analyzing the cats' current tastes, as sketched below.
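As a sketch of what that could look like, assuming a hypothetical tab-separated log and made-up numeric scores for the reactions, a sliding time-series window is just a date filter applied before the aggregation:

    import csv
    from datetime import date, timedelta

    # Hypothetical log format: date<TAB>flavor<TAB>reaction.
    LOG = (
        "2009-10-02\tChicken Florentine\tTried to Bury\n"
        "2009-10-28\tSavory Salmon Feast\tAte Instantly\n"
        "2009-11-02\tSavory Salmon Feast\tAte Completely\n"
    )

    # Map the fine-grained reactions onto invented numeric scores.
    SCORES = {"Ate Instantly": 2, "Ate Completely": 1, "Left Unfinished": 0,
              "Refused or Puked Up": -1, "Tried to Bury": -2}

    # Keep only servings inside a 30-day window ending on a given day,
    # so old ratings age out as the cats' tastes drift.
    window_end = date(2009, 11, 5)
    cutoff = window_end - timedelta(days=30)
    for day, flavor, reaction in csv.reader(LOG.splitlines(), delimiter="\t"):
        if date.fromisoformat(day) >= cutoff:
            print(day, flavor, SCORES[reaction])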
And why Use Source Control? Because I edited my Excel file in place, dummy, without even versioning it as v1, v2, v3 the way I do with documents. So I actually entered this data in two phases, and some of the temporal information I could have recovered has been lost.
So I'm going to improve my procedures going forward. Nevertheless, I did get some nice preliminary data, which jibes well with the observations Sandi and I had made informally. I'm going to reserve judgment until I have more data, but so far Fancy Feast is the best brand, and Cod, Sole and Shrimp Feast and Savory Salmon Feast are the winningest flavors. Newman's Own Organics and Halo Spot's Stew were the worst brands - the cats refused even to touch them - which is odd, because Newman's Own makes great human food (try Newman O's) and Halo makes great dry food the cats love.
More results as the votes continue to trickle in...
-the Centaur
Labels: Development, The Cats