
Minimizing anomaly in assignments of baskets to customers

 

One approach involves minimizing total anomaly in the assignment of baskets to customers.

Definition: A partial assignment $A$ groups some of the baskets of purchases according to whether they were purchased by the same customer. Each group also includes an identifier c for the customer and a classification class(c) of the customer. The set of baskets associated with the putative customer c will be denoted by baskets(c).

Definition: A complete assignment groups all of the purchase baskets.

If there are N baskets in the database, there are something like $N^N$ complete assignments, fewer in fact because permuting the customers gives the same grouping.
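
For a sense of scale, here is a minimal Python sketch that compares two counts: the number of ways to give each of N baskets one of N customer labels, $N^N$, and the number of distinct groupings once customer labels are ignored, which is the Bell number $B_N$. Reading the estimate above as $N^N$ is an assumption, and the function names are illustrative.

```python
def labeled_assignments(n: int) -> int:
    """Ways to give each of n baskets one of n customer labels (assumed reading)."""
    return n ** n


def bell_number(n: int) -> int:
    """Number of ways to partition n distinct baskets into unlabeled groups."""
    if n == 0:
        return 1
    # Bell triangle: each new row starts with the last entry of the previous
    # row, and each further entry adds the entry above-left to the one before.
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


if __name__ == "__main__":
    for n in (3, 5, 10):
        print(n, labeled_assignments(n), bell_number(n))
```

For N = 10 the labeled count is 10^10 while the number of distinct groupings is 115,975. Both grow far too quickly for a real database to be searched by enumeration.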

Definition: Associated with each assignment will be a numerical total anomaly measuring how anomalous the assignment is. The program's goal is to find an assignment (or maybe many assignments) that minimizes the total anomaly.

The total anomaly $\mathit{anom}(A)$ of an assignment $A$ is the sum of two main terms,

$$\mathit{anom}(A) = \mathit{anom1}(A) + \mathit{anom2}(A).$$

$\mathit{anom1}(A)$ is itself a sum

$$\mathit{anom1}(A) = \sum_c \mathit{anom11}(c),$$

where the variable c ranges over the set of customers to which the baskets are assigned. $\mathit{anom2}(A)$ concerns global properties of the set of assignments.
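
The decomposition into a per-customer sum plus a global term can be sketched in Python as follows. This is only an illustration of the structure: the type aliases, the function `total_anomaly`, and the two scoring callbacks are hypothetical names, and the text does not specify how either term is actually computed.

```python
from typing import Callable, Dict, List

Basket = List[str]                     # a basket is a list of item names
Assignment = Dict[str, List[Basket]]   # customer identifier c -> baskets(c)


def total_anomaly(
    assignment: Assignment,
    customer_anomaly: Callable[[str, List[Basket]], float],
    global_anomaly: Callable[[Assignment], float],
) -> float:
    """Sum the per-customer term over all customers, then add the global term."""
    per_customer = sum(
        customer_anomaly(customer, baskets)   # plays the role of anom11(c)
        for customer, baskets in assignment.items()
    )
    return per_customer + global_anomaly(assignment)


if __name__ == "__main__":
    # Toy scoring functions, invented purely to exercise the structure.
    example = {"c1": [["milk"], ["milk", "soap"]], "c2": [["beer"]]}
    score = total_anomaly(
        example,
        customer_anomaly=lambda c, baskets: float(len(baskets)),
        global_anomaly=lambda a: 0.5 * len(a),
    )
    print(score)
```

Passing the two terms in as functions keeps the sum itself separate from whatever scoring rules an experiment chooses to try.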

Definition: Associated with an assignment $A$ and a customer c is a characterization $\mathit{char}(c)$ of the putative customer. The characterization may include qualitative characteristics like sex or owning a freezer, quantitative characteristics like age or income group, and other customer characteristics like a certain purchase signature. The anomaly anom11(c) associated with a customer c depends on the characterization $\mathit{char}(c)$. Thus buying chewing tobacco or baby food is more anomalous for some customers than for others. A program that generates assignments will generate characterizations as it groups the baskets by customer. The characterization itself will contribute to the anomaly if it is an unusual characterization.

Definition: A signature is a set of choices among alternate brands or sizes of certain commodities. The commodities most useful for signatures are those for which variety is not normally considered desirable. While a person may want variety in food, he is unlikely to want variety per se in dish-washing soap, toilet paper or size of dog food. Signatures are included in the characterization of a customer.

The part of the anomaly anom11(c) associated with the putative customer c is computed relative to the characterization. Thus if c is characterized as a single young female, a purchase of chewing tobacco should receive a higher anomaly score than it would for a male.
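
One way to make an anomaly score depend on the characterization is to score each purchase by its surprisal, the negative log of its probability under that characterization. The sketch below assumes this scoring rule, which the text does not prescribe, and the probabilities in the table are invented for illustration only.

```python
import math

# Hypothetical per-visit purchase probabilities, invented for illustration.
PURCHASE_PROB = {
    "single young female": {"chewing tobacco": 0.001, "baby food": 0.02},
    "single young male": {"chewing tobacco": 0.03, "baby food": 0.01},
}


def purchase_anomaly(characterization: str, item: str,
                     default_prob: float = 0.01) -> float:
    """Surprisal in bits of buying `item` given the customer's characterization."""
    prob = PURCHASE_PROB.get(characterization, {}).get(item, default_prob)
    return -math.log2(prob)


if __name__ == "__main__":
    for who in PURCHASE_PROB:
        print(who, round(purchase_anomaly(who, "chewing tobacco"), 2))
```

With these made-up numbers the same purchase scores higher for the first characterization than for the second, which is the behavior the paragraph asks for.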

One way of looking at minimizing anomaly of assignments is that we want to explain as much of the purchasing behavior as possible by allowable characterizations of the customers.

We regard the notion of minimizing anomaly in the space of assignments as a guiding theoretical idea. Programs may find complete assignments, but they are unlikely to do so by comparing a large number of alternative complete assignments. Instead they are likely to do hill climbing in the space of partial assignments.

Here are some kinds of terms that may be associated with the customer part of the anomaly function.

  1. A measure of the temporal irregularity of the customer's purchases. Perishable, non-freezable items like milk need to be purchased at a fairly regular rate. If baby food is purchased, it also is consumed at a regular rate, although it can be stored. Some customers will be very irregular, but an assignment shouldn't make most of them irregular.
  2. A measure of the extent to which the grouped baskets do not fit the characterization $\mathit{char}(c)$.
  3. Signatures involving a large variation in brands of certain items should contribute to the anomaly.
  4. A lot of variation in a putative customer's purchase quantity of a frequently bought item. This suggests that the same person didn't buy all those baskets.
  5. A customer buys food, stores it for a while and eats it. Thus the contents of his larder are a function of time. The database tells about the purchasing but not directly about the eating or the state of the larder. We can attribute a larder function of time to a customer as part of the ascription and use some measure of its irregularity as a component of the anomaly.
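
Two of the terms above, temporal irregularity and the larder function, can be sketched in Python. Both functions are illustrative assumptions rather than definitions from the text: irregularity is taken to be the coefficient of variation of the gaps between purchase days, and the larder is modeled with a constant daily consumption rate.

```python
from statistics import mean, pstdev
from typing import Dict, List


def temporal_irregularity(purchase_days: List[int]) -> float:
    """Coefficient of variation of the gaps between successive purchase days."""
    days = sorted(purchase_days)
    gaps = [later - earlier for earlier, later in zip(days, days[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return 0.0  # too little data to call the pattern irregular
    return pstdev(gaps) / mean(gaps)


def larder_levels(purchases: Dict[int, float], daily_use: float,
                  horizon: int) -> List[float]:
    """Stock of one item at the end of each day, assuming constant consumption."""
    stock = 0.0
    levels = []
    for day in range(horizon):
        stock += purchases.get(day, 0.0)     # add what was bought that day
        stock = max(0.0, stock - daily_use)  # the larder cannot go negative
        levels.append(stock)
    return levels


if __name__ == "__main__":
    print(temporal_irregularity([0, 7, 14, 21]))   # weekly milk purchases
    print(temporal_irregularity([0, 1, 14, 15]))   # bunched purchases
    print(larder_levels({0: 7.0}, daily_use=1.0, horizon=3))
```

Perfectly weekly purchases score zero and bunched purchases score well above it. A reconstructed larder that repeatedly runs empty or piles up would be one sign that baskets have been grouped wrongly.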

Here are some ideas about programs for finding assignments.

  1. We hill climb in the space of partial assignments. For example, moving a purchase from one customer to another may reduce the anomaly of both customers' ascribed larder functions.
  2. We might proceed chronologically, assigning each basket to either a previously postulated customer or to a new one.
  3. At first new customers would predominate. However, when the number of postulated customers begins to get too large for the number of baskets, the program would try to reduce the number by combining baskets.
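
The chronological idea in item 2 can be sketched as a single greedy pass. This is a simplification under stated assumptions, not a full hill climber: the cost function `jaccard_cost`, the threshold `new_customer_cost`, and the representation of a basket as a set of item names are all hypothetical, and the later step of combining customers is not implemented.

```python
from typing import Callable, List, Set

Basket = Set[str]


def jaccard_cost(group: List[Basket], basket: Basket) -> float:
    """Toy misfit: 1 minus the Jaccard overlap between the basket and the group."""
    seen = set().union(*group)
    union = seen | basket
    if not union:
        return 0.0
    return 1.0 - len(seen & basket) / len(union)


def assign_chronologically(
    baskets: List[Basket],
    fit_cost: Callable[[List[Basket], Basket], float],
    new_customer_cost: float,
) -> List[List[Basket]]:
    """Give each basket, in order, to the best-fitting customer or to a new one."""
    customers: List[List[Basket]] = []
    for basket in baskets:
        best_index, best_cost = None, new_customer_cost
        for index, group in enumerate(customers):
            cost = fit_cost(group, basket)
            if cost < best_cost:
                best_index, best_cost = index, cost
        if best_index is None:
            customers.append([basket])   # postulate a new customer
        else:
            customers[best_index].append(basket)
    return customers


if __name__ == "__main__":
    day_by_day = [{"soapA", "milk"}, {"soapB", "beer"}, {"soapA", "bread"}]
    print(assign_chronologically(day_by_day, jaccard_cost, new_customer_cost=0.8))
```

In the example the third basket joins the first customer because the two share a soap brand. Lowering `new_customer_cost` makes the pass postulate new customers more readily.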

How can it be inferred that several cash purchases involved the same customer? We only need to be correct often enough so that the statistics come out right. Each customer has his own pattern of purchases. Here are some considerations.

  1. The signature $\mathit{sig}(c)$ is a purchase pattern unique to the customer c. Consider items where variety is not normally desired, e.g. dishwasher soap. There are several brands, but a customer will normally stick with one for quite a long time. If there are 5 brands and 50 such kinds of items, there are enough possible signatures to distinguish far more customers than a store or even a chain is likely to have. Of course, a customer is unlikely to purchase a complete signature package each time he goes to the store, so partial signatures will have to be used.
  2. The ingredients for particular recipes are sometimes diagnostic, especially when the recipe is unique to the customer or is a standard recipe varied in a unique way.
  3. An important intermediate variable for a customer is the state of his larder at a given time. He likes to have certain items in stock in his refrigerator or freezer.
  4. The customer makes choices in a certain pattern, e.g. buys creamy rather than chunky peanut butter. Which choices are made is more indicative than whether peanut butter is bought at all on a particular occasion, since the customer may not have run out yet.
  5. Suppose a store has 10,000 items and has 12,000 customers. Suppose purchases average 20 items. My information theory intuition suggests that there is enough information to identify the customers over some 20 shopping trips. The information theory numbers can be analyzed, but experiment is still required to determine feasibility.
  6. Sometimes it will be impossible to assign a basket to a customer. As an extreme example, suppose that within ten minutes two customers each buy a six pack of the same brand of beer and nothing else. Which one made which purchase will be impossible to tell, but it won't matter which purchase is assigned to which customer.
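
The counting behind items 1 and 5 is easy to check with a short Python sketch. It only reproduces the arithmetic: 5 brands for each of 50 kinds of item, and the number of bits needed to single out one of 12,000 customers. The function names are illustrative, and the per-trip figure assumes the information gathered on different trips simply adds up, which real baskets will only approximate.

```python
import math


def signature_count(brands_per_item: int, item_kinds: int) -> int:
    """Number of complete signatures: one brand choice for each kind of item."""
    return brands_per_item ** item_kinds


def bits_to_identify(customers: int) -> float:
    """Bits needed to single out one customer among equally likely candidates."""
    return math.log2(customers)


if __name__ == "__main__":
    signatures = signature_count(5, 50)
    bits = bits_to_identify(12_000)
    print(f"possible signatures: {signatures:.3e}")
    print(f"bits to identify one of 12,000 customers: {bits:.2f}")
    print(f"bits needed per trip over 20 trips: {bits / 20:.2f}")
```

There are roughly 8.9 × 10^34 complete signatures, against about 13.6 bits (under 0.7 bits per trip over 20 trips) needed to tell 12,000 customers apart. Whether real baskets carry that much usable information is the experimental question.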



John McCarthy
Thu Apr 6 16:23:28 PDT 2000