Here’s the scenario: you scope out a bunch of different
establishments (documents) across town, making note of the people (words) hanging out in each of them (e.g., Alice hangs out at the mall and at the park, Bob hangs out at the movie theater and the park, and so on). Crucially, you don’t know the typical interest groups (topics) of each establishment, nor do you know the different interests of each person.
So you pick some number K of categories to learn (i.e., you want to learn the K most important kinds of categories people fall into), and start by making a guess as to why you see people where you do. For example, you initially guess that Alice is at the mall because people with interests in X like to hang out there; when you see her at the park, you guess it’s because her friends with interests in Y like to hang out there; when you see Bob at the movie theater, you randomly guess it’s because the Z people in this city really like to watch movies; and so on.
Of course, your random guesses are very likely to be incorrect (they’re random guesses, after all!), so you want to improve on them. One way of doing so is to:
- Pick a place and a person (e.g., Alice at the mall).
- Why is Alice likely to be at the mall? Probably because other people at the mall with the same interests sent her a message telling her to come.
- In other words, the more people with interests in X there are at the mall and the stronger Alice is associated with interest X (at all the other places she goes to), the more likely it is that Alice is at the mall because of interest X.
- So make a new guess as to why Alice is at the mall, choosing an interest with some probability according to how likely you think it is.
- For each category, you can count the people assigned to that category to figure out what people have this particular interest. By looking at the people themselves, you can interpret the category as well (e.g., if category X contains lots of tall people wearing jerseys and carrying around basketballs, you might interpret X as the “basketball players” group).
- For each place P and interest category C, you can compute the proportions of people at P because of C (under the current set of assignments), and these give you a representation of P. For example, you might learn that the people who hang out at Barnes & Noble consist of 10% hipsters, 50% anime fans, 10% jocks, and 30% college students.
Source: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
No comments:
Post a Comment