Monday, April 6, 2009

spectral clustering

In machine learning the other day, we learned about spectral clustering. This has got to be one of the most beautiful algorithms I've seen in months. It doesn't require you to come up with any kind of model for how your data was generated; all you need is a similarity measure between any two data points (e.g. an exponential fall-off with distance). The measure doesn't have to satisfy the triangle inequality (so it doesn't have to be a true distance metric), but it should be symmetric. Here's how the algorithm works:

If you have N data points, then define an NxN matrix M where the value of Mij is the similarity between point i and point j (the edge weight). Now take the second eigenvector of this matrix -- in practice you usually take it from the graph Laplacian or the row-normalized random-walk version of M, but the idea is the same. Partition the data points according to whether the corresponding entry in the second eigenvector is positive or negative. Ta-da! Now you have a partition of the points into two sets. You can use the eigenvalues (as opposed to the eigenvectors) to determine how strong the partition is. If you want to make sure that your clusters are roughly the same size, then instead of partitioning based on positive/negative, you can sort the values first and then partition around the median value.
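Here's a minimal sketch in Python/numpy of that two-way split, the way I understand it. The toy data, the sigma of the exponential fall-off, and the row-normalization step are my own illustrative choices, not part of anyone's official recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated 2-D blobs.
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

# Similarity matrix: exponential fall-off with squared distance (sigma = 1.0).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dists / (2 * 1.0 ** 2))          # symmetric by construction

# Row-normalize so each row sums to 1 (the "random walk" form of the matrix).
P = W / W.sum(axis=1, keepdims=True)

# Eigen-decompose and sort by eigenvalue, largest first.
vals, vecs = np.linalg.eig(P)
order = np.argsort(-vals.real)
second = vecs[:, order[1]].real                  # the second eigenvector

# Partition by sign (or split around the median for equal-sized clusters).
labels = (second > 0).astype(int)
print(labels)
```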

If you need more than two clusters, you can either recursively subdivide the binary partitions, or you can embed the points using more than one eigenvector and then run a different clustering algorithm on the result. For example, if you need three clusters, you can plot the second eigenvector against the third eigenvector and use k-means clustering on the resulting 2D data set. For more clusters, use more eigenvectors.
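Continuing the sketch above (it reuses vecs and order), the multi-cluster version just keeps more eigenvectors as an embedding and hands it to k-means -- here scikit-learn's KMeans; its SpectralClustering class bundles all of these steps if you'd rather not do them by hand:

```python
from sklearn.cluster import KMeans

k = 3
embedding = vecs[:, order[1:k]].real    # columns = 2nd, 3rd, ... eigenvectors
labels = KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```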

This algorithm is beautiful because of the robustness of the result, because it doesn't need an underlying model for how the points were generated (such as a mixture of Gaussians), and because of the intuitive interpretation of what the eigenvectors mean.

Here's the intuition. Assume that you start out with a pile of ants at each data point, and at each time step each ant traverses an edge to get to some other data point. The probability of an ant choosing an edge is inversely proportional to the edge's length (because ants are lazy). Now if you let this situation run for an infinite amount of time, the first eigenvector will tell you the expected number of ants at each data point. The second eigenvector, meanwhile, tells you about the slowest-decaying deviation from that steady state -- roughly, what the distribution of ants looks like if you stop just short of an infinite amount of time.

The first eigenvector is really useful for some purposes. For example, the Google PageRank algorithm consists of nothing more than computing the first eigenvector -- replace data points with web pages, edges with hyperlinks, and ants with people, and the first eigenvector will tell you how popular each web page is.
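Here's a toy version of that idea via power iteration on a made-up four-page link graph: spread the ants uniformly, take one random-walk step per iteration, and the distribution converges to the first eigenvector. The 0.85 damping factor is the usual PageRank convention, not Google's actual numbers:

```python
import numpy as np

# Made-up adjacency: entry (i, j) = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

P = A / A.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
n = A.shape[0]
d = 0.85                                 # damping factor

rank = np.full(n, 1.0 / n)               # start with the ants spread uniformly
for _ in range(100):
    rank = d * (rank @ P) + (1 - d) / n  # one random-walk step, plus teleportation
print(rank)                              # converges to the leading eigenvector
```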

For clustering, what you want is the second eigenvector. The idea is that if you start an ant off in one cluster, it should be likely to wander around inside that cluster for a while before finally making its way over to some other cluster. That is, points in the same cluster should be more connected to each other than to points in other clusters. The second eigenvector (and its eigenvalue) captures this notion directly, in a simple mathematical way.

Here's a picture of the result for a problem that isn't linearly separable (shamelessly ripped off from my teacher's lecture slides):

There are a couple of mild caveats to this technique. First, all the theory seems to assume that your clusters are only sparsely connected, so the recommended technique is to connect each point only with its k nearest neighbors. I'm not sure what happens if you fill out every entry in the similarity matrix. Second, the web of points is generally assumed to be connected. These caveats are in direct opposition to each other -- you have to choose k carefully in order to ensure a sparse, yet connected, matrix. I don't yet have enough of an understanding to know why these two things are needed. It may just be that computing eigenvectors of non-sparse matrices is more computationally intensive than for sparse matrices.
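For what it's worth, here's one way the k-nearest-neighbor construction could look; k and sigma are knobs you would have to tune for your data, and scipy.sparse.csgraph.connected_components is a handy way to check that the resulting graph stays in one piece:

```python
import numpy as np

def knn_similarity_graph(X, k=10, sigma=1.0):
    """Sparse, symmetric similarity matrix keeping only each point's k strongest edges."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                 # no self-edges
    keep = np.argsort(-W, axis=1)[:, :k]     # indices of the k largest per row
    mask = np.zeros_like(W, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    W = np.where(mask, W, 0.0)               # zero out everything else
    return np.maximum(W, W.T)                # symmetrize: keep an edge if either end wants it
```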

I wonder what interpretation can be assigned to complex eigenvectors...

Tuesday, August 26, 2008

data preprocessing for web usage mining


One of the huge difficulties of web usage mining is data preprocessing. The most common form of input data is a web server log in CLF or ECLF format, and the preprocessing should end up producing a list of server sessions. A lot of research has been done on this. The well-known doctoral thesis by Robert Walker Cooley presented a detailed process of web usage data preprocessing, including data cleaning, session identification, pageview identification, path completion, and episode identification. But the programming is still a problem. Does anyone have the source code or ready-made preprocessing tools? Please contact me through clearking@gmail.com.
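In the meantime, here is a bare-bones sketch of the first couple of steps (cleaning plus timeout-based session identification) for CLF-style logs. The regex, the 30-minute timeout, and keying sessions on the host field alone are common conventions, not anything taken from Cooley's thesis:

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# host ident authuser [date] "request" status bytes
CLF = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')
TIMEOUT = timedelta(minutes=30)

def parse_line(line):
    m = CLF.match(line)
    if not m:
        return None
    host, ts, request, status, _ = m.groups()
    when = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")
    return host, when, request, int(status)

def sessionize(lines):
    """Group requests into sessions: same host, with gaps under TIMEOUT."""
    sessions = defaultdict(list)              # host -> list of sessions
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec[3] >= 400:      # drop unparsable lines and error responses
            continue
        host, when, request, _ = rec
        runs = sessions[host]
        if runs and when - runs[-1][-1][0] <= TIMEOUT:
            runs[-1].append((when, request))  # continue the current session
        else:
            runs.append([(when, request)])    # start a new session
    return sessions
```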


Thursday, November 23, 2006

web usage mining

i'm doing some study on web usage mining right now. i'm looking for ways to compute the similarity of two web users. some people have used the cosine of their common paths, but how should that cos function be defined?
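here's one plausible reading of it: represent each user as a vector of page-visit counts and take the cosine of the angle between the two vectors. the page names and counts below are made up just for illustration:

```python
import math

def cosine_similarity(u, v):
    """u, v: dicts mapping page -> visit count (or time spent, etc.)."""
    shared = set(u) & set(v)
    dot = sum(u[p] * v[p] for p in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

user_a = {"/index.html": 3, "/products.html": 1}
user_b = {"/index.html": 2, "/contact.html": 4}
print(cosine_similarity(user_a, user_b))   # 1.0 = same direction, 0.0 = nothing in common
```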