Tuesday, August 26, 2008

Data preprocessing for web usage mining


One of the major difficulties in web usage mining is data preprocessing. The most common form of input data is a web server log in CLF or ECLF format, as above. Preprocessing should turn that log into a list of server sessions. A lot of research has been done on this step. The well-known doctoral thesis by Robert Walker Cooley presents a detailed process for preprocessing web usage data, including data cleaning, session identification, pageview identification, path completion and episode identification. The programming is still a problem, though. Does anyone have source code or ready-made preprocessing tools? Please contact me at clearking@gmail.com.
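To make the first two steps concrete, here is a minimal Python sketch of data cleaning and session identification on CLF lines. It is not Cooley's implementation, only an illustration under my own assumptions: a user is identified by the host field alone, requests for images, stylesheets and scripts are dropped along with failed requests, and a new session starts after 30 minutes of inactivity (a common heuristic, not a fixed rule). The names parse_line, is_relevant, sessionize and preprocess are hypothetical. Pageview identification, path completion and episode identification are not covered.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# CLF: host ident authuser [date] "request" status bytes
# ECLF appends referrer and user agent, which this pattern simply ignores.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Assumption: these suffixes mark embedded resources, not page views.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
# Assumption: 30 minutes of inactivity ends a session.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": match.group("method"),
        "url": match.group("url"),
        "status": int(match.group("status")),
    }


def is_relevant(entry):
    """Data cleaning: keep successful GET requests for pages only."""
    path = entry["url"].split("?", 1)[0].lower()
    return (
        entry["method"] == "GET"
        and entry["status"] < 400
        and not path.endswith(IGNORED_SUFFIXES)
    )


def sessionize(entries):
    """Session identification: group by host, split on the inactivity timeout."""
    by_host = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e["time"]):
        by_host[entry["host"]].append(entry)

    sessions = []
    for requests in by_host.values():
        current = [requests[0]]
        for previous, request in zip(requests, requests[1:]):
            if request["time"] - previous["time"] > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append(request)
        sessions.append(current)
    return sessions


def preprocess(lines):
    """Run parsing, cleaning and session identification over raw log lines."""
    parsed = (parse_line(line) for line in lines)
    cleaned = [e for e in parsed if e is not None and is_relevant(e)]
    return sessionize(cleaned)
```

Calling preprocess(open("access.log")) on a log file (the file name is only an example) returns a list of sessions, each a list of request dicts in time order. Identifying users by host alone breaks down behind proxies and shared IP addresses, so with ECLF logs the user agent could be added to the grouping key.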