Abstract:
With the evolution of the Internet and the continuous growth of the global information infrastructure,
the amount of data collected online from transactions and events has
increased drastically. Web server access log files collect substantial data about web
visitor access patterns. Data mining techniques can be applied to such data (a practice
known as web mining) to reveal a lot of useful information about navigational patterns.
In this research we analyze the patterns of web crawlers and human visitors through
web server access log files. The objectives of this research are to detect web crawlers,
identify suspicious crawlers, detect Googlebot impersonation and profile human visitors.
For human visitor profiling, we group similar web visitors into clusters based
on their browsing patterns and then build a profile for each cluster.
We show that web crawlers can be identified and successfully classified using heuristics.
We evaluated our proposed methodology using seven test crawler scenarios. We
found that approximately 53.25% of web crawler sessions were from "known"
crawlers and 34.16% exhibit suspicious behavior.
We present an effective methodology to detect fake Googlebot crawlers by analyzing
web access logs. We propose using Markov chain models to learn profiles of real and
fake Googlebots based on their patterns of web resource access sequences. We have
calculated log-odds ratios for a given set of crawler sessions and our results show that
the higher the log-odds score, the higher the probability that a given sequence comes
from the real Googlebot. Experimental results show that thresholding the log-odds score
lets us distinguish the real Googlebot from the fake.
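
The log-odds scoring described above can be illustrated with a minimal sketch. It assumes first-order Markov chains over resource types and add-one smoothing; the state names, the toy sessions, and the helper names `train_markov` and `log_odds` are hypothetical and chosen only for illustration, not taken from the experiments reported here.

```python
import math
from collections import defaultdict

def train_markov(sessions, states, alpha=1.0):
    """Estimate first-order transition probabilities with add-alpha smoothing."""
    counts = {s: defaultdict(float) for s in states}
    for session in sessions:
        # Count each consecutive pair of requested resource types.
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for s in states:
        total = sum(counts[s].values()) + alpha * len(states)
        probs[s] = {t: (counts[s][t] + alpha) / total for t in states}
    return probs

def log_odds(session, real, fake):
    """Sum of log(P_real / P_fake) over the transitions in a session."""
    return sum(
        math.log(real[a][b] / fake[a][b])
        for a, b in zip(session, session[1:])
    )

# Hypothetical resource-type states and toy training sessions (assumptions).
STATES = ["html", "image", "css", "robots"]
real_sessions = [
    ["robots", "html", "html", "html"],
    ["robots", "html", "css", "html"],
]
fake_sessions = [
    ["image", "image", "html", "image"],
    ["html", "image", "image", "image"],
]

real_model = train_markov(real_sessions, STATES)
fake_model = train_markov(fake_sessions, STATES)

# A positive score favors the "real" model, a negative score the "fake" one.
score_real_like = log_odds(["robots", "html", "html"], real_model, fake_model)
score_fake_like = log_odds(["image", "image", "html"], real_model, fake_model)
print(score_real_like, score_fake_like)
```

In this sketch a session is labeled as coming from the real Googlebot when its score exceeds a chosen threshold. Zero is the natural default, but the threshold would in practice be tuned on labeled sessions.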
For human visitor profiling, we propose an improved similarity measure
and use it as the distance measure in agglomerative hierarchical clustering of
a data set from an e-commerce web site. To generate profiles, frequent item set mining
is applied over the clusters. Our results show that proper visitor clustering can be
achieved with the improved similarity measure.