Thursday, 02 August 2012

PAGE-RANK

1. Introduction
Page-rank is a link analysis algorithm named after Larry Page (founder of Google) and is used by the Google search engine, which assigns numerical weighting to each element of a hyper-linked set of documents such as the World Wide Web. 

Google has become the most utilized search engine in the world. This is due to the high quality
search results in comparison to other search engines. The high quality results are due to the
page-rank feature which is used by Google. Every document is assigned with a ranking while it is
being indexed. When a person searches for a keyword on the internet (using any search engine),
pages with higher page-rank relative to others are displayed at the top of the search results. This
document brie y describes the Page-rank algorithm (used by Google) in two methods; the linear
equations method easy and the iterative method.

2. Page rank

Page-rank is an algorithm which was proposed by Brin and Page in the 1990s. This algorithm is the
core of their search engine Google. There are two ways in which this algorithm can be implemented
in software, but the results will still be showing the same thing, page-ranks. What page-rank is
doing is that it ranks a page relative to other pages in the same network, so it does not just assign
a number to a page, the number is relative to other pages.
The page-rank algorithm as described by Lawry Page and Sergy Brin in several publications is
given by the equation below:

PR(pi) = (1 - d)/N + d  * PR(p) /L(p)

Where p1; p2; ...p(n) are the pages under consideration, M(p) is the set of the pages that link to
p, L(p) is the number of outbound links on page p, n is the total number of pages and d is the
damping factor, which is between zero and one.
When page rank is being calculated, pages with no outbound links (links to other pages) are
assumed to link out to all other pages in the collection. Their page-rank scores are therefore divided
out evenly among all other pages. In other words, to be fair with pages that are not sinks (dangling
pages with no links, e.g pdf and pictures), these random transitions are added to all nodes on the
network with a residual probability d which is described above.
From equation 1 above, it is clear that the page-rank of p is recursively de ned by the page-ranks
of those pages which pi link to.

3. The iterative Computation of page-rank

The size of the actual web is very large, so in order for the search engine to compute the page-ranks
of the pages faster, an iterative method is used. The method described above in section 2.1 solves
linear equations, and it will be expensive if the matrix to solve is 1000*1000 or over.
Thus Google search engine uses an approximation iterative computation of page-rank values.
The iterative method which was used when the page-rank algorithm was rst found is Power
Method. Power Method was used simply because it is simple to implement in software and yet very
powerful.
The way the page-rank is calculated is in the following way:
Directed graphs are used to represent the links to the web pages. A directed graph G, is a pair
of sets (V;E), where the elements of V are called vertices and the set E is contained in a V * V
matrix is called the edge set. Figure 1 below depicts a small web page, just for illustration.
In this graph, the vertices can be labeled V = (1;....; 4) and the edges
E = f(1; 2)(1; 3)(1; 4)(2; 3)(2; 4)(3; 1)(4; 3)(4; 2)g, since there are directed graphs, (1; 3) is not
the same as (3; 1).


Figure 1: The small web-page for illustration


4. Markov Chain from graph

An adjacency matrix of the graph can be de ned, given a directed graph G = (V;E) with N
vertices, the adjacency matrix LG of the graph G which is an N*N matrix can be de ned with
entries:
so the adjacency matrix
and the number of 1s in the adjacency matrix is equal to the number of edges.
Constructing a Markov chain (Stochastic Matrix) given a graph is trivial, the only aspect which
needs to be taken care of is to deal with pages with no outbound, the method is described below
for developing a Stochastic Matrix:


Then the page-ranks p = (p1; p2; ..... ; pN)transpose can be calculated using the stochastic matrix A
using power method, but in order to include the damping factor, equation 1 above can be used, but
written in another form:


where : is the damping factor, L is the stochastic matrix of normalized links, 1 and v is the
the row vector of ones.
Then the page-ranks can be calculated using power method:


where phi is the eigenvector of matrix G which will have the page-ranks of the pages.

5. Discussion
The page-rank algorithm can be implemented in two ways described above, on this project they
were both implemented for illustration, but the one which is more preferable is the one were page-
ranks are calculated iteratively. The reason being that it is fast ans also reliable when dealing with
a network of millions of pages which needs to be ranked accordingly.

The code for page-rang can be found here.





No comments: