We've been talking about eigenvalues and eigenvectors of matrices and, in particular, methods to compute them. These ideas were used by the founders of Google to great effect to pick out the most relevant websites. I'm going to go through the basics of how this works, following Wills' article, which is quite readable and aimed at undergraduate students.
I will not directly test you on material discussed here. However, some important concepts are discussed, including stochastic matrices, the Perron-Frobenius theorem, and sparse matrices in scipy.
# Standard imports
import numpy as np
import matplotlib.pyplot as plt
import math as m
from mpmath import mp, iv
from scipy import linalg
A directed graph consists of a collection of nodes or vertices together with directed edges joining one node to another.
A directed graph naturally arises from thinking about webpages. The nodes are webpages, and you get a directed edge from webpage_0 to webpage_1 if webpage_0 contains a link pointing to webpage_1.
An example of a directed graph is shown below.
Each circled number represents a webpage. There is only one arrow leaving ③, so this indicates that webpage_3 has only one link: a link to webpage_1. However, both webpage_1 and webpage_4 link to webpage_3.
Given a directed graph $G$ with $n$ nodes, we associate to it an $n \times n$ hyperlink matrix $H$. If webpage_i has a total of $l_i$ outgoing links and webpage_j is another page, we define $H_{ij}=\frac{1}{l_i}$ if webpage_i links to webpage_j and $H_{ij}=0$ otherwise.
The hyperlink matrix of the above directed graph is
$$H = \left(\begin{matrix}
0 & \frac{1}{3} & \frac{1}{3} & 0 & \frac{1}{3} \\
\frac{1}{2} & 0 & 0 & \frac{1}{2} & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & \frac{1}{2} & \frac{1}{2} & 0 \\
\end{matrix}\right).$$
Observe that if webpage_i has outgoing links, we have $$1 = \sum_{j} H_{ij}.$$
We think of the entries in row $i$ as giving the probabilities that a "random web surfer" will move from webpage_i to each of the other webpages. Because we weight each linked page equally, our "random web surfer" chooses among the linked pages with equal probability.
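To make this construction concrete, here is a short sketch that builds the hyperlink matrix of the example graph from its adjacency lists. (The dictionary of links and the name H_example are my own; the lists are just read off from the picture above.)
# Adjacency lists read off from the example graph:
# node i maps to the list of nodes it links to.
links = {0: [1, 2, 4], 1: [0, 3], 2: [], 3: [1], 4: [2, 3]}

n = len(links)
H_example = np.zeros((n, n))
for i, targets in links.items():
    for j in targets:
        H_example[i, j] = 1 / len(targets)  # each outgoing link gets weight 1/l_i
H_example
This matches the matrix $H$ displayed above, including the zero row for the dangling node ②.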
A matrix is called sparse if most of its entries are zero. A dense matrix is one in which most of the entries are non-zero.
Sparse matrices are interesting from the viewpoint of computational complexity (speed). For example, multiplying a dense $n \times n$ matrix by a dense column vector has complexity $O(n^2)$. But multiplying an $n \times n$ sparse matrix with $N$ non-zero entries by a vector has complexity $O(N)$, which can be significantly faster (e.g., if $N=cn$ with $c$ a constant much smaller than $n$).
It is reasonable to think that a hyperlink matrix is sparse: if you choose two websites at random, it is very unlikely that there is a direct link between them. It also seems reasonable that the number of non-zero entries in an $n \times n$ hyperlink matrix is roughly $c n$, where $c$ is the average number of pages linked to by a webpage. This $c$ is presumably a modest constant, so matrix-vector multiplication will be of order $O(n)$, which is much faster than $O(n^2)$, especially when $n$ is very large. (Wills' article mentions that "A 2004 investigation of Web documents estimates that the average number of outlinks for a web-page is 52," so $c$ could be about $52$.)
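As a rough illustration of this speed difference, here is a sketch comparing dense and sparse matrix-vector products in scipy. (The matrix is random, with about 50 non-zero entries per row to loosely mimic the estimate above; this is not a benchmark of real web data.)
from scipy import sparse

n = 5000
A_sparse = sparse.random(n, n, density=50/n, format="csr")  # roughly 50 non-zeros per row
A_dense = A_sparse.toarray()                                # the same matrix, stored densely
x = np.ones(n)

# Both products give the same vector, but the sparse product only touches the
# non-zero entries, so its cost grows like O(N) rather than O(n^2).
np.allclose(A_dense @ x, A_sparse @ x)
Timing the two products (for example with %timeit in a notebook) should show the sparse version scaling much better as $n$ grows.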
In our directed graph, webpage_2 is a dangling node, meaning that it has no outgoing links. With our definition of $H$, this means that all the entries in row $2$ are zero, which breaks the pattern of the entries in each row summing to one. We'd like to fix this in some way, replacing the zero row associated to a dangling node with a row of non-negative entries summing to one. A choice of method for doing this is called a dangling node fix.
Wills' article suggests weighting all pages equally. In our case, the result would be the following:
$$ \left(\begin{matrix} 0 & \frac{1}{3} & \frac{1}{3} & 0 & \frac{1}{3} \\ \frac{1}{2} & 0 & 0 & \frac{1}{2} & 0 \\ \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & \frac{1}{2} & \frac{1}{2} & 0 \\ \end{matrix}\right).$$
However, I am not a fan of this method because it makes the matrix less sparse: row $2$ now has no zero entries.
I'm going to suggest a different fix: following a back-link at random. (This is somewhat inspired by the thought that the random web surfer will just use the back button on their browser.) Our dangling node ② has edges coming from ⓪ and ④, so our resulting matrix is $$S=\left(\begin{matrix} 0 & \frac{1}{3} & \frac{1}{3} & 0 & \frac{1}{3} \\ \frac{1}{2} & 0 & 0 & \frac{1}{2} & 0 \\ \frac{1}{2} & 0 & 0 & 0 & \frac{1}{2} \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & \frac{1}{2} & \frac{1}{2} & 0 \\ \end{matrix}\right).$$
This matrix can be seen to be the matrix associated to the following graph (where we inserted arrows as part of the dangling node fix).
For later use, we store our matrix $S$ below.
S = np.array([[ 0, 1/3, 1/3, 0, 1/3],
[ 1/2, 0, 0, 1/2, 0],
[ 1/2, 0, 0, 0, 1/2],
[ 0, 1, 0, 0, 0],
[ 0, 0, 1/2, 1/2, 0]])
S
A (right) stochastic matrix is a square matrix all of whose entries are non-negative so that the entries in each row sum to one. The matrix $S$ above satisfies this condition.
Let ${\mathbf 1}$ be the column vector of all ones. Observe that if $S$ is a stochastic matrix, then $S{\mathbf 1}={\mathbf 1}$. That is, ${\mathbf 1}$ is a right-eigenvector with eigenvalue one. We will use this vector in our example, so we define it below:
one_vector = np.array(5*[[1]])
one_vector
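As a quick sanity check on the matrix $S$ defined above, we can verify numerically that $S{\mathbf 1}={\mathbf 1}$:
S @ one_vector  # should be the column vector of all ones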
It follows from Gerschgorin’s Theorem (Theorem 10, in TAK § 4.6) that all eigenvalues of a stochastic matrix $S$ have absolute value less than or equal to one. The eigenvalue $1$ does not need to be the dominant eigenvalue: There can be other eigenvalues of absolute value one (which might be complex). Also, the eigenspaces associated to the eigenvalue $1$ may be larger than one dimensional. (For an extreme example, observe that the identity matrix is stochastic.)
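For our small example we can check this numerically. The following is just a quick verification using numpy's general-purpose eigenvalue routine (not something we need for the PageRank computation):
eigenvalues = np.linalg.eigvals(S)
print(eigenvalues)
print(np.abs(eigenvalues))  # every absolute value should be at most one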
If $x$ is a row vector then the sum of its entries can be computed as the dot product $x {\mathbf 1}$. A probability row vector is a row vector in which all entries are non-negative and $x {\mathbf 1}=1$. Observe that probability row vectors are closed under right multiplication by a stochastic matrix $S$. (That is, if $x$ is a probability row vector then so is $xS$.) Indeed $xS$ has all non-negative entries and $$(x S) {\mathbf 1} = x (S {\mathbf 1}) = x {\mathbf 1} = 1.$$
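For instance, taking $x$ to be the uniform probability row vector on our five nodes (an ad hoc check using the $S$ and ${\mathbf 1}$ defined above):
x = np.array([5*[1/5]])
print(x @ S)                 # all entries are non-negative
print((x @ S) @ one_vector)  # and they sum to one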
We interpret probability row vectors as probability distributions on the nodes or webpages we are considering.
The Perron-Frobenius Theorem is a fundamental result about matrices that is important to numerous areas of mathematics and its applications, including probability theory, economics, and geometry. Wikipedia has a comprehensive article on the subject. We will content ourselves with understanding the simplest case of the theorem.
The Perron-Frobenius Theorem. Let $A$ be an $n \times n$ matrix all of whose entries are positive. Then there is a positive real number $r>0$ which is an eigenvalue for $A$ and is strictly larger than the absolute value of any other eigenvalue of $A$. Furthermore, the right- and left-eigenspaces associated to $r$ are one-dimensional, and each contains an eigenvector all of whose entries are positive.
Application to Stochastic Matrices: The Perron-Frobenius Theorem implies that if a row stochastic matrix has all positive entries, then $1$ is the dominant eigenvalue and the right- and left- eigenspaces associated to the eigenvalue $1$ are one-dimensional. Furthermore, there are both left- and right- eigenvectors with all positive entries. We already understand the right-eigenvector ${\mathbf 1}$. The theorem guarantees there is a left eigenvector with all entries positive and with eigenvalue one. By rescaling this eigenvector, we observe that there is a unique left-eigenvector with eigenvalue one which is also a probability row vector. This turns out to be the main object we are interested in.
It follows from our theorem about the power method (in the Eigenvalues and Eigenvectors notebook) that if $x$ is a probability row vector and $S$ is a row stochastic matrix with all positive entries, then the powers $x S^k$ converge to the unique probability row vector which is also a left-eigenvector with eigenvalue one.
Following Wills' definition, we define the Google matrix $G$ by: $$G = \alpha S + (1-\alpha) {\mathbf 1} v,$$ where $0 < \alpha \leq 1$ and $v$ is a probability row vector. Here $\alpha$ is the damping factor and $v$ is the personalization vector. This is essentially a trick to turn our stochastic matrix $S$ into a stochastic matrix with all positive entries.
Wills says "the damping factor, $\alpha$, in the Google matrix indicates that random Web surfers move to a different webpage by some means other than selecting a link with probability $1-\alpha$." As suggested in the article we will take $\alpha=0.85$ and take $v$ to be the uniform row probability vector:
alpha = 0.85
print("α = {}".format(alpha))
v = np.array([ 5*[1/5]])
print("v = {}".format(v))
Note what ${\mathbf 1} v$ gives:
one_vector @ v
This is a stochastic matrix with constant entries. The Google matrix is a weighted average of $S$ and ${\mathbf 1} v$. (Weighted averages of stochastic matrices are still stochastic.)
G = alpha*S + (1-alpha) * one_vector @ v
G
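As a quick check that $G$ is indeed still stochastic:
G @ one_vector  # each row of G sums to one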
The probability left-eigenvector represents the (asymptotic) proportion of time our random web surfer spends on each webpage. We can compute it using the power method. By our theorem on the power method (in the Eigenvalues and Eigenvectors notebook), we can start with any probability row vector $x$ and $x G^k$ will converge to the left-eigenvector with eigenvalue one. Since probability vectors are preserved under right-multiplication by $G$, the limiting eigenvector will still be a probability vector (at least up to numerical errors).
We'll start with $x$ the uniform probability vector and iterate. We'll iterate until each entry of $x_k - x_{k-1}$ has absolute value less than $\epsilon$. We'll take $\epsilon=10^{-4}$ here.
epsilon = 10**-4
k = 1
x_old = v            # x_old will always store x_{k-1}
x_k = x_old @ G
max_difference = (abs(x_k - x_old)).max()
while max_difference >= epsilon:
    k += 1
    x_old = x_k
    x_k = x_old @ G
    max_difference = (abs(x_k - x_old)).max()
print("x_{} = {}".format(k, x_k))
Let's check that x_k is close to an eigenvector with eigenvalue $1$:
x_k @ G - x_k
All entries are small, as we hoped. Also, $x_k$ should be a probability vector:
x_k @ one_vector
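As an independent check (feasible here only because the example is tiny), we can also compute the probability left-eigenvector directly with a dense eigensolver and compare it to x_k. This is just a cross-check, not part of the power method:
# Left eigenvectors of G are right eigenvectors of the transpose G.T.
eigenvalues, eigenvectors = np.linalg.eig(G.T)
index = np.argmin(np.abs(eigenvalues - 1))  # locate the eigenvalue closest to 1
w = np.real(eigenvectors[:, index])
w = w / w.sum()                             # rescale to a probability vector
print(w)
print(x_k)  # should agree with w to roughly epsilon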
I used Scrapy (a Python package for designing web crawlers) to crawl my own website and build a graph based on how pages are linked within my site. (I found this Stack Overflow post helpful in writing the code to do this. You may also need to read some of the documentation for Scrapy.)
Remark. There are potential issues with crawling a website, which is why I decided to use my own website. In principle, you should respect robots.txt files and be conscientious of the burden a crawler might place on a website.
I processed the data produced by Scrapy and then stored the results in the file my_website_data.py. You need to download this file and save it in the directory containing this notebook before running the following import command:
from my_website_data import urls, graph_data
The variable urls stores a list of URLs for webpages on my site. You can see the number of URLs using:
len(urls)
I think the order is related to the order in which the pages were crawled. You can see the first $10$ webpages using:
urls[:10]
This order is a bit weird. For example, urls[24] stores my root webpage:
urls[24]
These are webpage_0 through webpage_9. The variable graph_data stores the link data. If $i \in \{0, 1, \ldots, 2085\}$, then graph_data[i] will return the list of indices of the webpages linked to by webpage_i. So, for example:
graph_data[0]
This indicates that webpage_0 (which is http://wphooper.com/docs/) links to itself, webpage_24 (my root webpage), webpage_25, webpage_1733, webpage_1735, and webpage_1736. For example, webpage_1736 is:
urls[1736]
We want to construct the hyperlink matrix $H$. We don't want to work with a $2086 \times 2086$ dense matrix, so instead we use sparse matrices. Sparse matrices are supported in the scipy software package. We will make use of two classes: lil_matrix, for constructing sparse matrices incrementally (which is how we build our matrices below), and csr_matrix, for efficient matrix algebra such as matrix-vector multiplication.
from scipy.sparse import csr_matrix, lil_matrix
The following creates a $2086 \times 2086$ sparse matrix of all zeros.
H = lil_matrix( (len(urls), len(urls)) )
for i in range( len(urls) ):
    links = graph_data[i]   # indices of the pages that webpage i links to
    l = len(links)
    for j in links:
        H[i, j] = 1/l       # each outgoing link gets equal weight
H
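It is worth pausing to see how sparse $H$ actually is. Every scipy sparse matrix reports its number of stored non-zero entries through the nnz attribute:
print("non-zero entries:", H.nnz)
print("average non-zero entries per row:", H.nnz / H.shape[0])
The average should be far smaller than $2086$, which is what makes the sparse representation worthwhile.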
There are some dangling nodes, so we'll implement the backlink fix. The pages linking to webpage_i are precisely the rows with a non-zero entry in column $i$ of $H$.
S = H.copy()
for i in range( len(urls) ):
    links = graph_data[i]
    if len(links) == 0:
        # The following command gets a list of the rows containing a non-zero
        # entry in column i, i.e., the pages that link to webpage i.
        non_zero_rows = H[:, i].nonzero()[0]
        for row in non_zero_rows:  # iterate through the pages linking to webpage i
            S[i, row] = 1 / len(non_zero_rows)
S
So that we can do efficient algebra with $S$, we switch formats to csr_matrix.
S = csr_matrix(S)
S
# Check that all rows sum to one:
for i in range( len(urls) ):
    row_sum = S[i].sum()
    assert abs(row_sum - 1) < 10**-14, "Error in row {}: sums to {}".format(i, row_sum)
Again our google matrix is $$G = \alpha S + (1-\alpha) {\mathbf 1} v,$$ with $\alpha=0.85$, ${\mathbf 1}$ the column vector of all ones, and $v$ the uniform row probability vector. To implement our algorithm, we need to be able to repeatedly compute $xG$ where $x$ is a row probability vector. Simplifying, $$x G = \alpha xS + (1-\alpha) x {\mathbf 1} v.$$ Note that since $x$ is a probability vector, $x {\mathbf 1}=1$. So, this simplifies to $$x G = \alpha xS + (1-\alpha) v.$$ Thus we don't need to store ${\mathbf 1}$. Also this formula is fast because $S$ is a sparse matrix.
Since $v$ is the uniform row probability vector, $(1-\alpha) v$ is the row vector all of whose entries are $\frac{1-\alpha}{n}$, where $n=2086$ is the number of webpages. So, we can obtain $xG$ from $xS$ by multiplying all entries by $\alpha$ and then adding $\frac{1-\alpha}{n}$.
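If you prefer, this formula can be packaged as a small helper function. The sketch below (the name apply_google_matrix is my own) computes the same expression that the iteration loop below inlines:
def apply_google_matrix(x, S, v, alpha):
    """Return x G = alpha * (x S) + (1 - alpha) * v for a probability row vector x."""
    return alpha * (x @ S) + (1 - alpha) * v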
alpha = 0.85
v = np.array( [len(urls) * [1/len(urls)]] )
v
We will also use $v$ as our starting vector for the iteration. Below we repeat the algorithm used above for our simple example graph. (We decreased epsilon because the entries now average about $\frac{1}{2086}$ in size.)
epsilon = 10**-6
k = 1
x_old = v            # x_old will always store x_{k-1}
x_k = alpha*x_old@S + (1-alpha)*v
max_difference = (abs(x_k - x_old)).max()
while max_difference >= epsilon:
    k += 1
    x_old = x_k
    x_k = alpha*x_old@S + (1-alpha)*v
    max_difference = (abs(x_k - x_old)).max()
print("x_{} = {}".format(k, x_k))
We'd like to order these entries from largest to smallest to see which pages PageRank thinks are the most important. First we create a list of pairs (url, probability):
pageranking = []
for i in range(len(urls)):
    url = urls[i]
    probability = x_k[0, i]
    pageranking.append( (url, probability) )
# Print the first 10 entries:
pageranking[:10]
Now we want to sort the data (a list of pairs) by the number in position $1$ of each pair, in decreasing order. We can do this with the following commands (see Python's Sorting HOWTO):
from operator import itemgetter
sorted_ranking = sorted(pageranking, key = itemgetter(1), reverse=True)
sorted_ranking[:10]
My root webpage shows up in $1$st place, but $2$nd place goes to software API documentation created by an automatic script that documents code. (Such documentation creates a large collection of webpages in which almost all links from pages in the collection go to other pages in the collection. This structure seems to allow a large amount of PageRank to accumulate.)
It seems surprising that our course calendar appears in $6$th place. This seems to be because there are a lot of links on that page to dangling nodes, and our dangling node fix seems to unfairly reward pages that link to many dangling nodes.
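If you want to probe this further, here is a quick way to count the dangling nodes in the crawl data (an ad hoc check using graph_data from above):
dangling = [i for i in range(len(urls)) if len(graph_data[i]) == 0]
print(len(dangling), "dangling nodes out of", len(urls))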
Anyway, this wasn't meant to be a serious investigation of PageRank, but an introduction to the ideas behind it.