5002 Web Database Memo
Authority weight (in-degree) Hub weight (out-degree)
- A good authority has many edges from good hubs
- A good hub has many edges from good authorities
Hit Algorithm
Two steps:
- Sampling Step
- Iteration Step
sampling step
Given a user query with several terms, we collect a set of pages that are very relevant – called the base set.
- We get all the web page that contain any query terms. This is called root set.
- We find the link pages, which are linked by or link to web pages of root set.
- Base set = link pages & root set
Iteration Step
We know the hub weight of a web page is given by: $h(N) = a(N) + a(MS) + a(A)$ etc.. so we can get: $\vec{h} = [{h(N), h(MS), h(A)}]$ $\vec{a} = [{a(N), a(MS), a(A)}]$ $\vec{h} = M\vec{a}$ Here M represents adjacent matrix.
Similarly, we get $\vec{a} = M^T\vec{h}$ So we get: $\vec{h} = MM^T\vec{h}$ $\vec{a} = M^TM\vec{a}$
How to interation?
- We set all the pages with hub weight = 1 at first.
- And we do $\vec{h} = MM^T\vec{h}$, to get new vec h.
- And we normalize it to make the sum of all the element in the vector to be equal to the number of elements. For example, normalize to 3.
Recommendation
- We can rank base on h weight
- We can rank base on a weight
- We can rank base on a combination of h & a weight (for example, sum)
Pagerank Algorithm
Comparison with HIT algorithm
- Disadvantage of hit: It can be hard to determine to use a or h.
- Advantage of Pagerank: only one concept for ranking
Ideas
Stochastic Matrix
- column-wise Matrix, column is decided by the out-link of a node. And the sum of the column elements equal to 1.
Like: N -> N N -> A A -> N A -> M M -> A
We can get: ![[Pasted image 20250515235758.png]]
Iteration
After we get the stochastic matrix
- Initial the vec r to be [1, …, 1].
- $\vec{r} = M \vec{r}$
- so on so forth We need to understand that it’s actually the weighted sum of columns, so the sum of elements equal to the num of elements. ![[Pasted image 20250516000110.png]]
Spider Trap
Definition: A group of one or more pages that have no links out of the group will eventually accumulate all the importance of the web.
How to avoid it? $\vec{r} = 0.8W\vec{r} + \vec{c}$ $\vec{c} = \begin{bmatrix} 0.2 \ 0.2 \ 0.2\end{bmatrix}$