Understanding Larry Page's PageRank Paper and Its Impact


Larry Page's PageRank paper changed the way we search online, and it still influences the internet today.

The paper, titled "The Anatomy of a Large-Scale Hypertextual Web Search Engine", was written by Larry Page and Sergey Brin in 1998.

This paper introduced the concept of PageRank, a system for ranking web pages based on their importance.

PageRank works by analyzing the number and quality of links pointing to a page, with more important pages passing on more "rank" to their linked pages.

The algorithm was designed to combat spam and provide more accurate search results.

Google's PageRank Algorithm

Google's PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. This algorithm can be calculated for collections of documents of any size.

A probability is expressed as a numeric value between 0 and 1, with a 0.5 probability being a 50% chance of something happening. A document with a PageRank of 0.5 means there's a 50% chance that a person clicking on a random link will be directed to that document.

Google's founders reported that the PageRank algorithm for a network consisting of 322 million links converges to within a tolerable limit in 52 iterations. The convergence in a network of half the above size took approximately 45 iterations.

Algorithm


PageRank values are computed in several passes, called "iterations", through the collection. Each pass adjusts the approximate PageRank values so that they more closely reflect the theoretical true value.

The algorithm converges quickly even on extremely large networks. From their measurements (52 iterations for a network of 322 million links, about 45 for a network half that size), Google's founders concluded that the number of iterations needed grows roughly linearly in log n, where n is the size of the network.

Google Directory


Google Directory displayed a green bar showing a website's PageRank on an 8-unit scale, but never the actual numeric value.

This differed from the Google Toolbar, which showed a numeric PageRank value when the user moused over the green bar.

Google Directory was closed on July 20, 2011.

Usage

PageRank is still used by Google, albeit in a modified form. The score is believed to remain available to search engineers within Google, even though public access was removed in 2016.

Google engineers have suggested that the original form of PageRank was replaced with a new approximation that requires less processing power to calculate. PageRank also carries less weight in how Google ranks pages than it once did.

Even so, a PageRank score is likely still maintained for each web page and embedded in many of Google's systems to this day.

The Google Dance

The Google Dance was a real challenge for SEO pros back in the day. It was a phenomenon in which Google's search engine results pages (SERPs) moved up and down while PageRank was being recalculated.


The math behind PageRank was simple, but it needed to be iteratively processed, running multiple times over every page and every link on the Internet. This process took several days to complete.

The Google Dance was notorious for stopping SEO pros in their tracks every time Google started its monthly update. The erratic changes to the SERPs made it difficult to plan and execute SEO strategies.

The Google Dance was so infamous that it later became the name of an annual party that Google ran for SEO experts at its headquarters in Mountain View.

The Retreat

Google's confidence in its algorithm didn't last. Its internal belief that PageRank was "unspam-able" was shattered as the backlink industry grew.

The company started to rely less on PageRank and more on other methods to index the world's information. This shift was helped by the 2010 purchase of Metaweb and its Freebase database, which fed into Google's Knowledge Graph.


Google continued to rely on PageRank in its ranking algorithms, but the score was no longer visible to the public. The PageRank Toolbar was withdrawn by 2016, ending public access to PageRank.

By this time, SEO professionals had already found ways to correlate their own calculations with PageRank, thanks to tools like Majestic.

How PageRank Works

PageRank is an algorithm that estimates the importance of a webpage on the internet. Initially, every page is given an estimated PageRank score, which can be any number.

The formula for PageRank involves dividing the current PageRank by the number of links out of the page, resulting in a smaller fraction. This fraction is then distributed to the linked pages.

The PageRank is then updated by summing up the fractions of pages that link into each given page. This process is repeated until the PageRank scores reach a settled equilibrium.


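The damping factor is also a crucial part of the formula. It represents the probability that a person keeps following links, so its complement is the chance that they stop surfing and start again at a random page. It is commonly set around 0.85, and it scales down the proposed new PageRank in each iteration.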

The formula can be expressed mathematically as follows:

PR_i = (1 - d) / n + d * ∑_j (PR_j / C_j)

Where:

  • PR_i = PageRank of page i in the next iteration of the algorithm.
  • d = damping factor.
  • n = total number of pages on the Internet.
  • j = each page that links to page i.
  • PR_j = current PageRank of page j.
  • C_j = number of links out of page j.

This formula is the foundation of the PageRank algorithm, which has been instrumental in shaping the way we search and navigate the internet.
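The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than Google's implementation: the function name `pagerank`, the four-page `toy_web` graph, the damping factor of 0.85, and the stopping tolerance are all assumptions made for the example, and every page is assumed to have at least one outbound link.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Repeat PR = (1 - d) / n + d * sum(PR_j / C_j) until the scores settle.

    links maps each page to the list of pages it links to. Every page is
    assumed (for this sketch) to have at least one outbound link.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # initial estimate for every page
    for _ in range(max_iter):
        # Every page starts the round with the baseline (1 - d) / n.
        new_pr = {p: (1 - d) / n for p in pages}
        for j, outlinks in links.items():
            share = pr[j] / len(outlinks)  # PR_j / C_j
            for target in outlinks:
                new_pr[target] += d * share
        change = sum(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if change < tol:  # scores have reached a settled equilibrium
            break
    return pr


# Hypothetical four-page web used only for illustration.
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

On this toy graph the scores sum to 1. Page C, which receives links from three pages, ends up with the highest score, while D, which nothing links to, keeps only the baseline (1 - d) / n.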

PageRank Computation

PageRank computation can be done either iteratively or algebraically. The iterative method is essentially the power iteration method or power method, which involves a series of mathematical operations.

The basic operations performed in the iterative method include matrix multiplication and vector addition. The matrix A denotes the adjacency matrix of the graph, and K is the diagonal matrix with the outdegrees in the diagonal.

The power method can be used to compute the principal eigenvector of the matrix $\widehat{\mathcal{M}}$, which is the PageRank vector R. The method starts with an arbitrary vector x(0) and repeatedly applies the operator $\widehat{\mathcal{M}}$.


The power method converges to the principal eigenvector in a number of iterations that is roughly linear in $\log n$, where n is the size of the network. For example, Google's founders reported that the PageRank algorithm for a network of 322 million links converged to within a tolerable limit in 52 iterations.
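As a rough sketch of the power method, the Python below builds M = (K^-1 A)^T from an adjacency matrix, forms the damped matrix d * M + (1 - d) / n * E, and applies it repeatedly to a starting vector. E is the all-ones matrix; this is the usual textbook construction and an assumption here, since the article does not spell it out. The function name `power_method`, the tolerance `eps`, and the four-page example matrix are hypothetical.

```python
def power_method(adjacency, d=0.85, eps=1e-9, max_iter=1000):
    """Return (rank_vector, iterations) for a 0/1 adjacency matrix.

    adjacency[i][j] == 1 means page i links to page j. Each row is assumed
    to contain at least one 1 (no pages without outbound links).
    """
    n = len(adjacency)
    outdeg = [sum(row) for row in adjacency]  # the diagonal of K
    # M = (K^-1 A)^T, so m[i][j] is the chance of moving from page j to page i.
    m = [[adjacency[j][i] / outdeg[j] for j in range(n)] for i in range(n)]
    # Damped matrix: d * M + (1 - d) / n * E, with E the all-ones matrix.
    m_hat = [[d * m[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]
    x = [1.0 / n] * n  # starting vector x(0); any vector summing to 1 works
    for t in range(1, max_iter + 1):
        x_next = [sum(m_hat[i][j] * x[j] for j in range(n)) for i in range(n)]
        if sum(abs(a - b) for a, b in zip(x_next, x)) < eps:
            return x_next, t
        x = x_next
    return x, max_iter


# Pages 0..3 stand for A, B, C, D: A links to B and C, B to C, C to A, D to C.
adjacency = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
ranks, iterations = power_method(adjacency)
print([round(r, 4) for r in ranks], iterations)
```

Because every column of the damped matrix sums to 1, the vector stays a probability distribution at every step, which is why the result can be read as the chance of a random surfer being on each page.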

A distributed algorithm for PageRank computation has been described by Sarma et al. This algorithm takes $O(\log n / \epsilon)$ rounds with high probability on any graph, where n is the network size and $\epsilon$ is the reset probability.

The algorithm can be used to compute PageRank of nodes in a network. Each node processes and sends a number of bits per round that are polylogarithmic in n, the network size.
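The code below is not the distributed algorithm itself. It is a single-machine sketch of the random-walk idea that such approaches build on: start many short walks from every node, end each walk with the reset probability at every step, and use the share of visits each page receives as its estimated PageRank. The function name, the number of walks per node, the fixed seed, and the toy graph are assumptions for illustration.

```python
import random


def random_walk_pagerank(links, reset_prob=0.15, walks_per_node=2000, seed=0):
    """Estimate PageRank from visit counts of many short random walks.

    links maps each page to the pages it links to (each list is assumed
    non-empty). reset_prob plays the role of 1 - d in the usual formula.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    visits = {page: 0 for page in links}
    for start in links:
        for _ in range(walks_per_node):
            page = start
            while True:
                visits[page] += 1
                if rng.random() < reset_prob:  # the walk ends here
                    break
                page = rng.choice(links[page])  # follow a random outbound link
    total = sum(visits.values())
    return {page: count / total for page, count in visits.items()}


# Hypothetical four-page web used only for illustration.
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
estimate = random_walk_pagerank(toy_web)
print({page: round(score, 3) for page, score in estimate.items()})
```

The estimates are noisy, but with a reset probability of 0.15 they should land close to what the iterative formula gives with a damping factor of 0.85.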


PageRank Models


PageRank Models are a crucial part of understanding how web pages are ranked. There are different models, but one of them is the Directed Surfer Model.

This model is based on a query-dependent PageRank score, meaning the score is a function of the query as well as the page. Given a multi-term query, the surfer selects one term according to some probability distribution and uses that term to guide their behavior for a large number of steps, then selects another term, and so on.

The resulting distribution over visited web pages is called QD-PageRank.
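One way to picture this is to compute a separate PageRank for each query term, with jumps and link choices biased toward pages relevant to that term, and then mix the per-term scores using the term probabilities. The sketch below follows that reading. It is an assumed formulation, not a confirmed account of any production system, and the function names, relevance scores, and term probabilities are all hypothetical.

```python
def term_pagerank(links, relevance, d=0.85, iterations=100):
    """PageRank for one query term, biased toward relevant pages.

    relevance maps page -> positive relevance score for the term
    (hypothetical numbers; a real system would derive them from content).
    """
    pages = list(links)
    total_rel = sum(relevance[p] for p in pages)
    jump = {p: relevance[p] / total_rel for p in pages}  # biased random jump
    pr = dict(jump)
    for _ in range(iterations):
        new_pr = {p: (1 - d) * jump[p] for p in pages}
        for j, outlinks in links.items():
            out_rel = sum(relevance[t] for t in outlinks)
            for t in outlinks:
                # Outbound links are followed in proportion to relevance.
                new_pr[t] += d * pr[j] * relevance[t] / out_rel
        pr = new_pr
    return pr


def qd_pagerank(links, term_relevance, term_probs):
    """Mix per-term scores using the probability of picking each term."""
    combined = {p: 0.0 for p in links}
    for term, prob in term_probs.items():
        scores = term_pagerank(links, term_relevance[term])
        for p in links:
            combined[p] += prob * scores[p]
    return combined


toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
term_relevance = {
    "search": {"A": 1, "B": 5, "C": 1, "D": 1},
    "engine": {"A": 1, "B": 1, "C": 1, "D": 5},
}
term_probs = {"search": 0.5, "engine": 0.5}
qd_scores = qd_pagerank(toy_web, term_relevance, term_probs)
print({page: round(score, 4) for page, score in qd_scores.items()})
```

When every page has the same relevance, `term_pagerank` reduces to the ordinary query-independent score, which is a useful sanity check on the sketch.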

PageRank in Google

The Google Directory displayed PageRank as a bar on an 8-unit scale and never showed numeric values. Google Directory itself was closed on July 20, 2011.

The bar was only a public display of the score. The ranking algorithm behind it was what set Google apart from other search engines at the time.

SEO pros experienced the Google Dance every time Google started its monthly update. The math behind PageRank had to be processed iteratively, which took several days and caused the Google SERPs to move up and down erratically.

How Search Revolutionized


Google's PageRank algorithm was revolutionary because it analyzed the links between pages, not just the content on each page individually. This approach helped identify influential pages and avoid those with manipulative text.

Other search engines of the time relied heavily on analyzing content, which made their results easy for SEO pros to manipulate and their retrieval methods unreliable.

Google's combination of PageRank and nGrams helped establish a winning formula. nGrams, a relatively simple concept, played a crucial role in establishing relevancy.

Google soon overtook AltaVista and Inktomi, which powered MSN among others. This was a significant shift in the search landscape.

Google's page-level approach proved more scalable than the directory-based approach used by Yahoo and DMOZ.

Toolbar vs.

The Toolbar PageRank was a score between 0 and 10 for every page on the Internet, making it easy to assess the importance of any page.

This visibility came with complications, as it was clear that links were the easiest way to "game" Google.


The more links, or better links, the better a page could rank in Google's SERPs for any targeted keyword.

A secondary market formed, buying and selling links valued on the PageRank of the URL where the link was sold.

Yahoo launched a free tool called Yahoo Site Explorer, allowing anyone to find links into any given page.

Moz and Majestic built on this free option by creating their own indexes and evaluating links separately.


Web Authorities


Google's algorithm uses PageRank to determine the importance of a webpage. It is based on the idea that a page is more likely to be important if many other important pages link to it.

PageRank is a mathematical formula that analyzes the link structure of the web and assigns a score to each webpage based on the number and quality of links pointing to it. The score is then used to help rank pages in search engine results.

Larry Page and Sergey Brin, the founders of Google, developed PageRank as a way to improve the accuracy of their search engine. They wanted to create a system that could identify the most relevant and trustworthy sources of information.

The formula includes a "damping factor", which models the chance that a surfer stops following links and jumps to a random page. This keeps every page's score above zero and stops rank from piling up in closed loops of pages.


Larry Page and Sergey Brin also built an index to store and organize web pages. The index lets Google quickly retrieve and rank pages based on their relevance and importance.

PageRank is an important factor in Google's algorithm, but it's not the only one. Google also uses other factors such as keyword usage, content quality, and user experience to determine the relevance and importance of a webpage.

PageRank Patents and Research

PageRank patents and research provide a solid foundation for understanding the algorithm's development.

The original PageRank U.S. patent, Method for node ranking in a linked database, was filed in January 1998 and granted on September 4, 2001, as patent number 6,285,999.

Research papers like The PageRank Citation Ranking: Bringing Order to the Web and The Anatomy of a Large-Scale Hypertextual Web Search Engine offer in-depth insights into the PageRank algorithm.

These resources are essential for anyone looking to dive deeper into the world of PageRank.

Here are some key patents and papers to get you started:

  • Method for node ranking in a linked database (Patent number 6,285,999)
  • The PageRank Citation Ranking: Bringing Order to the Web
  • The Anatomy of a Large-Scale Hypertextual Web Search Engine

Manipulating


Google has publicly warned webmasters that selling links for the purpose of conferring PageRank and reputation will result in devalued links. This means that if you're caught trying to game the system, your links won't be worth much.

The practice of buying and selling links is a hotly debated topic among webmasters. Google has advised webmasters to use the nofollow HTML attribute value on paid links.

In 2019, Google introduced new rel attribute values that don't pass PageRank: rel="ugc" for user-generated content and rel="sponsored" for advertisements or sponsored content.
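To see which links on a page carry these values, a short script using Python's standard `html.parser` module can list each link and whether it is marked with nofollow, ugc, or sponsored. This is only a sketch: the class name `LinkAuditor`, the sample HTML, and the simple rule that any of the three values means "does not pass PageRank" are assumptions based on the description above, not a model of how Google actually weighs links.

```python
from html.parser import HTMLParser

# rel values described above as not passing PageRank.
NO_PAGERANK_VALUES = {"nofollow", "ugc", "sponsored"}


class LinkAuditor(HTMLParser):
    """Collect (href, passes_rank) pairs for every <a> tag with an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel may hold several space-separated values, e.g. "ugc nofollow".
        rel_tokens = set((attr_map.get("rel") or "").lower().split())
        passes_rank = not (rel_tokens & NO_PAGERANK_VALUES)
        self.links.append((href, passes_rank))


# Hypothetical page fragment used only for illustration.
sample_html = """
<a href="https://example.com/docs">Docs</a>
<a href="https://example.com/ad" rel="sponsored">Ad</a>
<a href="https://example.com/comment" rel="ugc nofollow">Comment</a>
"""
auditor = LinkAuditor()
auditor.feed(sample_html)
for href, passes_rank in auditor.links:
    print(href, "passes rank" if passes_rank else "marked as not passing rank")
```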

Buying high PageRank links can be an effective marketing strategy, but it's not without its risks.

Scientific Research

PageRank has been widely used in scientific research to quantify the impact of researchers. It's used to create a ranking system for individual publications, which then propagates to individual authors.

This ranking system is known as the pagerank-index (Pi), and it's been shown to be fairer compared to the h-index. The h-index has several drawbacks, making Pi a more reliable measure.


PageRank is also a useful tool in the analysis of protein networks in biology. It helps identify key proteins and their interactions, which can lead to new discoveries.

In ecosystems, a modified version of PageRank can be used to determine species that are essential to the environment. This is done by analyzing the relationships between species and their impact on the ecosystem.

A newer use of PageRank is to rank academic doctoral programs based on their records of placing graduates in faculty positions. This is done by analyzing the connections between departments and their hiring practices.

PageRank has also been proposed as a replacement for the traditional Institute for Scientific Information (ISI) impact factor. This is implemented by Eigenfactor and SCImago, which take into account the "importance" of each citation.

Here's a list of some of the ways PageRank is being used in scientific research:

  • Quantifying the scientific impact of researchers
  • Ranking academic doctoral programs
  • Identifying key species in ecosystems
  • Analyzing protein networks in biology
  • Replacing the traditional ISI impact factor

PageRank and SEO

PageRank matters for search engine optimization (SEO) because it is one of the inputs to the search engine results page (SERP) rank of a web page. The SERP rank is determined by a combination of factors, including relevance, reputation, authority, and popularity.


The PageRank of a webpage is non-keyword specific, and it's an indication of Google's assessment of the reputation of a webpage. However, numerous other factors now affect ranking a business in Local Business Results, especially after the introduction of Google Places.

Google uses a combination of webpage and website authority to determine the overall authority of a webpage competing for a keyword, with the PageRank of the HomePage of a website being the best indication of website authority.

SERP Rank

The SERP rank of a web page refers to its placement on the search engine results page, where higher placement means higher SERP rank. This ranking is a function not only of PageRank but of a large and continuously adjusted set of factors, reportedly more than 200.

Positioning of a webpage on Google SERPs for a keyword depends on relevance and reputation, also known as authority and popularity. PageRank is Google's indication of its assessment of a webpage's reputation.

Nofollow


In early 2005, Google implemented the "nofollow" value for the rel attribute of HTML link and anchor elements to combat spamdexing.

The nofollow relationship was added to prevent websites from artificially inflating their PageRank by creating many message-board posts with links to their website.

Webmasters can use the nofollow attribute to manually control the flow of PageRank among pages within a website, a tactic known as PageRank Sculpting.

This tactic involves strategically placing the nofollow attribute on certain internal links to funnel PageRank towards those pages deemed most important.

However, Google announced that blocking PageRank transfer with nofollow does not redirect that PageRank to other links, making this tactic potentially less effective.
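A small worked example makes the difference concrete. The sketch below compares the rank passed through each followed link under two assumptions: that nofollowed links hand their share to the remaining links (what sculpting relied on), and that the nofollowed share is simply not passed on (one simplified reading of Google's announcement). The function name and the numbers are hypothetical, and the damping factor is left out to keep the arithmetic simple.

```python
def rank_per_followed_link(page_rank, total_links, nofollow_links):
    """Rank passed through each followed link under two assumptions."""
    followed = total_links - nofollow_links
    if total_links <= 0 or followed <= 0:
        raise ValueError("the page needs at least one followed link")
    return {
        # Sculpting assumption: only followed links share the page's rank.
        "if_rank_were_redirected": page_rank / followed,
        # Announcement as read here: rank is split across all links, and
        # the share of the nofollowed links is not handed to the others.
        "as_google_described": page_rank / total_links,
    }


# A page with 10 links, 5 of them nofollowed (illustrative numbers).
shares = rank_per_followed_link(page_rank=1.0, total_links=10, nofollow_links=5)
print(shares)
```

Under the second reading, adding nofollow to half the links does not raise what each remaining link receives, which is why the tactic lost its appeal.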

Jeannie Larson

Senior Assigning Editor

Jeannie Larson is a seasoned Assigning Editor with a keen eye for compelling content. With a passion for storytelling, she has curated articles on a wide range of topics, from technology to lifestyle. Jeannie's expertise lies in assigning and editing articles that resonate with diverse audiences.
