Research & Development

We Bid Goodbye to Alexa Rankings, and Measure Its Contribution to the Tranco List (Pre-May)

Posted 1 month ago   |   9 min read
Rocky Moss
Chief Executive Officer
Edward Krueger
Chief Data Scientist

Introduction

In this post, we’ll examine whether a Tranco List without Alexa Rank data would be suitable for marketers looking for a new source of site performance data.

Would they find the switchover jarring? Would the sites they saw on Alexa Rank lists still be present? If so, what sort of consistency could they expect with respect to how sites rank relative to each other?

We’ll also examine whether there are alternative data sources that can fill in the gaps between Alexa and Tranco.

For Marketers Unconcerned With the Technical Findings

If the technical & academic research doesn’t appeal to you, you can head to the Conclusions section, where there are open questions you can help contribute answers to!

Saying Goodbye to Alexa Ranks

On May 1st, 2022, Alexa.com retired its API and web lookup tools. Most importantly, it stopped providing Alexa Rank reports, which were commonly used as a measure of website popularity.

The methodology for creating that rank is described at a high level on the Alexa.com blog, which is no longer up (a cached copy is linked):

“Alexa rank is calculated using a proprietary methodology that combines a site’s estimated traffic and visitor engagement over the past three months. Traffic and engagement are estimated from the browsing behavior of people in our global panel, which is a sample of all Internet users.”

-“What is Alexa Rank?” from the Alexa.com blog

For many years, researchers and marketers from a variety of disciplines relied on Alexa.com’s site rank data to inform them of trends and to identify opportunities. Its recommended uses included:

  • Evaluating a site’s commercial potential
  • Checking to see if a site’s traffic is rising or falling
  • Finding potential affiliates

You should consider checking out this post (“7 Ways to Use Alexa Rankings to Grow Your Business”) for more insight into how marketers might have used this data to grow their businesses.

Fraud researchers have an additional concern: to prepare for the next exploit, it’s very important for us to understand what sites people actually visit.

What is the Tranco List, and Why do we Use It?

At deepsee.io, we profile the behaviors of websites, and the relations they have with each other, but we can’t estimate how many people visit a site from this experiential approach. For that reason, one of the few external data sources we consume is the Tranco List, “A Research-Oriented Top Sites Ranking Hardened Against Manipulation”.

It combines several ranking sources in order to create a single list that gives researchers a wider view of the websites people visit.

Prior to May 1st, 2022, the list was composed of the following 3 data sources:

  • Alexa.com: based on page visits reported by a user panel and on a tracking script
  • Majestic Million: a link-based ranking system
  • Cisco Umbrella: based on the number of IPs requesting a certain domain, sourced from Cisco’s database of DNS traffic

List Generation Methodology

We compared three lists that were generated using the Tranco API on April 5th, 2022. This API allows users to choose the data sources they incorporate, which was perfect for the purposes of our analysis. We are providing links to the analysis datasets for transparency:

Note: Tranco recently announced that Domaintools would be allowing them to incorporate their Farsight ranking system into the Tranco List. This means that there won’t be a “Tranco Without Alexa” (TWA) list exactly like the one we use for forecasting purposes in this article.

The Tranco List will include Alexa.com data as available (once unavailable, it will decay off their list over 30 days), and is now additionally informed by Farsight passive DNS data. Still, we see the analysis of the Tranco List’s makeup pre-May as informative to those trying to understand the scope and scale of various website ranking systems. We will conduct additional analysis of the list that includes Farsight data once a sufficient sample of data is available.

Results of Our Analysis

Huge Portions of the List Are Sourced Exclusively from Alexa Data

Of the 7,552,154 unique domains on the “normal” Tranco list, nearly 81% appeared only on the Alexa list; they were not present on either the Cisco Umbrella or Majestic Million lists.

Even though each source list contains at most 1,000,000 entries per day, over 30 days we saw many more unique entries on the Alexa list. This suggests that the Alexa list is more volatile than the Cisco Umbrella & Majestic Million lists, but also that it gives us a wider view of the web’s surface.
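As a rough illustration of how a "which sources contributed this domain" breakdown can be computed, here is a minimal sketch. The function name `source_breakdown` and the toy domains are hypothetical and are not taken from our analysis code; it only assumes that each source can be represented as a set of domain names.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def source_breakdown(sources: Dict[str, Set[str]]) -> Counter:
    """Count domains by the exact combination of sources they appear in."""
    membership: Dict[str, List[str]] = {}
    for name, domains in sources.items():
        for domain in domains:
            membership.setdefault(domain, []).append(name)
    # Key each domain by the sorted tuple of source names that list it.
    combos: List[Tuple[str, ...]] = [
        tuple(sorted(names)) for names in membership.values()
    ]
    return Counter(combos)


# Toy data only; these are not real list contents.
sources = {
    "alexa": {"a.com", "b.com", "c.com", "d.com"},
    "umbrella": {"a.com", "e.com"},
    "majestic": {"a.com", "b.com"},
}

breakdown = source_breakdown(sources)
total_domains = sum(breakdown.values())
alexa_only_share = breakdown[("alexa",)] / total_domains
print(f"Alexa-only share: {alexa_only_share:.0%}")
```

In this toy data, two of the five unique domains appear only in the "alexa" set, so the Alexa-only share is 40%.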

Interesting Exceptions

We also repeated this analysis for various subsets of our data, and found some interesting exceptions:

In contrast to the list at large, this subset of 375,277 domains is much more likely to be found across multiple data sources. Given that the Majestic Million list is primarily based on backlinks, this result could be expected.

Sites Loading the Google Publisher Tag (GPT) Script

Of the 118,132 unique domains we crawled that loaded the GPT.js script, the majority are sourced from multiple lists. This makes the sites with GPT distinct from the general pool of “sites with ads”, which didn’t look much different from the overall figure (it’s not an outlier, which is why that chart is not included here).

The Same Sites Are Ranked Quite Differently Between Alexa and TWA

~96% of the Alexa top 10,000 would be present somewhere within the Tranco list without Alexa (TWA). However, they are scattered far and wide rather than concentrated at the top. To find 9,500 of the top 10k ranked sites from Alexa.com data, you’d have to expand your search to the top 1,330,000 domains in the TWA list.
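The "how deep do you have to search" question can be sketched as a single pass over the second list. This is a simplified illustration, not our production code; `depth_to_cover` is a hypothetical helper, and it assumes the ranked list is ordered best-first with no duplicate domains.

```python
from typing import List, Optional, Set


def depth_to_cover(reference_top: Set[str],
                   ranked_list: List[str],
                   target: int) -> Optional[int]:
    """Return how far down ranked_list you must go to find `target`
    members of reference_top, or None if the target is never reached."""
    found = 0
    for position, domain in enumerate(ranked_list, start=1):
        if domain in reference_top:
            found += 1
            if found >= target:
                return position
    return None


# Toy data: three reference sites scattered through a six-item list.
alexa_top = {"a.com", "b.com", "c.com"}
twa_list = ["x.com", "a.com", "y.com", "z.com", "b.com", "c.com"]

print(depth_to_cover(alexa_top, twa_list, target=2))  # 5
```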

The rank reshuffling is further demonstrated by the following chart, which shows how much ranks have changed for sites that exist on both the Alexa & TWA top million lists.

For sites that appear on both the Alexa top million and the TWA top million, ~48% of sites saw their rank move 200k or more.
For sites that appear on both the Alexa top million and the TWA top million, ~68% of sites saw their rank move 100k or more.
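A figure like "share of sites whose rank moved by at least X" can be computed along these lines. The sketch below is illustrative only: `share_moved` is a hypothetical name, and it assumes each list is available as a mapping from domain to integer rank.

```python
from typing import Dict


def share_moved(rank_a: Dict[str, int],
                rank_b: Dict[str, int],
                threshold: int) -> float:
    """Among domains present in both rankings, return the fraction whose
    rank differs by at least `threshold` positions."""
    common = rank_a.keys() & rank_b.keys()
    if not common:
        return 0.0
    moved = sum(1 for d in common if abs(rank_a[d] - rank_b[d]) >= threshold)
    return moved / len(common)


# Toy ranks only; d.com and e.com are ignored because each is on one list.
alexa_rank = {"a.com": 1, "b.com": 500, "c.com": 900, "d.com": 10}
twa_rank = {"a.com": 300, "b.com": 520, "c.com": 100, "e.com": 5}

print(share_moved(alexa_rank, twa_rank, threshold=200))
```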

Quantifying Rank Bucket Similarity

The above table shows a comparison of the Alexa top N vs the TWA top N

This section describes the differences in the makeup of top “N” site lists (top 10k, top 100k, etc.). This is a common way that media buyers consider the popularity of sites.

This tells us that only 32% of the Alexa top 10k are in the TWA top 10k, and that number holds steady all the way up to a million.

Clearly, the rank thresholds aren’t transferable; someone targeting the top 10k Alexa rank sites couldn’t just apply the same rank threshold if they were using a Tranco list without Alexa.
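For readers who want to run the same kind of bucket comparison on their own lists, a top-N overlap can be sketched as follows. `top_n_overlap` is a hypothetical helper; it assumes both lists are ordered best-first, and the domains are toy values.

```python
from typing import List


def top_n_overlap(list_a: List[str], list_b: List[str], n: int) -> float:
    """Fraction of list_a's top n entries that also appear in list_b's top n."""
    top_a = set(list_a[:n])
    top_b = set(list_b[:n])
    if not top_a:
        return 0.0
    return len(top_a & top_b) / len(top_a)


# Toy ranked lists, best rank first.
alexa_list = ["a.com", "b.com", "c.com", "d.com", "e.com", "f.com"]
twa_list = ["a.com", "x.com", "b.com", "y.com", "c.com", "d.com"]

for n in (2, 4, 6):
    print(n, top_n_overlap(alexa_list, twa_list, n))
```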

Analyzing Relative Rank Preservation with an Analysis of Site Pairs

Another way to compare two sets of rankings is to check whether pairs of sites keep the same relative order in both lists.

Consider three sites: A, B, and C.
The Alexa ranks for those sites are:

  • A: 100
  • B: 250
  • C: 1,000

The TWA ranks for those sites are:

  • A: 50
  • B: 100
  • C: 10,000

In both lists, A is ranked better than B, and B is ranked better than C, so the two lists agree 100% on the relative order of these 3 sites.

We can determine this by looking at all unique pairs that can be constructed from those 3 sites:

  • A:B – Agreed
    • Alexa: A (100) is ranked better than B (250)
    • TWA: A (50) is ranked better than B (100)
  • B:C – Agreed
    • Alexa: B (250) is ranked better than C (1,000)
    • TWA: B (100) is ranked better than C (10,000)
  • A:C – Agreed
    • Alexa: A (100) is ranked better than C (1,000)
    • TWA: A (50) is ranked better than C (10,000)

3/3 pairs agree, so pairwise similarity is 100%

Now, imagine that we change the TWA rankings for those sites to the following:

  • A: 50
  • B: 100
  • C: 75

The comparison of all unique pairs now changes:

  • A:B – Agreed
    • Alexa: A (100) is ranked better than B (250)
    • TWA: A (50) is ranked better than B (100)
  • B:C – Disagreed
    • Alexa: B (250) is ranked better than C (1,000)
    • TWA: B (100) is ranked worse than C (75)
  • A:C – Agreed
    • Alexa: A (100) is ranked better than C (1,000)
    • TWA: A (50) is ranked better than C (75)

This time, only 2/3 pairs agree, so pairwise similarity is ~66.7%.
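The worked example above can be reproduced with a short sketch. `pairwise_agreement` is a hypothetical name rather than our analysis code; it assumes each list is a mapping from site to rank, that a lower number means a better rank, and that ranks within a list are unique.

```python
from itertools import combinations
from typing import Dict


def pairwise_agreement(rank_a: Dict[str, int], rank_b: Dict[str, int]) -> float:
    """Fraction of site pairs (among sites on both lists) whose relative
    order is the same in both rankings."""
    common = sorted(rank_a.keys() & rank_b.keys())
    if len(common) < 2:
        raise ValueError("need at least two sites present in both rankings")
    agree = 0
    total = 0
    for x, y in combinations(common, 2):
        total += 1
        # The pair agrees when x beats y in both lists or loses in both.
        if (rank_a[x] < rank_a[y]) == (rank_b[x] < rank_b[y]):
            agree += 1
    return agree / total


alexa = {"A": 100, "B": 250, "C": 1_000}
twa_first = {"A": 50, "B": 100, "C": 10_000}
twa_second = {"A": 50, "B": 100, "C": 75}

print(pairwise_agreement(alexa, twa_first))   # 1.0
print(pairwise_agreement(alexa, twa_second))  # 2 of 3 pairs agree
```

This brute-force loop visits every pair, so its cost grows quadratically with the number of shared sites. When there are no tied ranks, the result is a rescaling of Kendall's tau: tau equals two times the agreement fraction, minus one.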

This is the basic logic that fuels the following data points:

  • Of pairs of sites that appear on both the Alexa top 10k and the TWA top 10k, ~61% preserve the same rank order
    • 3,177 sites were on both lists, and every site was compared with every other site as a pair.
      • Using the formula N(N-1)/2 to find the number of unique pairs analyzed, we find that 5,045,076 pairs were analyzed.
  • The same can be said for sites on both the Alexa top 100k and the TWA top 100k; ~61% preserve the same rank order
    • 32,558 sites were on both lists, and every site was compared with every other site as a pair.
      • Using the formula N(N-1)/2 to find the number of unique pairs analyzed, we find that 529,995,403 pairs were analyzed.
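The pair counts follow directly from the N(N-1)/2 formula. As a quick check, here is a minimal sketch (the helper name `unique_pairs` is ours, for illustration) applied to the two overlap sizes reported above:

```python
def unique_pairs(n: int) -> int:
    """Number of unordered pairs that can be formed from n items: n(n-1)/2."""
    return n * (n - 1) // 2


# Three sites (A, B, C) give the three pairs A:B, B:C and A:C.
print(unique_pairs(3))  # 3

# Overlap sizes reported for the top 10k and top 100k comparisons.
for shared_sites in (3_177, 32_558):
    print(f"{shared_sites:,} sites -> {unique_pairs(shared_sites):,} pairs")
```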

Conclusions & Takeaways

The most applicable conclusions affecting marketers:

  • Someone targeting the top 10k Alexa rank sites couldn’t just apply the same rank threshold if they were using a Tranco list without Alexa.
  • Alexa ranks were more volatile than the Cisco Umbrella & Majestic Million ranks, but also gave us a wider view of the web’s surface.
  • For researchers to stay ahead of the next threat, a source of ranking data that accounts for site-visit metrics is likely imperative.

As we mentioned earlier, there won’t be a “Tranco Without Alexa” list exactly like the one we use for forecasting purposes in this article. Due to the way that Tranco is permitted to use Farsight data, it’s not possible to create custom lists including/excluding that dataset. Basically, that means we won’t be able to see exactly how well the Farsight data fills the gaps left by Alexa.com’s exit.

All that said, this study has highlighted to us the importance of including a data source based on page visits reported by a user panel or a tracking script. So much of what’s known about the web’s surface would be lost without the availability of such measurement systems.

Realistically, there are very few companies that can match the breadth of knowledge that Alexa.com had acquired. It makes sense that Amazon, whose ad business is booming, would want to keep those insights close to the chest as a competitive advantage.

It is generous of Domaintools to contribute their Farsight passive DNS dataset to the open-source research efforts of the Tranco team, but it remains to be seen how DNS requests relate to site visits. This is an admitted blind spot, as the Domaintools team puts it:

“[…]since this is only seeing traffic that would go to the internet, if the organization’s nameservers already have a domain cached, that request won’t be seen in this feed.”

-“Mirror, Mirror, on the Wall, Who’s the Fairest (website) of Them all?” by Aaron Gee-Clough, Senior Data Engineer @ Domaintools

It’s an open question as to what datasets can give researchers & marketers the best view into what sites are actually getting visited the most. A few top contenders come to mind:

  • SimilarWeb: data sourced from a multiplicity of useful POVs; likely to achieve most parity with Alexa.com’s methodology
    • Like Alexa.com, this data is partially sourced by a panel of users with their extension installed
    • Voluntary submission of data from publishers who connect their Google Analytics accounts
    • Partnerships with ISPs and DSPs
  • Ahrefs: data is based on clickstream data & backlinks gathered from their massive crawling efforts
    • The organic search data (clickstream) can be extremely helpful in determining the most used sites.
    • The backlinks data may be comparable to the Majestic data, so it could be expected that this slice of their data wouldn’t contribute many new unique sites to our view of the web’s surface.
  • BuzzSumo: data sourced from social media engagements
    • Along with search, social media is a huge engine for content discovery.
    • Someone with a view of what’s being shared & engaged with on social media can contribute much to the way researchers understand the web’s surface.

Obviously, companies like Google & Facebook could contribute a huge amount of knowledge to this topic, but it can’t be expected that these companies would suddenly open up their historically opaque data to the public.

At this point we turn to you, our readers, to help us understand what you consider the most useful measure of site popularity. Is it one we listed above, or a product we hadn’t considered? What kind of methodology do they use? How has it helped you achieve your business goals?

If you were using Alexa.com ranking data, how have your processes changed since moving to a new data source?

We’d love to hear all about it on Twitter or LinkedIn. Not about socials? E-mail us a line at [email protected]
