CollecTor

What is in the data?

The Tor network data provided here comes from five different sources, which are explained in more detail on this page. You may either read through the entire page or jump to the type of data you're most interested in.

Each descriptor provided here contains an @type annotation using the format @type $descriptortype $major.$minor. A tool that processes these descriptors may parse files without metadata or with an unknown descriptor type at its own risk, can safely parse files with a known descriptor type and the same major version number, and should not parse files with a known descriptor type and a higher major version number.
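
The rule above can be sketched as a small check on the first line of a file. This is a minimal illustration, not CollecTor's own code: the KNOWN_TYPES table, the function names, and the three result strings are assumptions made for this example. The text does not say what to do with a lower major version, so the sketch treats that case as "at own risk" too.

```python
import re
from typing import Optional, Tuple

# Hypothetical table: descriptor types this tool understands, mapped to
# the major version it was written for.
KNOWN_TYPES = {"server-descriptor": 1, "extra-info": 1, "tordnsel": 1}

ANNOTATION = re.compile(r"^@type (\S+) (\d+)\.(\d+)$")


def parse_type_annotation(line: str) -> Optional[Tuple[str, int, int]]:
    """Return (descriptor type, major, minor), or None if the line is no annotation."""
    match = ANNOTATION.match(line.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


def parse_decision(first_line: str) -> str:
    """Map the first line of a file to 'safe', 'refuse', or 'at-own-risk'."""
    parsed = parse_type_annotation(first_line)
    if parsed is None:
        return "at-own-risk"  # no metadata at all
    name, major, _minor = parsed
    if name not in KNOWN_TYPES:
        return "at-own-risk"  # unknown descriptor type
    if major == KNOWN_TYPES[name]:
        return "safe"  # known type, same major version
    if major > KNOWN_TYPES[name]:
        return "refuse"  # known type, higher major version
    return "at-own-risk"  # lower major version: assumption, not covered by the text
```

A tool would call parse_decision() on the first line of each file and skip the file when the result is "refuse".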

Tor relay descriptors

Relays and directory authorities publish relay descriptors so that clients can select relays for their paths through the Tor network. All these relay descriptors are specified in the Tor directory protocol, version 3 specification document (or in the earlier protocol versions 2 or 1). This page gives a quick overview of which relay descriptors are available.

Server descriptors (archive, recent) @type server-descriptor 1.0

Server descriptors contain information that relays publish about themselves. Tor clients once downloaded this information, but now they use microdescriptors instead. The server descriptors in archive contain one descriptor per file, whereas the files in recent contain all descriptors collected in an hour concatenated into a single file.

Extra-info descriptors (archive, recent) @type extra-info 1.0

Extra-info descriptors contain relay information that Tor clients do not need in order to function. This is self-published, like server descriptors, but not downloaded by clients by default. The extra-info descriptors in archive contain one descriptor per file, whereas the files in recent contain all descriptors collected in an hour concatenated into a single file.

Network status consensuses (archive, recent) @type network-status-consensus-3 1.0

Though Tor relays are decentralized, the directories that track the overall network are not. These central points are called directory authorities, and every hour they publish a document called a consensus, or network status document. The consensus in turn is made up of router status entries containing flags, heuristics used for relay selection, etc.

Network status votes (archive, recent) @type network-status-vote-3 1.0

The directory authorities exchange votes every hour to come up with a common consensus. Vote documents are by far the largest documents provided here.

Directory key certificates (archive) @type dir-key-certificate-3 1.0

The directory authorities sign their votes and the consensus with a key that they publish in a key certificate. These key certificates change once every few months, so they are only available in the archive.

Microdescriptor consensuses (archive, recent) @type network-status-microdesc-consensus-3 1.0

Tor clients used to download all server descriptors of active relays, but now they only download the smaller microdescriptors which are derived from server descriptors. The microdescriptor consensus lists all active relays and references their currently used microdescriptor. The tarballs in archive contain both microdescriptor consensuses and referenced microdescriptors together.

Microdescriptors (archive, recent) @type microdescriptor 1.0

Microdescriptors are minimalistic documents that include only the information necessary for Tor clients to work. The tarballs in archive contain both microdescriptor consensuses and referenced microdescriptors together. The microdescriptors in archive contain one descriptor per file, whereas the files in recent contain all descriptors collected in an hour concatenated into a single file.

Version 2 network statuses (archive) @type network-status-2 1.0

Version 2 network statuses were published by the directory authorities before consensuses were introduced. In contrast to consensuses, each directory authority published its own authoritative view of the network, and clients combined these documents locally. We stopped archiving version 2 network statuses in 2012.

Version 1 directories (archive) @type directory 1.0

The first directory protocol version combined the list of active relays with server descriptors in a single directory document. We stopped archiving version 1 directories in 2007.

Tor bridge descriptors

Bridges and the bridge authority publish bridge descriptors that are used by censored clients to connect to the Tor network. We cannot, however, make bridge descriptors available as we do with relay descriptors, because that would defeat the purpose of making bridges hard to enumerate for censors. We therefore sanitize bridge descriptors by removing all potentially identifying information and publish sanitized versions here. The sanitizing steps are as follows:

  1. Replace bridge identities with their digests: Clients can request a bridge's current descriptor by sending its identity string to the bridge authority. This is a feature to make bridges on dynamic IP addresses useful. Therefore, the original identities (and anything that could be used to derive them) need to be removed from the descriptors. The bridge's RSA-based identity fingerprint is replaced with its SHA-1 hash value, and the bridge's optional base64-encoded Ed25519 master key is replaced with its SHA-256 digest. The idea is to have a consistent replacement that remains stable over months or even years (without keeping a secret for a keyed hash function).
  2. Remove most cryptographic keys and signatures: It would be straightforward to learn the bridge identity from the bridge's public key. Replacing keys with newly generated ones seemed unnecessary (and would involve keeping state over months or years), so most cryptographic keys and signatures are simply removed.
  3. Replace IP address with IP address hash: Of course, IP addresses need to be sanitized, too.
    • IPv4 addresses are replaced with 10.x.x.x, with x.x.x being the 3-byte output of H(IP address | bridge identity | secret)[:3]. The input IP address is the 4-byte binary representation of the bridge's current IP address. The bridge identity is the 20-byte binary representation of the bridge's long-term identity fingerprint. The secret is a 31-byte secure random string that changes once per month for all descriptors and statuses published in that month. H() is SHA-256. The [:3] operator means that we pick the 3 most significant bytes of the result.
    • IPv6 addresses are replaced with [fd9f:2e19:3bcf::xx:xxxx], with xx:xxxx being the hex-formatted 3-byte output of a hash similar to the one described for IPv4 addresses. The only differences are that the input IP address is 16 bytes long and the secret is only 19 bytes long.
  4. Replace contact information: If there is contact information in a descriptor, the contact line is changed to somebody.
  5. Remove pluggable transport addresses and arguments: Bridges may provide transports in addition to the onion-routing protocol and include information about these transports in their extra-info descriptors for BridgeDB. In that case, any IP addresses, TCP ports, or additional arguments are removed, only leaving in the supported transport names.
  6. Append descriptor digests: Descriptors are often referenced by their digest, but that is no longer possible once their content has changed. As a workaround, sanitized descriptors contain a new line router-digest with the hex representation of the SHA-1 of the original descriptor digest excluding the RSA signature and, if the bridge uses an Ed25519 identity, a new line router-digest-sha256 with the base64-encoded SHA-256 of the SHA-256 digest of the original descriptor including all signatures.
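
Steps 1 and 3 above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the sanitizer's actual code: the function names are invented for this example, the | operator is read as plain byte concatenation, the fingerprint is hashed in its 20-byte binary form, and the upper-case hex output is a guess at the formatting.

```python
import hashlib
import ipaddress


def hash_fingerprint(fingerprint_hex: str) -> str:
    """Step 1: replace an RSA identity fingerprint with its SHA-1 hash."""
    # Assumption: the 20 raw fingerprint bytes are hashed, not the hex text.
    return hashlib.sha1(bytes.fromhex(fingerprint_hex)).hexdigest().upper()


def _digest3(packed_ip: bytes, fingerprint_hex: str, secret: bytes) -> bytes:
    """H(IP address | bridge identity | secret)[:3] with H = SHA-256."""
    identity = bytes.fromhex(fingerprint_hex)  # 20-byte binary fingerprint
    return hashlib.sha256(packed_ip + identity + secret).digest()[:3]


def sanitize_ipv4(address: str, fingerprint_hex: str, secret: bytes) -> str:
    """Step 3, IPv4: map the address into 10.x.x.x (secret is 31 bytes)."""
    packed = ipaddress.IPv4Address(address).packed  # 4 bytes
    a, b, c = _digest3(packed, fingerprint_hex, secret)
    return f"10.{a}.{b}.{c}"


def sanitize_ipv6(address: str, fingerprint_hex: str, secret: bytes) -> str:
    """Step 3, IPv6: map the address into [fd9f:2e19:3bcf::xx:xxxx] (secret is 19 bytes)."""
    packed = ipaddress.IPv6Address(address).packed  # 16 bytes
    h = _digest3(packed, fingerprint_hex, secret).hex()  # 6 hex characters
    return f"[fd9f:2e19:3bcf::{h[:2]}:{h[2:]}]"
```

Because the secret changes monthly, the same bridge address maps to the same sanitized address within a month but to an unrelated one in the next month.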

Network statuses (archive, recent) @type bridge-network-status 1.0

Sanitized bridge network statuses are similar to version 2 relay network statuses, but with only a published line in the header and without any lines in the footer. The tarballs in archive contain all bridge descriptors of a given month, not just network statuses.

Server descriptors (archive, recent) @type bridge-server-descriptor 1.1

Bridge server descriptors follow the same format as relay server descriptors, except for the sanitizing steps described above. The tarballs in archive contain all bridge descriptors of a given month, not just server descriptors. These tarballs contain one descriptor per file, whereas the files in recent contain all descriptors collected in an hour concatenated into a single file to reduce the number of files. The format has changed over time to accommodate changes to the sanitizing process, so archived descriptors may carry earlier version numbers.

Extra-info descriptors (archive, recent) @type bridge-extra-info 1.3

Bridge extra-info descriptors follow the same format as relay extra-info descriptors, except for the sanitizing steps described above. The format has changed over time to accommodate changes to the sanitizing process, so archived descriptors may carry earlier version numbers.

The tarballs in archive contain all bridge descriptors of a given month, not just extra-info descriptors. These tarballs contain one descriptor per file, whereas the files in recent contain all descriptors collected in an hour concatenated into a single file to reduce the number of files.

BridgeDB's bridge pool assignments

The bridge distribution service BridgeDB publishes bridge pool assignments describing which bridges it has assigned to which distribution pool. BridgeDB receives bridge network statuses from the bridge authority, assigns these bridges to persistent distribution rings, and hands them out to bridge users. BridgeDB periodically dumps to a local file the list of running bridges, together with information about the rings, subrings, and file buckets to which they are assigned. Sanitized versions of these lists, containing SHA-1 hashes of bridge fingerprints instead of the original fingerprints, are available for statistical analysis.

Bridge pool assignments (archive) @type bridge-pool-assignment 1.0

The document below shows a BridgeDB pool assignment file from March 13, 2011. Every such file begins with a line containing the timestamp when BridgeDB wrote this file. Subsequent lines start with the SHA-1 hash of a bridge fingerprint, followed by ring, subring, and/or file bucket information. There are currently three distributor ring types in BridgeDB:

  1. unallocated: These bridges are not distributed by BridgeDB, but are either reserved for manual distribution or are written to file buckets for distribution via an external tool. If a bridge in the unallocated ring is assigned to a file bucket, this is noted by bucket=$bucketname.
  2. email: These bridges are distributed via an e-mail autoresponder. Bridges can be assigned to subrings by their OR port or relay flag which is defined by port=$port and/or flag=$flag.
  3. https: These bridges are distributed via an HTTPS server. There are multiple https rings to further distribute bridges by IP address ranges, which is denoted by ring=$ring. Bridges in the https ring can also be assigned to subrings by OR port or relay flag, which is defined by port=$port and/or flag=$flag.

bridge-pool-assignment 2011-03-13 14:38:03
00b834117566035736fc6bd4ece950eace8e057a unallocated
00e923e7a8d87d28954fee7503e480f3a03ce4ee email port=443 flag=stable
0103bb5b00ad3102b2dbafe9ce709a0a7c1060e4 https ring=2 port=443 flag=stable
[...]

As of December 8, 2014, bridge pool assignment files are no longer archived.
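
For statistical analysis, a pool assignment file in the format shown above can be read with a short parser. The following is a minimal sketch: the function name and the returned dictionary layout are choices made for this example, and it assumes that every token after the distributor name has the form key=value.

```python
from datetime import datetime
from typing import Dict, Tuple

SAMPLE = """\
bridge-pool-assignment 2011-03-13 14:38:03
00b834117566035736fc6bd4ece950eace8e057a unallocated
00e923e7a8d87d28954fee7503e480f3a03ce4ee email port=443 flag=stable
0103bb5b00ad3102b2dbafe9ce709a0a7c1060e4 https ring=2 port=443 flag=stable
"""


def parse_pool_assignments(text: str) -> Tuple[datetime, Dict[str, Dict[str, str]]]:
    """Return the file timestamp and a mapping from hashed fingerprint to assignment."""
    lines = [line for line in text.splitlines() if line.strip()]
    keyword, _, timestamp = lines[0].partition(" ")
    if keyword != "bridge-pool-assignment":
        raise ValueError("not a bridge pool assignment file")
    written = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")

    bridges: Dict[str, Dict[str, str]] = {}
    for line in lines[1:]:
        hashed_fingerprint, distributor, *pairs = line.split()
        # Remaining tokens such as ring=2, port=443, flag=stable, bucket=...
        details = dict(pair.split("=", 1) for pair in pairs)
        bridges[hashed_fingerprint] = {"distributor": distributor, **details}
    return written, bridges


written, bridges = parse_pool_assignments(SAMPLE)
```

Counting the bridges per distributor is then a matter of iterating over bridges.values() and tallying the "distributor" field.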

TorDNSEL's exit lists

The exit list service TorDNSEL publishes exit lists containing the IP addresses of relays that it found when exiting through them.

Exit lists (archive, recent) @type tordnsel 1.0

Tor Check makes the list of known exits and corresponding exit IP addresses available in a specific format. The document below shows an entry of the exit list written on December 28, 2010 at 15:21:44 UTC. This entry means that the relay with fingerprint 63BA.. which published a descriptor at 07:35:55 and was contained in a version 2 network status from 08:10:11 uses two different IP addresses for exiting. The first address 91.102.152.236 was found in a test performed at 07:10:30. When looking at the corresponding server descriptor, one finds that this is also the IP address on which the relay accepts connections from inside the Tor network. A second test performed at 10:35:30 reveals that the relay also uses IP address 91.102.152.227 for exiting.

ExitNode 63BA28370F543D175173E414D5450590D73E22DC
Published 2010-12-28 07:35:55
LastStatus 2010-12-28 08:10:11
ExitAddress 91.102.152.236 2010-12-28 07:10:30
ExitAddress 91.102.152.227 2010-12-28 10:35:30
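
An entry like the one above can be parsed line by line. This sketch is only an illustration: the ExitEntry class and the parse_exit_list function are names invented here, and it assumes that each entry starts with an ExitNode line and that all timestamps use the format shown in the sample.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

SAMPLE = """\
ExitNode 63BA28370F543D175173E414D5450590D73E22DC
Published 2010-12-28 07:35:55
LastStatus 2010-12-28 08:10:11
ExitAddress 91.102.152.236 2010-12-28 07:10:30
ExitAddress 91.102.152.227 2010-12-28 10:35:30
"""


@dataclass
class ExitEntry:
    fingerprint: str
    published: Optional[datetime] = None
    last_status: Optional[datetime] = None
    # Each item is (exit IP address, time of the test that found it).
    exit_addresses: List[Tuple[str, datetime]] = field(default_factory=list)


def parse_exit_list(text: str) -> List[ExitEntry]:
    entries: List[ExitEntry] = []
    for line in text.splitlines():
        keyword, _, rest = line.strip().partition(" ")
        if keyword == "ExitNode":
            entries.append(ExitEntry(fingerprint=rest))
        elif not entries:
            continue  # skip annotations or headers before the first entry
        elif keyword == "Published":
            entries[-1].published = datetime.strptime(rest, TIMESTAMP_FORMAT)
        elif keyword == "LastStatus":
            entries[-1].last_status = datetime.strptime(rest, TIMESTAMP_FORMAT)
        elif keyword == "ExitAddress":
            address, _, when = rest.partition(" ")
            entries[-1].exit_addresses.append(
                (address, datetime.strptime(when, TIMESTAMP_FORMAT))
            )
    return entries


entries = parse_exit_list(SAMPLE)
```

For the sample entry, entries[0].exit_addresses holds the two addresses together with the times of the tests that found them.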

Torperf's performance data

The performance measurement service Torperf publishes performance data from making simple HTTP requests over the Tor network. Torperf uses a trivial SOCKS client to download files of various sizes over the Tor network and notes how long substeps take.

Torperf measurement results (archive, recent) @type torperf 1.0

A Torperf results file contains a single line per Torperf run with key=value pairs. Such a result line is sufficient to learn about 1) the Tor and Torperf configuration, 2) measurement results, and 3) additional information that might help explain the results. Known keys are explained below.

The files in recent accumulate all new Torperf measurements of a given day, which means that they may change throughout the day. This is different from all other files in the recent directory which do not change once they are written.
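
Since every Torperf run is a single line of key=value pairs, a results file can be turned into one dictionary per run. The sketch below assumes that pairs are separated by whitespace and that values contain no spaces. The keys in the sample line are placeholders chosen for this example, not a statement about which keys Torperf actually writes.

```python
from typing import Dict, List


def parse_torperf_line(line: str) -> Dict[str, str]:
    """Split one result line into a dictionary of key=value pairs."""
    result: Dict[str, str] = {}
    for token in line.split():
        key, separator, value = token.partition("=")
        if separator:  # ignore tokens without '='
            result[key] = value
    return result


def parse_torperf_file(text: str) -> List[Dict[str, str]]:
    """Return one dictionary per measurement, skipping blank and @type lines."""
    runs = []
    for line in text.splitlines():
        if line.strip() and not line.startswith("@type"):
            runs.append(parse_torperf_line(line))
    return runs


# Placeholder keys for illustration only.
SAMPLE = "@type torperf 1.0\nEXAMPLEKEY=example OTHERKEY=51200\n"
runs = parse_torperf_file(SAMPLE)
```

Because files in recent may grow throughout the day, a consumer would re-read the current day's file rather than assume it is complete.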