diff --git a/README.md b/README.md index ea9f5bbe744c6bdb7401bcf6e29f300def64357e..42ff8d69ffbe084ff2bc46be31649f801af3612a 100644 --- a/README.md +++ b/README.md @@ -2,188 +2,91 @@ Realistic DNS benchmarking tool which supports multiple transport protocols: + - **DNS-over-TLS (DoT)** + - **DNS-over-HTTPS (DoH)** - UDP - TCP - - DNS-over-TLS (DoT) - - DNS-over-HTTPS (DoH) -*DNS Shotgun is capable of simulating hundreds of thousands of clients.* +*DNS Shotgun is capable of simulating hundreds of thousands of DoT/DoH +clients.* -Every client establishes its own connection when communicating over TCP-based -protocol. This makes the tool uniquely suited for realistic benchmarking since -its traffic patterns are very similar to real clients. +Every client establishes its own connection(s) when communicating over +TCP-based protocol. This makes the tool uniquely suited for realistic DoT/DoH +benchmarks since its traffic patterns are very similar to real clients. -## Current status (2020-09-14) +DNS Shotgun exports a number of statistics, such as query latencies, number of +handshakes and connections, response rate, response codes etc. in JSON format. +The toolchain also provides scripts that can plot these into readable charts. -- fully supported UDP, TCP and DNS-over-TLS with - [dnsjit](https://github.com/DNS-OARC/dnsjit) 1.0.0 -- fully supported DNS-over-HTTPS with development version of dnsjit -- traffic can be replayed only over IPv6 -- user interface - - may be unstable - - only very basic UI available - - more complex scenarios are no supported yet - (e.g. simultaneously using multiple protocols) -- pellet.py is functional, but it is very slow and requires python-dpkt from - master +## Features -## Overview +- Supports DNS over UDP, TCP, TLS and HTTP/2 +- Allows mixed-protocol simultaneous benchmark/testing +- Can bind to multiple source IP addresses +- Customizable client behaviour (idle time, TLS versions, HTTP method, ...) 
+- Replays captured queries over selected protocol(s) while keeping original timing +- Suitable for high-performance realistic benchmarks +- Tools to plot charts from output data to evaluate results -DNS Shotgun is capable of simulating real client behaviour by replaying -captured traffic over selected protocol(s). The timing of original queries as -well as their content is kept intact. +## Caveats -This tool requires large amount of source PCAPs. These are ideally captured -directly on your network to simulate the behaviour of your own clients. The -captured PCAPs are then pre-processed into DNS Shotgun "pellets", which are -input files that contain the selected amount of simulated clients based on the -original traffic. +- Requires captured traffic from clients +- Setup for proper benchmarks can be quite complex +- Isn't suitable for testing with very low number of clients/queries +- Backward compatibility between versions isn't kept -Realistic high-performance benchmarking requires complex setup, especially for -TCP-based protocols. However, the authors of this tool have successfully used it -to benchmark and test various DNS implementations with up to hundreds of -thousands of clients (meaning _connections_ for TCP-based transports) using -commodity hardware. +## Documentation -## Input data +[https://knot.pages.nic.cz/shotgun](https://knot.pages.nic.cz/shotgun) -To have a realistic simulation of clients, no synthetic queries are created. -Instead, an input PCAP must be provided. There are the following assumptions: +## Showcase -- Each IP address represents a unique client. -- The packets are ordered by ascending time. -- Only UDP packets arriving to port 53 are used. +The following charts highlight the unique capabilities of DNS Shotgun. +Measurements are demonstrated using DNS over TCP. In our test setup, DNS +Shotgun was able to keep sending/receiving: -The PCAP is then sliced into the requested time periods, and DNS queries are -collected for each client. 
The output PCAP contains the exact same queries, -only the msgid is renumbered to be sequential (to avoid issues with multiple -in-flight TCP queries with potentially the same msgid). +- 400k queries per second over +- **500k simultaneously active TCP connections**, with about +- 25k handshakes per second, which amounts to +- 1.6M total established connections during the 60s test runtime. -The input data can be created with: + + -``` -./pellet.py input.pcap -c CLIENTS -t TIME -r RESOLVER_IP -``` - -where `CLIENTS` is the number of required clients and `TIME` is the selected -time period. `RESOLVER_IP` is necessary to extract only the traffic towards the -resolver and not other upstream servers. +### Socket statistics on server -## Replaying the traffic - -### UDP - -``` -./shotgun.lua -P udp -p 53 -s "::1" pellets.pcap ``` +# ss -s +Total: 498799 (kernel 0) +TCP: 498678 (estab 498466, closed 52, orphaned 0, synrecv 0, timewait 54/0), ports 0 -### TCP - -``` -./shotgun.lua -P tcp -p 53 -s "::1" pellets.pcap -./shotgun.lua -P tcp -p 53 -s "::1" -e 0 pellets.pcap # no idle timeout -``` - -### DNS-over-TLS (DoT) - +Transport Total IP IPv6 +* 0 - - +RAW 4 1 3 +UDP 19 2 17 +TCP 498626 5 498621 +INET 498649 8 498641 +FRAG 0 0 0 ``` -./shotgun.lua -P dot -p 853 -s "::1" pellets.pcap -./shotgun.lua -P dot -p 853 -s "::1" --tls-priority "NORMAL:-VERS-ALL:+VERS-TLS1.3" pellets.pcap -./shotgun.lua -P dot -p 853 -s "::1" --tls-priority "NORMAL:%NO_TICKETS" pellets.pcap -``` - -### DNS-over-HTTPS (DoH) - -``` -./shotgun.lua -P doh -p 443 -s "::1" --tls-priority "NORMAL:-VERS-ALL:+VERS-TLS1.3" pellets.pcap -./shotgun.lua -P doh -p 443 -s "::1" --tls-priority "NORMAL:-VERS-ALL:+VERS-TLS1.3" -M POST pellets.pcap -``` - -### High-performance benchmarking - -``` -./shotgun.lua \ - -P tcp \ - -s "fd00:dead:beef::cafe" \ - -T 15 \ - --bind-pattern "fd00:dead:beef::%x" \ - --bind-num 8 \ - pellets.pcap -``` - -To be able to scale-up to hundreds of thousands of TCP connections, multiple 
-source IP addresses are needed. It's possible to utilize [unique-local -addresses](https://en.wikipedia.org/wiki/Unique_local_address) in IPv6. Our rule -of thumb is to use one IP per every 30k clients (when the port range is extended -to allow 60k ephemeral ports). - -Check out the kernel documentation for tuning the network stack for TCP. Other tips: - -``` -ulimit -n 1000000 -sysctl -w net.ipv4.ip_local_port_range="1025 60999" -stsctl -w net.core.rmem_default="8192000" -``` - -The entire setup process is quite complex and repetitive when taking multiple -measurements. There is some ansible automation for DNS Shotgun in the -[resolver-benchmarking](https://gitlab.nic.cz/knot/resolver-benchmarking) -repository. - -## Docker container - -For ease of use, docker container with shotgun is available. Note that running -``--privileged`` can improve its performance by a few percent, if you don't mind -the security risk. - -``` -docker run registry.nic.cz/knot/shotgun:v20200914 --help -``` - -The following example can be used to test the prototype to simulate UDP clients. - -Process captured PCAP and extract clients 50k clients within 30 seconds of traffic: - -``` -docker run \ - -v "$PWD:/data:rw" \ - registry.nic.cz/knot/shotgun/pellet:v20200914 \ - -o /data/pellets.pcap \ - -c 1000 \ - -t 10 \ - -r $RESOLVER_IP \ - /data/captured.pcap -``` - -Replay the clients against IPv6 localhost server: - -``` -docker run \ - --network host \ - -v "$PWD:/data:rw" \ - registry.nic.cz/knot/shotgun:v20200914 \ - -O /data \ - -s "::1" \ - /data/pellets.pcap -``` - -## Interpreting the results -DNS Shotgun's output is one JSON file per every thread. These can be merged -together and then various plots describing the latencies, connection statistics -etc. can be generated using our utility scripts in the `tools/` directory. 
+### Test setup -## Dependencies +- DNS over TCP against [TCP echo server](https://gitlab.nic.cz/knot/echo-server) +- two physical servers: one for DNS Shotgun, another for the echo server +- both servers have 16 CPUs, 32 GB RAM, 10GbE network card (up to 64 queues) +- servers were connected directly to each other - no latency +- TCP network stack was tuned and there was no firewall -When using the sources, the following dependencies are needed. +## License -### pellet.py +DNS Shotgun is released under GPLv3 or later. -- python3 -- python-dpkt (latest from git, commit 2c6aada35 or newer) -- python-dnspython +## Thanks -### shotgun.lua +We'd like to thank the [Comcast Innovation +Fund](https://innovationfund.comcast.com) for sponsoring the work to support +the use of TCP, DoT and DoH protocols. -- dnsjit 1.0.0 for UDP, TCP and DoT -- development version of dnsjit for DoH +DNS Shotgun is built on top of the [dnsjit](https://github.com/DNS-OARC/dnsjit) +engine. We'd like to thank DNS-OARC and Jerry Lundström for the development and +continued support of dnsjit. diff --git a/docs/analyzing-clients.md b/docs/analyzing-clients.md new file mode 100644 index 0000000000000000000000000000000000000000..2842603c6c8f1babc62e25310da8b96fdec66779 --- /dev/null +++ b/docs/analyzing-clients.md @@ -0,0 +1,59 @@ +# Analyzing Clients + +When you've created a pellets file that is ready to use for DNS Shotgun replay, +you may want to verify you didn't distort the original client population. There +is a tool that can be used to compare client distribution and activity between +the original traffic capture and the pellets file. + +!!! note + This step is optional and may not be necessary for larger client + populations or for client populations with similar behaviour. Nevertheless, + it's better to check your assumptions. + +First, you need to run the client analysis script for both the original capture (or +rather the `filtered.pcap` file) and the processed pellets file. 
+ +``` +$ pcap/count-packets-per-ip.lua -r filtered.pcap --csv filtered.csv +$ pcap/count-packets-per-ip.lua -r pellets.pcap --csv pellets.csv +``` + +Then, you can use another tool to plot a chart of these results. + +``` +$ tools/plot-client-distribution.py -o clients.png filtered.csv pellets.csv +``` + +## Client distribution chart + +The following chart demonstrates how queries are distributed among clients. It +can be used to read how active your clients are or how many overall queries +your resolver receives from which clients. + +!!! warning + The following chart displays the absolute number of queries, not QPS. When + comparing multiple distributions, always make sure to use PCAPs of the same + duration. + + + +There are several blobs on the chart that represent groups of clients. The area +of the blob visually signifies the total amount of queries that were received +from these clients. + +For each blob, you can locate its center and read its X and Y axis values. +Please note that both axes are logarithmic. On the Y-axis you can read the mean +number of queries that a client represented in the blob has sent. On the +X-axis, you can read the percentage of clients that are represented by this +blob. + +In the example above, the first blob from the left shows that almost 80 % of +clients send fewer than 10 queries. Around 20 % of clients send between 10 and +100 queries. Even though the remaining clients represent around 1 % of the +total client population, we can see that these clients generate significant +query traffic. + +The comparison shows the two samples are quite similar. If the differences +between them are significant, you may want to consider changing the pellets files. +If you used `pcap/limit-clients.lua` to generate these, using a different +`-s/--seed` might help. 
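To make the reading of the chart more concrete, here is a minimal Python sketch that groups clients by the order of magnitude of their query count and reports, for each group, the share of clients, the share of queries and the mean number of queries per client. The function name `bucket_clients` and the sample counts are made up for illustration. This is not how `tools/plot-client-distribution.py` is implemented, and it does not read the CSV files produced above.

```python
from collections import defaultdict


def bucket_clients(query_counts):
    """Group clients into decades (1-9, 10-99, ...) by their query count.

    Returns a list of (lower_bound, client_share, query_share, mean_queries)
    tuples sorted by lower_bound. Shares are fractions between 0 and 1.
    """
    clients = defaultdict(int)
    queries = defaultdict(int)
    for count in query_counts:
        count = int(count)
        if count < 1:
            continue  # a client without queries never appears in a capture
        decade = len(str(count)) - 1  # 0 for 1-9, 1 for 10-99, ...
        clients[decade] += 1
        queries[decade] += count

    total_clients = sum(clients.values())
    total_queries = sum(queries.values())
    return [
        (
            10 ** decade,
            clients[decade] / total_clients,
            queries[decade] / total_queries,
            queries[decade] / clients[decade],
        )
        for decade in sorted(clients)
    ]


# Hypothetical per-client query counts, one entry per client IP.
sample = [3, 5, 8, 2, 40, 2000]
for lower, client_share, query_share, mean in bucket_clients(sample):
    print(f"{lower}+ queries: {client_share:.0%} of clients, "
          f"{query_share:.0%} of queries, mean {mean:.1f}")
```

With this sample data, a single very active client accounts for most of the queries, which is the kind of imbalance the blobs on the chart make visible.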
diff --git a/docs/capturing-traffic.md b/docs/capturing-traffic.md new file mode 100644 index 0000000000000000000000000000000000000000..1a571bdc876562658f54680f0124f087a1cce7fa --- /dev/null +++ b/docs/capturing-traffic.md @@ -0,0 +1,61 @@ +# Capturing Traffic + +When replaying traffic using DNS Shotgun, you need to provide it with a PCAP +that contains extracted client data, or "*pellets*". You may not use an +arbitrary PCAP file. Instead, you must pre-process the raw PCAP capture into +pellets as described in the following sections. + +!!! note + DNS Shotgun's measurements are only as good as the data you feed it. + Quality of input data that most accurately represents your clients is + crucial for realistic benchmarking. Results can vary greatly for different + client populations. + +## Raw capture assumptions + +To start, you need a traffic capture from your network to work with. It only +needs to contain UDP DNS queries from clients towards your resolver. Other +traffic may be present as well, but it will be filtered out. + +### Packets must be sorted by increasing timestamp + +Some network or hardware conditions may cause the packets to appear in +a different order. To ensure the correct order, use the `reordercap` command from +tshark/wireshark. + +``` +$ reordercap raw.pcap ordered.pcap +``` + +### Unique IP means unique client + +Clients need to be identified somehow in the captured traffic. We decided to +use the IP address to tell clients apart. This should be a reasonable assumption, +unless your clients are behind NAT. + +!!! warning + If your real clients are behind NAT, this has major consequences and should + be accounted for, since multiple real clients will be bundled in a single + simulated one. + +### Only UDP packets are used + +If a large number of your clients already use DoT, DoH or TCP, you need to +somehow get their queries into plain UDP format. 
For example, Knot Resolver can +[mirror](https://knot-resolver.readthedocs.io/en/v5.2.1/modules-policy.html#policy.MIRROR) +incoming queries to UDP. + +## Filtering DNS queries + +In this step, UDP DNS queries from clients are extracted from the raw PCAP. If +the raw capture includes queries from the resolver to upstream servers, it is +_crucial_ to provide the script with the resolver IP address(es) to filter out +outgoing queries. + +``` +$ pcap/filter-dnsq.lua -r ordered.pcap -w filtered.pcap -a $RESOLVER_IP +``` + +!!! tip + You may also use this script to work with traffic captured directly from + an interface chosen with `-i`. See `--help` for usage. diff --git a/docs/configuration-file.md b/docs/configuration-file.md new file mode 100644 index 0000000000000000000000000000000000000000..8f1d8f4b94e5cb01bd650260fc36008443ae4cdc --- /dev/null +++ b/docs/configuration-file.md @@ -0,0 +1,158 @@ +# Configuration File + +!!! tip + You can find configuration files for presets in + [`config/`](https://gitlab.nic.cz/knot/shotgun/-/tree/master/config). They + are an excellent starting point to create your own configurations. + +Configuration is written in [TOML](https://toml.io/en/). There are multiple sections that may have additional subsections. + +- `[traffic]` contains one or more subsections that each define client behaviour, including protocol +- `[charts]` is an optional section which can contain subsections that define charts that should be automatically plotted +- `[defaults.traffic]` is an optional section that makes it possible to specify defaults shared by all traffic senders + +## [traffic] section + +You can define one or more traffic senders with specific client behaviour. Every traffic sender has a name and may have multiple parameters. At the very least, each traffic sender must define `protocol`. + +This is an example of a minimal configuration file sending all traffic as DNS-over-TLS using defaults for everything. The name of the traffic sender here is "DoT". 
+ +``` +[traffic] +[traffic.DoT] +protocol = "dot" +``` + +The following configuration parameters for traffic senders are supported. + +### protocol + +- `udp`: DNS over UDP +- `tcp`: DNS over TCP +- `dot`: DNS over TLS over TCP +- `doh`: DNS over HTTP/2 over TLS over TCP + +### weight + +When multiple traffic senders are defined, weight affects the client +distribution between them. Weight is relative to the sum of all weights. + +Integer or float. Defaults to 1. + + +### idle_time_s + +Determines whether clients keep the connection in idle state, i.e. leaving it +established after they have received all answers and currently have no more +queries to send. Idle time of 0 means the client will close the connection as +soon as possible. + +Integer. Defaults to 10 seconds. + +### gnutls_priority + +[GnuTLS priority string](https://gnutls.org/manual/html_node/Priority-Strings.html) +which can be used to select TLS protocol version and features, for example: + +``` +gnutls_priority = "NORMAL:%NO_TICKETS" # don't use TLS Session Resumption +gnutls_priority = "NORMAL:-VERS-ALL:+VERS-TLS1.3" # only use TLS 1.3 +``` + +String. Defaults to `NORMAL` which is determined by the system's GnuTLS library. + +### http_method + +- `GET` +- `POST` + +### timeout_s + +Individual query timeout in seconds. + +Integer. Defaults to 2 seconds. + +!!! warning + Increasing the query timeout can negatively impact DNS Shotgun's + performance and is not recommended. + +### handshake_timeout_s + +Timeout for establishing a connection in seconds. + +Integer. Defaults to 5 seconds. + +### Advanced settings + +You shouldn't use these unless you need to. 
+ +- `cpu_factor`: override the default CPU thread distribution (UDP: 1, TCP: 2, DoT/DoH: 3) +- `max_clients`: number of clients each dnssim instance can hold (per-thread settings) +- `channel_size`: number of queries that can be buffered before the thread starts to block +- `batch_size`: number of queries processed in each loop + +### CLI overrides + +The following options can be used to override the CLI options for `replay.py`. +Values in the configuration file always take precedence over CLI options. + +- `server`: target server's IPv4/IPv6 address +- `dns_port`: target server's port for plain DNS (UDP and TCP) +- `dot_port`: target server's port for DNS-over-TLS +- `doh_port`: target server's port for DNS-over-HTTPS + +## [charts] section + +This section is optional and is only provided as a convenience to automate +plotting charts after the test. Anything defined in this section can be +achieved by using the plotting scripts directly. + +Similarly to the `[traffic]` section, it also contains named subsections. Every +such subsection must contain `type` which determines the charts that should be +plotted. For example: + +``` +[charts] +[charts.response-rate] +type = "response-rate" +``` + +### type + +Type determines which chart will be plotted. The following charts are supported: + +- `response-rate`: [Response Rate Chart](response-rate-chart.md) +- `latency`: [Latency Histogram](latency-histogram.md) +- `connections`: [Connection Chart](connection-chart.md) + +### title + +Title of the chart. + +### output + +Output filename for the chart. Various file extensions can be used. Defaults to using svg. + +### Other parameters + +These depend on the specific chart type. Generally, any option that can be +passed directly to the plotting scripts can also be specified in the config. +Refer to the tools' `--help` for possible options. + +## [defaults] section + +### [defaults.traffic] section + +This section can provide defaults for all traffic senders. 
If a specific +traffic sender re-defines the same parameter, the traffic sender-specific value +takes precedence over the default value. + +Any parameter that can be specified for traffic senders in the `[traffic]` section +can also be specified in this section. For example, to override the default +behaviour so that TLS Session Resumption is not used, you can use: + +``` +[defaults] +[defaults.traffic] +gnutls_priority = "NORMAL:%NO_TICKETS" +``` diff --git a/docs/configuration-presets.md b/docs/configuration-presets.md new file mode 100644 index 0000000000000000000000000000000000000000..c707148a4f012ee4d5b9333c310303786d2b8d12 --- /dev/null +++ b/docs/configuration-presets.md @@ -0,0 +1,33 @@ +# Configuration Presets + +You can either use a configuration preset or create your own configuration. It +is possible to replay the original traffic over various protocols +with different client behaviours simultaneously. For example, you can split +your traffic into 60 % UDP, 20 % DoT and 20 % DoH. + +There are the following predefined use-cases for simplicity of use without the +need to create a configuration file. You can pass these values instead of +a filepath to the `-c/--config` option of the `replay.py` utility. + +- `udp` + - 100 % DNS-over-UDP clients +- `tcp` + - 100 % well-behaved DNS-over-TCP clients +- `dot` + - 100 % well-behaved DNS-over-TLS clients using TLS Session Resumption +- `doh` + - 50 % well-behaved DNS-over-HTTPS GET clients using TLS Session Resumption + - 50 % well-behaved DNS-over-HTTPS POST clients using TLS Session Resumption +- `mixed` + - 60 % DNS-over-UDP clients + - 5 % well-behaved DNS-over-TCP clients + - 5 % aggressive DNS-over-TCP clients + - 10 % well-behaved DNS-over-TLS clients using TLS Session Resumption + - 5 % well-behaved DNS-over-TLS clients without TLS Session Resumption + - 10 % well-behaved DNS-over-HTTPS GET clients using TLS Session Resumption + - 5 % well-behaved DNS-over-HTTPS POST clients using TLS Session Resumption + +!!! 
note + You can find configuration files for presets in + [`config/`](https://gitlab.nic.cz/knot/shotgun/-/tree/master/config). They + are an excellent starting point to create your own configurations. diff --git a/docs/connection-chart.md b/docs/connection-chart.md new file mode 100644 index 0000000000000000000000000000000000000000..6b4aa3c0fcb76b6326a3095a38fa48af2b90011f --- /dev/null +++ b/docs/connection-chart.md @@ -0,0 +1,28 @@ +# Connection Chart + +The connection chart can be used to visualize connection-related information, +such as the number of active established connections, handshake attempts, +successful TLS Session Resumptions or failed handshakes. + +``` +$ tools/plot-connections.py -k active -- DoT.json +$ tools/plot-connections.py -k tcp_hs tls_resumed failed_hs -t "Handshakes over Time" DoT.json +``` + +The optional parameter `-k/--kind` can be used to select which data should be +plotted. The following values are supported. + +- `active` means the number of currently active established connections +- `tcp_hs` means the number of TCP handshake attempts in the last second +- `failed_hs` means the number of failed handshakes. All kinds of connection + setup failures will be included, whether it's a TCP handshake timeout, a TLS + negotiation failure or anything else. +- `tls_resumed` means the number of connections that were resumed with TLS + Session Resumption during the last second + +!!! tip + Using the `--` to separate a list of JSON files after specifying + `-k/--kind` might be needed in some cases. + + + diff --git a/docs/extracting-clients.md b/docs/extracting-clients.md new file mode 100644 index 0000000000000000000000000000000000000000..50d517a2c80e1cd2c823596de4568633089281e4 --- /dev/null +++ b/docs/extracting-clients.md @@ -0,0 +1,59 @@ +# Extracting Clients + +Once you have the `filtered.pcap` with DNS queries from clients, you can +process them into *pellets* - the pre-processed input files for DNS Shotgun. 
+All the content of these files will be used during the replay stage - all +clients for the entire duration of the file. + +The following example takes the entire `filtered.pcap` and transforms it into +pellets. The pellets file will contain all the clients and it will have the +same duration as the original file. + +``` +$ pcap/extract-clients.lua -r filtered.pcap -O $OUTPUT_DIR +``` + +The produced pellets file is ready to be used as the input for DNS Shotgun +replay. + +## Splitting original capture into multiple pellets files + +It can be useful to have a long original capture file, which contains more +clients and queries. However, since the pellets file will be replayed in its +entirety, you may want to split the original file into multiple pellets files +with a shorter duration. + +For example, if your initial capture file is 30 minutes long, you could +split it into fifteen two-minute pellets files with the `-d/--duration` option. + +``` +$ pcap/extract-clients.lua -r filtered.pcap -O $OUTPUT_DIR -d 120 +``` + +!!! tip + It is useful to keep a collection of these original pellets files of the same + duration. They can later be combined to create different test cases. + +## Scaling-up the traffic + +If you want to stress-test your infrastructure, you can combine these pellets +files together to effectively scale-up the traffic. The pellets files are +created in a way that you can simply use the `mergecap` utility to combine them. + +``` +$ mergecap -w scaled.pcap $OUTPUT_DIR/* +``` + +## Limiting the traffic + +It is also possible to take a pellets file and scale-down its traffic. This is +done on a per-client basis. Either a client's entire query stream will be +present, or the client won't be present at all. + +To limit the overall traffic, you can select the portion of the clients that +should be included. This can range from 0 to 1. For example, let's suppose we +want to scale-down the number of clients in the pellets file to 30 %. 
+ +``` +$ pcap/limit-clients.lua -r pellets.pcap -w limited.pcap -l 0.3 +``` diff --git a/docs/img/clients.png b/docs/img/clients.png new file mode 100644 index 0000000000000000000000000000000000000000..22bb4962003550192e611fd7d48d2d67d92599f8 Binary files /dev/null and b/docs/img/clients.png differ diff --git a/docs/img/connections.png b/docs/img/connections.png new file mode 100644 index 0000000000000000000000000000000000000000..ce47935881138bd3ea592fdd755c4222f010b6b7 Binary files /dev/null and b/docs/img/connections.png differ diff --git a/docs/img/handshakes.png b/docs/img/handshakes.png new file mode 100644 index 0000000000000000000000000000000000000000..c6dd6746e8cc7dd972a6e8de64090b54febf1edb Binary files /dev/null and b/docs/img/handshakes.png differ diff --git a/docs/img/latency.png b/docs/img/latency.png new file mode 100644 index 0000000000000000000000000000000000000000..306b894469f221c0d43508c6b28a362f4663ba40 Binary files /dev/null and b/docs/img/latency.png differ diff --git a/docs/img/response-rate.png b/docs/img/response-rate.png new file mode 100644 index 0000000000000000000000000000000000000000..93fac17ae87ba58e3c65450d01c39c19e1104098 Binary files /dev/null and b/docs/img/response-rate.png differ diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000000000000000000000000000000000000..372c266344992627f29188e690db1d868b79047f --- /dev/null +++ b/docs/index.md @@ -0,0 +1,40 @@ +# DNS Shotgun + +Realistic DNS benchmarking tool which supports multiple transport protocols: + + - **DNS-over-TLS (DoT)** + - **DNS-over-HTTPS (DoH)** + - UDP + - TCP + +*DNS Shotgun is capable of simulating hundreds of thousands of DoT/DoH +clients.* + +Every client establishes its own connection(s) when communicating over +TCP-based protocol. This makes the tool uniquely suited for realistic DoT/DoH +benchmarks since its traffic patterns are very similar to real clients. 
+ +DNS Shotgun exports a number of statistics, such as query latencies, number of +handshakes and connections, response rate, response codes etc. in JSON format. +The toolchain also provides scripts that can plot these into readable charts. + +## Features + +- Supports DNS over UDP, TCP, TLS and HTTP/2 +- Allows mixed-protocol simultaneous benchmark/testing +- Can bind to multiple source IP addresses +- Customizable client behaviour (idle time, TLS versions, HTTP method, ...) +- Replays captured queries over selected protocol(s) while keeping original timing +- Suitable for high-performance realistic benchmarks +- Tools to plot charts from output data to evaluate results + +## Caveats + +- Requires captured traffic from clients +- Setup for proper benchmarks can be quite complex +- Isn't suitable for testing with very low number of clients/queries +- Backward compatibility between versions isn't kept + +## Code Repository + +[https://gitlab.nic.cz/knot/shotgun](https://gitlab.nic.cz/knot/shotgun) diff --git a/docs/installation.md b/docs/installation.md new file mode 100644 index 0000000000000000000000000000000000000000..fab12b1c4a8a9672ad6cb55ad28315c9c2e174e6 --- /dev/null +++ b/docs/installation.md @@ -0,0 +1,53 @@ +# Installation + +There are two options for using DNS Shotgun. You can either install the +dependencies and use the scripts from the repository directly, or use a +pre-built Docker image. + +## Using script directly + +You can use the toolchain scripts directly from the git repository. You need to +ensure you have the required dependencies installed. Also make sure to check +out a tagged version, as development happens in the master branch. + +``` +$ git clone https://gitlab.nic.cz/knot/shotgun.git && cd shotgun +$ git checkout v20210203 +``` + +### Dependencies + +When using the scripts directly, the following dependencies are needed. If you +only wish to process shotgun JSON output (e.g. plot charts), then dnsjit isn't +required. 
+ +- [dnsjit](https://github.com/DNS-OARC/dnsjit): Can be installed from [DNS-OARC + repositories](https://dev.dns-oarc.net/packages/). +- Python 3.6 or later +- Python dependencies from [requirements.txt](https://gitlab.nic.cz/knot/shotgun/-/blob/master/requirements.txt) +- (optional) tshark/wireshark for some PCAP pre-processing + +## Docker Image + +Pre-built image can be obtained from [CZ.NIC DNS Shotgun +Registry](https://gitlab.nic.cz/knot/shotgun/container_registry/65). + +``` +$ docker pull registry.nic.cz/knot/shotgun:v20210203 +``` + +Alternately, you can build the image yourself from Dockerfile in the repository. + +### Docker Usage + +- Make sure to run with `--network host`. +- Mount input/output directories and files with `-v/--volume`. +- Using `--privileged` might slightly improve performance if you don't mind the security risk. + +``` +$ docker run \ + --network host \ + -v "$PWD:/mnt" \ + registry.nic.cz/knot/shotgun:v20210203 \ + $COMMAND +``` diff --git a/docs/key-concepts.md b/docs/key-concepts.md new file mode 100644 index 0000000000000000000000000000000000000000..a3ae24ea19fe3860190334c0406707797f6e15c4 --- /dev/null +++ b/docs/key-concepts.md @@ -0,0 +1,70 @@ +# Key Concepts + +DNS Shotgun is capable of simulating real client behaviour by replaying +captured traffic over selected protocol(s). The timing of original queries as +well as their content is kept intact. + +Realistic high-performance benchmarking requires complex setup, especially for +TCP-based protocols. However, the authors of this tool have successfully used it +to benchmark and test various DNS implementations with up to hundreds of +thousands of clients (meaning _connections_ for TCP-based transports) using +commodity hardware. This requires [performance tuning](performance-tuning.md) +that is described in later section. + +## Client + +These docs often mention "*client*" and we often use it to describe DNS +infrastructure throughput in addition to queries per second (QPS). 
What is +considered a client and why does it matter? + +A client is the origin of one or more queries and it is supposed to represent a +single device, i.e. anything from a CPE such as a home/office router to a mobile +device. Since traffic patterns of various devices can vary greatly, it is +crucial to use traffic that most accurately represents your real clients. + +In plain DNS sent over UDP the concept of a client doesn't matter, since UDP is a +stateless protocol and a packet is just a packet. Thus, QPS throughput may be +a sufficient metric for UDP. + +In stateful DNS protocols, such as DoT, DoH or TCP, much of the overhead and +performance cost is caused by establishing the connection over which queries +are subsequently sent. Therefore, the concept of a client becomes crucial for +benchmarking stateful protocols. + +!!! note + As an extreme example, consider 10k QPS sent over a single DoH connection + versus establishing 10k DoH connections, each with 1 QPS. While both + scenarios have the same overall QPS, the second one will consume vastly more + resources, especially when establishing the connections. + +### Client replay guarantees + +DNS Shotgun aims to provide the most realistic client behaviour when replaying +the traffic. When you run DNS Shotgun, there are the following guarantees when +using a stateful protocol. + +- **Multiple clients never share a single connection.** +- **Each client attempts to establish at least one connection.** +- **A client may have zero, one or more (rarely) active established connections + at any time**, depending on its traffic and behavior. + +## Real traffic + +A key focus of this toolchain is to make the benchmarks as realistic as +possible. Therefore, no synthetic queries or clients are generated. To +effectively use this tool, you need to have a large amount of source PCAPs. +Ideally, these contain the traffic from your own network. + +!!! 
note + In case you'd prefer to use synthetic clients/queries anyway, you can just + generate the traffic and capture it in a PCAP for further processing. Doing that + is outside of the scope of this documentation. + +### Traffic replay guarantees + +- **Content of DNS messages is left intact.** Messages without a proper DNS header + or question section will be discarded. +- **Timing of the DNS messages is kept as close to the original traffic as + possible.** If the tool detects a time skew larger than one second, it aborts the + test. However, real time difference may be slightly longer due to various + buffers. diff --git a/docs/latency-histogram.md b/docs/latency-histogram.md new file mode 100644 index 0000000000000000000000000000000000000000..d513e263696470508d514cd0a587cb5ffc714d88 --- /dev/null +++ b/docs/latency-histogram.md @@ -0,0 +1,40 @@ +# Latency Histogram + +This very useful chart is a bit difficult to read and understand, but it +provides a great deal of information about the overall latency from the client-side +perspective. We use the logarithmic percentile histogram to display this data. +[This +article](https://blog.powerdns.com/2017/11/02/dns-performance-metrics-the-logarithmic-percentile-histogram/) +provides an in-depth explanation about the chart and how to interpret it. + +``` +$ tools/plot-latency.py -t "DNS Latency Overhead" UDP.json TCP.json DoT.json DoH.json +``` + + + +The chart above illustrates why comparing just the response rate isn't a +sufficient metric. For all protocols compared in this case, you'd get around +99.5 % response rate. However, when you examine the client latency, you can see +clear differences. + +In the chart, 80 % of all queries are represented by the rightmost part of the +chart - between the "slowest percentile" of 20 % and 100 %. For these +queries, the latency for UDP, TCP, DoT or DoH is the same, which is one +round trip. These represent immediate answers from the resolver (e.g. 
cached or +refused), which are sent either over UDP or over an already established +connection (for stateful protocols). The latency is 10 ms, or 1 RTT. + +The most interesting part is between the 5 % and 20 % slowest percentile. For +these 15 % of all queries, there are major differences between the latency of +UDP, TCP and DoT/DoH. This illustrates the latency cost of setting up a +connection where none is present. UDP is stateless and requires just 1 RTT. TCP +requires an extra round trip to establish the connection and the latency for the +client becomes 2 RTTs. Finally, both DoT and DoH require an additional round +trip for the TLS handshake and thus the overall latency cost becomes 3 RTTs. + +The trailing 5 % of queries show no difference between protocols, since these +are queries that aren't answered from cache and the delay is introduced by the +communication between the resolver and the upstream servers. The last 0.5 % of +queries aren't answered by the resolver within 2 seconds and are considered a +timeout by the client. diff --git a/docs/performance-tuning.md b/docs/performance-tuning.md new file mode 100644 index 0000000000000000000000000000000000000000..cc34185e5076377a71d6972ab25c160821f6ee86 --- /dev/null +++ b/docs/performance-tuning.md @@ -0,0 +1,66 @@ +# Performance Tuning + +Any high-performance benchmark setup requires a separate server for generating +traffic which then sends the traffic to the target server under test. In order +to scale up DNS Shotgun to be able to perform well under heavy load, some +performance tuning and network adjustments are needed. + +!!! tip + An example of performance tuning we use in our benchmarks can be found in + our [ansible + role](https://gitlab.nic.cz/knot/resolver-benchmarking/-/tree/master/roles/tuning). + +## Number of file descriptors + +Make sure the number of available file descriptors is sufficient. It's +typically necessary when running DNS Shotgun from a terminal. 
When using Docker, +the defaults are usually sufficient. + +``` +$ ulimit -n 1000000 +``` + +## Ephemeral port range + +Extending the ephemeral port range gives the tool more outgoing ports to work with. + +``` +$ sysctl -w net.ipv4.ip_local_port_range="1025 60999" +``` + +## NIC queues + +High-end network cards typically have multiple queues. Ideally, you want to set +their number to be the same as the number of available CPUs. + +``` +$ ethtool -L $INTERFACE combined $NCPU +``` + +!!! note + It's important that the NIC interrupts from different queues are handled + by different CPUs. If there are throughput issues, you may want to verify + this is the case. + +## UDP + +DNS Shotgun can generate quite bursty traffic. If the receiving server's socket +buffer isn't sufficient, these bursts can cause packet loss. Increasing the +receiving server's socket memory can help to prevent that. + +``` +$ sysctl -w net.core.rmem_default="8192000" +``` + +## TCP, DoT, DoH + +Tuning the network stack for TCP isn't as straightforward and it's network-card +specific. It's best to refer to [kernel +documentation](https://www.kernel.org/doc/html/latest/networking/device_drivers/ethernet/intel/ixgb.html#improving-performance) +for your specific network card. + +## conntrack + +For our benchmarks, we don't use iptables or any firewall. Especially the +`conntrack` module probably won't be able to handle serious load. Make sure the +conntrack module isn't loaded by the kernel if you're not using it. diff --git a/docs/raw-output.md b/docs/raw-output.md new file mode 100644 index 0000000000000000000000000000000000000000..1000a06898eea9f3deb3f41096c7f7878955c2ae --- /dev/null +++ b/docs/raw-output.md @@ -0,0 +1,61 @@ +# Raw Output + +In the output directory of DNS Shotgun's `replay.py` tool, the following +structure is created. Let's assume we ran a configuration that configures two +traffic senders - `DoT` and `DoH`. 
+ +``` +$OUTDIR +├── .config # ignore this directory +│ └── luaconfig.lua # for debugging purposes only +├── data # directory with raw JSON output +│ ├── DoH # "DoH" traffic sender data +│ │ ├── DoH-01.json # raw data from first thread of DoH traffic sender +│ │ ├── DoH-02.json # raw data from second thread of DoH traffic sender +│ │ └── ... # raw data from other threads of DoH traffic sender +│ ├── DoH.json # merged raw data from all DoH sender threads +│ ├── DoT # "DoT" traffic sender data +│ │ ├── DoT-01.json # raw data from first thread of DoT traffic sender +│ │ ├── DoT-02.json # raw data from second thread of DoT traffic sender +│ │ └── ... # raw data from other threads of DoT traffic sender +│ └── DoT.json # merged raw data from all DoT sender threads +└── charts # directory with automatically plotted charts (if configured) + ├── latency.svg # chart comparing latency of DoT and DoH clients + └── response-rate.svg # chart comparing the response rate of DoT and DoH clients +``` + +## data directory + +This directory contains the raw JSON data. Since DNS Shotgun typically operates +with multiple threads, the results for each traffic sender are also provided +per thread. However, since you typically don't care about which thread the +clients were emulated by, but only about their aggregate behaviour, a data file that contains +the combined results of all threads belonging to the configured traffic sender +is also provided. + +Every configured traffic sender will have its own output directory of the same +name. Inside, per-thread raw data are available. The aggregate file is directly +in the `data/` directory as a JSON file with the name of the configured traffic +sender. The aggregate file is the one you typically want to use. + +!!! note + The raw JSON file is versioned and is not intended to be forward or + backward compatible with various DNS Shotgun versions. You should use the + same version of the toolchain for both replay and interpreting the data. + +!!! 
tip + If you wish to explore, format or interpret the raw JSON data, the + [jq](https://stedolan.github.io/jq/) utility can be useful for some + rudimentary processing. + +## charts directory + +This directory may not be present if you didn't configure any charts to be +automatically plotted in the configuration file. If it is available, it +contains the plotted charts that are described in the following sections. + +When charts are plotted automatically, they always display data for all the +configured traffic senders with their predefined names. If you wish to customize +it, omit certain senders etc., you can use the plotting scripts +directly from CLI. These can be found in the `tools/` directory and you can +refer to their `--help` for usage. diff --git a/docs/replaying-traffic.md b/docs/replaying-traffic.md new file mode 100644 index 0000000000000000000000000000000000000000..503ad247d898f8d54552e2e341bd83a29fe94be5 --- /dev/null +++ b/docs/replaying-traffic.md @@ -0,0 +1,83 @@ +# Replaying Traffic + +Once you've prepared the input pellets file with clients and either have your +own configuration file or know which preset you want to use, you can use the +following script to run DNS Shotgun. + +``` +$ replay.py -r pellets.pcap -c udp -s ::1 +``` + +!!! tip + Use the `--help` option to explore other options. + +During the replay, there is quite a bit of logging information that looks like +this. + +``` +UDP-01 notice: total processed: 267; answers: 0; discarded: 2; ongoing: 172 +``` + +The important thing to look out for is the number of `discarded` packets. If +nearly all the packets, or a large portion of them, are discarded, it almost +certainly indicates some improper setup or input data. The test should be +aborted and the reason should be investigated. Increasing the `-v/--verbosity` +level might help. 
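Watching the `discarded` counter can also be scripted. The following is a minimal sketch, assuming the log lines keep exactly the format of the single sample line shown above and that `discarded` counts a subset of the `total processed` packets. The helper name `discarded_ratio` is hypothetical and not part of DNS Shotgun.

```python
import re

# Pattern derived from the one sample log line in this document; the exact
# format of other log lines is an assumption.
LOG_RE = re.compile(
    r"^(?P<thread>\S+) notice: total processed: (?P<processed>\d+); "
    r"answers: (?P<answers>\d+); discarded: (?P<discarded>\d+); "
    r"ongoing: (?P<ongoing>\d+)"
)


def discarded_ratio(line):
    """Return discarded/processed for one log line, or None if it doesn't match."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    processed = int(match["processed"])
    if processed == 0:
        # Nothing processed yet, so nothing could have been discarded.
        return 0.0
    return int(match["discarded"]) / processed


if __name__ == "__main__":
    sample = "UDP-01 notice: total processed: 267; answers: 0; discarded: 2; ongoing: 172"
    ratio = discarded_ratio(sample)
    print(f"share of discarded packets: {ratio:.1%}")
```

What share counts as "a large portion" is a judgment call. A ratio close to 1.0 is the case described above, where the test should be aborted and the setup investigated.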
+ +## Binding to multiple source addresses + +When sending traffic against a single IP/port combination of the target server, +the source IP address has a limited number of ports it can utilize. A single +IP address is insufficient to achieve hundreds of thousands of clients. + + +DNS Shotgun can bind to multiple source addresses with the `-b/--bind-net` +option. You can specify either an IP address or a network range using CIDR +notation. Multiple values (either IPs, ranges or any combination of those) can +be specified. When using CIDR notation, the network and broadcast address won't +be used. + + +``` +$ replay.py -r pellets.pcap -c tcp -s fd00:dead:beef::cafe -b fd00:dead:beef::/124 +``` + +!!! tip + Our rule of thumb is to use at least one source IP address for every 30k + clients. However, using more addresses is certainly better and can help to + avoid weird behaviour, slow performance and other issues that require + in-depth troubleshooting. + +!!! note + If you're limited by the number of source addresses you can use, utilizing + either IPv6 unique-local addresses (fd00::/8) or private IPv4 ranges could + be helpful. + +## Emulating link latency + +!!! warning + This is an advanced topic and emulating latency isn't necessary for many + scenarios. + +Overall latency will affect the user's experience with DNS resolution. It also +becomes much more relevant when using TCP and TLS, since the handshakes +introduce additional round trips. When benchmarks are done in the data center +with two servers that are directly connected to each other with practically no +latency, it can provide a skewed view of the expected end user latency. + +Luckily, the `netem` Network Emulator makes it very simple to emulate various +network conditions. For example, emulating latency on the sender side can be +done quite easily. The following command adds 10 ms latency to outgoing +packets, effectively simulating an RTT of 10 ms. 
+ +``` +$ tc qdisc add dev $INTERFACE root netem limit 10000000 delay 10ms +``` + +!!! tip + For more possibilities, refer to `man netem.8`. Using a sufficiently large + buffer (limit) is essential for proper operation. + +However, beware that the settings affect the entire interface. If you're going +to emulate latency, it's best if the resolver-client traffic is on a separate +interface, so the resolver-upstream traffic isn't negatively impacted. diff --git a/docs/response-rate-chart.md b/docs/response-rate-chart.md new file mode 100644 index 0000000000000000000000000000000000000000..13033660b55392fafa0f84431fd3929e6ce26e1c --- /dev/null +++ b/docs/response-rate-chart.md @@ -0,0 +1,20 @@ +# Response Rate Chart + +This basic chart can display the overall response rate over time. It is also +possible to plot specific error codes, such as `NOERROR`. + +``` +$ tools/plot-response-rate.py -r 0 -o rr.png UDP.json +``` + +!!! tip + The image format depends on the output filename extension chosen with + `-o/--output`. `svg` is used by default, but other formats such as `png` + are supported as well. + +The following chart displays the answer rate and the rate of `NOERROR` answers. +In this measurement, the resolver was started with a cold cache. We can see the +overall response rate is close to 100 %. The `NOERROR` response rate slightly +increases over time from 72 % to around 75 % as the cache warms up. 
+ + diff --git a/docs/showcase/connections.png b/docs/showcase/connections.png new file mode 100644 index 0000000000000000000000000000000000000000..912de06772e5186771b1c80540e82a6be32e231a Binary files /dev/null and b/docs/showcase/connections.png differ diff --git a/docs/showcase/handshakes.png b/docs/showcase/handshakes.png new file mode 100644 index 0000000000000000000000000000000000000000..a8e5a50f8ff7eb34344eeec77ee12e28d6d552ce Binary files /dev/null and b/docs/showcase/handshakes.png differ diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 0000000000000000000000000000000000000000..ecb0f9f93828b028a027d0402bfacd9b92942020 --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,37 @@ +# Troubleshooting + +## failed to send udp packet: too many open files + +Increase the number of file descriptors. + +## fatal: aborting, real time drifted ahead of simulated time + +This happens when DNS Shotgun can't keep up with the traffic it's supposed to +send/receive. The tool attempts to keep realistic timing from the original data +and it just aborts if it fails to keep that promise. This can have multiple +causes. + +- You're pushing the tool beyond the limits of what it can do, e.g.: + - Not enough computing power (are all CPUs utilized?) + - Insufficient network throughput (is network tuned properly? are there enough source IPs?) + - Unresponsive resolver and/or too high `timeout_s` +- NIC interrupts aren't properly distributed among CPUs +- A single thread is assigned too much traffic + - This typically shouldn't be the case, but if specific traffic sender is + *always* causing this failure, tweaking `cpu_factor` and/or number of + threads might help + +## critical: buffer capacity exceeded, threads are blocked + +This is an indication that a specific thread filled up its buffer and is now +causing the entire tool to slow down, which will eventually cause the crash +described above if it goes on for too long. 
If it only happens for a specific +traffic sender, tweaking `cpu_factor` to change thread distribution could help. + +## various warnings + +Especially under heavy load, there can occasionally be some warnings. +Sometimes it's a GnuTLS connection error, a mismatched response etc. The general +rule is that a few different warnings during heavy load probably aren't something to +be too concerned about. Typically, it's when the output is spammed by the same +warning over and over that you have a problem. diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000000000000000000000000000000000000..624ea42810fa3fc586abc53f101af37e0f02d119 --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,25 @@ +site_name: DNS Shotgun +theme: + name: readthedocs + navigation_depth: 1 +nav: + - "Overview": index.md + - installation.md + - key-concepts.md + - "Input Data": + - capturing-traffic.md + - extracting-clients.md + - analyzing-clients.md + - "Replay": + - configuration-presets.md + - configuration-file.md + - replaying-traffic.md + - performance-tuning.md + - troubleshooting.md + - "Interpreting Results": + - raw-output.md + - response-rate-chart.md + - latency-histogram.md + - connection-chart.md +markdown_extensions: + - admonition