Current hardware
================

| Hostname | Status | MAIN Cluster Role | CPU | RAM | Note |
|--------------------------------|-----------------------|----------------------------|--------------|-------|----------|
| gondor-resolver.labs.nic.cz | :white_check_mark: | :warning: **submit**, exec | 4 @ 2.40GHz | 16 GB | hw, Brno |
| rivendell-resolver.labs.nic.cz | :white_check_mark: | :gear: exec | 4 @ 2.40GHz | 16 GB | hw, Brno |
... | ... | @@ -10,14 +10,36 @@ Current hardware |

General
-------

- log in as user `respdiff` (with your GitLab SSH key)
- machines are managed with Ansible: [knot-resolver-ansible](https://gitlab.labs.nic.cz/knot/knot-resolver-ansible)

Condor
------

- machines are part of a [*HTCondor cluster*](http://research.cs.wisc.edu/htcondor/)
- CI uses the `MAIN` cluster
- each machine's current cluster is shown in its MOTD
- *do not turn off condor* (or the machine) for the **submit** role (cluster functioning and GitLab CI depend on it)
- daily update/reboot happens at 2:30
- log in as user `respdiff` (with your GitLab SSH key)
- read MOTD for basic usage
- condor *can* be turned off for non-essential machines (all except the **submit** role), see below
- machines are managed with Ansible: [knot-resolver-ansible](https://gitlab.labs.nic.cz/knot/knot-resolver-ansible)
- a detached cluster can be created for other testing/development (see [knot-resolver-ansible](https://gitlab.labs.nic.cz/knot/knot-resolver-ansible))
- a few useful commands:

```
condor_q                        # on submit machine - display current queue
condor_status                   # list machines in cluster
condor_q -c 'ClusterId==42'     # list matching jobs; operators <=, <, >, >= also supported
condor_rm -c 'ClusterId==42'    # remove matching jobs - check with condor_q first
condor_rm -a                    # remove ALL jobs - for use in detached cluster, use caution
```
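
The advice to check with `condor_q` before running `condor_rm` can be wrapped in a small helper. The following is only a sketch: `preview_and_remove` is a hypothetical function name, not part of HTCondor or of the cluster tooling, and it uses the long `-constraint` option instead of the abbreviated `-c`.

```shell
#!/bin/sh
# Sketch: show which jobs a constraint matches, then ask before removing them.
# preview_and_remove is a hypothetical helper, not an HTCondor command.
preview_and_remove() {
    constraint=$1

    # List the matching jobs first (run this on the submit machine).
    condor_q -constraint "$constraint" || return 1

    printf 'Remove these jobs? [y/N] '
    read -r answer
    case $answer in
        y|Y) condor_rm -constraint "$constraint" ;;
        *)   echo 'Nothing removed.' ;;
    esac
}
```

For example, `preview_and_remove 'ClusterId==42'` lists the jobs of cluster 42 and removes them only after an explicit `y`.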

Automatic Events
----------------

- check current status in MOTD
- `autoupdate.timer` triggers a daily update/reboot at 2:30
- `autorespdiff.timer`
    - creates and updates reference data for current master
    - runs regularly on submit machine(s)
    - keeps adding jobs with `-p 0` (default priority is `5`) and updates reference afterwards
    - deletes reports older than 3 days from reference

Networking
----------

... | ... | @@ -48,7 +70,7 @@ Executing respdiff |
    3. save
    4. run schedule manually
3. *manual, directly from* **submit** *machine*
    - example in MOTD
    - basic example: `respdiff-job-submit $(respdiff-job-create 88e78c66)`
    - `respdiff-job-create --help`
    - `respdiff-job-submit --help`
    - works with knot-resolver-security as well

... | ... | @@ -57,13 +79,11 @@ Executing respdiff |

Using machines for other testing/development
--------------------------------------------

- `rohan` is currently not part of the cluster and can be used freely
- any machine except **submit** can be temporarily removed from the cluster and used for other workloads
- **HOWTO (temporarily remove machine from cluster)**:
    1. `condor_off`: removes the machine from the cluster once the current job finishes (~10 mins)
    2. wait until `condor_status` no longer lists the machine's hostname
    3. (optional) if you need the machine overnight, turn off autoupdate (reboots at 2:30): `systemctl stop autoupdate.timer`
    4. run your workload
    5. `systemctl reboot -i`
- any machine except **submit** can be temporarily removed from the MAIN cluster and used for other workloads
- machines in detached clusters can be used with condor turned on (when the queue is empty and `autorespdiff.timer` is inactive)
- **HOWTO (temporarily turn off condor for a machine)**:
    1. turn off condor and wait (~10m) until the current job finishes: `remove-from-cluster`
    2. (optional) if you need the machine overnight, turn off autoupdate (reboots at 2:30): `systemctl stop autoupdate.timer`
    3. run your workload
    4. `systemctl reboot -i`
- **NOTE**: a reboot returns the machine to the cluster (handled by `condor.service`)
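
The HOWTO steps for temporarily turning off condor could be chained in a small wrapper. This is a sketch under assumptions: `borrow_machine` and `run` are hypothetical helpers, `DRY_RUN` is an illustrative variable, and `remove-from-cluster` is assumed to block until the current job has finished.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN is set (illustrative).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# borrow_machine OVERNIGHT CMD [ARGS...]
# Hypothetical wrapper: OVERNIGHT is "yes" to also stop the 2:30 autoupdate.
borrow_machine() {
    if [ "$#" -lt 2 ]; then
        echo 'usage: borrow_machine yes|no CMD [ARGS...]' >&2
        return 2
    fi
    overnight=$1
    shift

    # Step 1: leave the cluster (assumed to wait for the running job).
    run remove-from-cluster || return 1

    # Step 2 (optional): keep the machine from rebooting overnight.
    if [ "$overnight" = yes ]; then
        run systemctl stop autoupdate.timer || return 1
    fi

    # Step 3: run the workload and remember its exit status.
    run "$@"
    status=$?

    # Step 4: reboot, which returns the machine to the cluster.
    run systemctl reboot -i
    return "$status"
}
```

Setting `DRY_RUN=1` before calling, for instance, `borrow_machine yes ./my-benchmark.sh` (a placeholder workload) prints the commands without executing them, which is a cheap way to check the sequence before taking a machine out of the cluster.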