NTP Debugging With Shared Memory Refclock
This is a preview version
General NTP Debugging Commands
The most important commands to be used to check the status of the NTP daemon, ntpd
, include:
ntpq -p
is the peers command which is used to print a table / billboard of the configured time sources, and the status of each time source. Each entry in the list can be addressed by an association ID (a handle) which is a random number that changes after each start of the daemon.
ntpq -c rv
is a short form ofntpq -c "rv 0"
which executes the readvars command for the special association ID 0. This refers to ntpd's system variables and status, including the stratum number and leap bits sent in reply to client requests.
ntpq -c "rv &1"
is the readvars command for a particular association. In this example,&1
refers to the association ID of the time source listed in the first line of thentpq -p
billboard,&2
would refer to the second line, if available, and so on. It prints more detailed status information for the particular time source, which can be a refclock or an upstream NTP server.
ntpq -c "cv &1"
executes the clockvars command for the specific time source,&1
in the example here. This command can only be used with associations that refer to a refclock, not with associations of upstream NTP servers, and the displayed information depends on the type of the refclock.
Some commands provide an output where the displayed time stamps may be in local time, according to the user's time zone settings, while other commands show timestamps always as UTC. Since NTP normally works with UTC only, it may be helpful to run the commands above with TZ=
prepended, which sets the local time zone for the following command to UTC.
Examining ntpd's Behavior In Different States
In the sections below we have a look at the status of the ntpd daemon which has been configured with the shared memory driver (driver 28) as single refclock. This setup requires an additional daemon which periodically feeds timestamps from a hardware clock into the shared memory driver, e.g. the mbgsvcd
daemon shipped with the Meinberg driver software for Linux.
Status Right After 'ntpd' Startup
If ntpd has just been (re-)started, but there's yet no process like mbgsvcd that feeds some data into the shared memory area, an entry for the SHM refclock is listed but the reach
column is 0
since no valid data has yet become available:
TZ= ntpq -p remote refid st t when poll reach delay offset jitter ============================================================================== SHM(0) .shm0. 0 l - 8 0 0.000 0.000 0.000
Let's have a look at ntpd's system variables:
TZ= ntpq -c 'rv 0' associd=0 status=c016 leap_alarm, sync_unspec, 1 event, restart, version=“ntpd 4.2.8p11@1.3790-o Wed Mar 7 12:03:29 UTC 2018 (3)”, processor=“x86_64”, system=“Linux/3.16.7-53-desktop”, leap=11, stratum=16, precision=-23, rootdelay=0.000, rootdisp=0.435, refid=INIT, reftime=00000000.00000000 Thu, Feb 7 2036 6:28:16.000, clock=de818267.c3d8bbe7 Wed, Apr 18 2018 8:37:59.765, peer=0, tc=3, mintc=3, offset=0.000000, frequency=0.000, sys_jitter=0.000000, clk_jitter=0.000, clk_wander=0.000, leapsec=201701010000, expire=201812280000
The output above shows leap=11
and stratum=16
which tells potential clients that this instance of ntpd has yet never been synchronized to any time source.
Also, peer=0
indicates there is yet no system peer, and the refid=INIT
tells ntpd is still initializing, but this is just informational.
We can examine the status of the only time source that has been configured:
TZ= ntpq -c 'rv &1' associd=31638 status=801b conf, sel_reject, 1 event, clock_event, srcadr=SHM(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=11, stratum=0, precision=-1, rootdelay=0.000, rootdisp=0.000, refid=shm0, reftime=00000000.00000000 Thu, Feb 7 2036 6:28:16.000, rec=00000000.00000000 Thu, Feb 7 2036 6:28:16.000, reach=000, unreach=0, hmode=3, pmode=4, hpoll=3, ppoll=3, headway=0, flash=1200 peer_stratum, peer_unreach, keyid=0, ttl=0, offset=0.000, delay=0.000, dispersion=15937.500, jitter=0.000, filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00, filtoffset= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00, filtdisp= 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
reftime=00000000.00000000
indicates that no timestamps have been received, yet, and the value 16000.0
for each of the 8 columns in the filter dispersion tells that yet there are no entries in the filter.
At last we can have a look ath the clockvars:
TZ= ntpq -c 'cv &1' associd=31638 status=0041 4 events, clk_no_reply, device=“SHM/Shared memory interface”, timecode=, poll=4, noreply=4, badformat=0, baddata=0, stratum=0, refid=shm0, flags=0
In the output above we see timecode=
, so no timestamp has been fed into the driver by another daemon.
Status Right After 'mbgsvcd' Has Also Been Started
A few seconds after mbgsvcd has been started and began feeding reference timestamps into ntpd's shared memory driver:
TZ= ntpq -p remote refid st t when poll reach delay offset jitter ============================================================================== *SHM(0) .shm0. 0 l 1 8 3 0.000 -0.140 0.082
The reach
column is now 3
, so the shared memory driver could be polled successfully twice, in 8 second intervals.
Also, there is an asterisk '*'
at the beginning of the SHM line, so the SHM clock has been selected as system peer.
The system status is now:
TZ= ntpq -c 'rv 0' associd=0 status=c418 leap_alarm, sync_uhf_radio, 1 event, no_sys_peer, version=“ntpd 4.2.8p11@1.3790-o Wed Mar 7 12:03:29 UTC 2018 (3)”, processor=“x86_64”, system=“Linux/3.16.7-53-desktop”, leap=11, stratum=1, precision=-23, rootdelay=0.000, rootdisp=3937.676, refid=shm0, reftime=de8182da.d6f533a0 Wed, Apr 18 2018 8:39:54.839, clock=de8182db.26ebc750 Wed, Apr 18 2018 8:39:55.152, peer=31638, tc=3, mintc=3, offset=0.000000, frequency=0.000, sys_jitter=0.000000, clk_jitter=0.053, clk_wander=0.000, leapsec=201701010000, expire=201812280000
The stratum
is already set to 1
, and refid=shm0
and peer=31638
are already refering to the SHM refclock,
but leap=11
still indicates that ntpd is not yet fully synchronized.
The status of the SHM refclock also reflects this:
TZ= ntpq -c 'rv &1' associd=31638 status=961a conf, reach, sel_sys.peer, 1 event, sys_peer, srcadr=SHM(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00, stratum=0, precision=-20, rootdelay=0.000, rootdisp=0.000, refid=shm0, reftime=de8182da.7ce2cab7 Wed, Apr 18 2018 8:39:54.487, rec=de8182da.d6f533a0 Wed, Apr 18 2018 8:39:54.839, reach=003, unreach=0, hmode=3, pmode=4, hpoll=3, ppoll=3, headway=0, flash=00 ok, keyid=0, ttl=0, offset=-0.140, delay=0.000, dispersion=3937.536, jitter=0.082, filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00, filtoffset= -0.14 -0.06 0.00 0.00 0.00 0.00 0.00 0.00, filtdisp= 0.01 0.13 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
Now the reftime=de8182da.7ce2cab7
indicates the last timestamp that has been fed into the driver, and 2 of the 8 filter columns don't have a dispersion 16000.0, and thus are valid entries, which matches the indicated status reach=003
.
Also the clockvars show the last recent timestamp fed into the SHM driver:
TZ= ntpq -c 'cv &1' associd=31638 status=00f1 15 events, clk_no_reply, device=“SHM/Shared memory interface”, timecode=“2018-04-18T08:39:54.487678766Z”, poll=19, noreply=16, badformat=0, baddata=0, stratum=0, refid=shm0, flags=0
Status After 'mbgsvcd' Has Been Running For A While
After ntpd has already been running for a while and timestamps are continuously fed into the SHM driver, the reach
value should be continuously at '377':
TZ= ntpq -p remote refid st t when poll reach delay offset jitter ============================================================================== *SHM(0) .shm0. 0 l 2 8 377 0.000 -0.986 0.305
Now the system status has also leap=00
, which means the ntpd instance claims to be fully synchronized, and will be accepted by its clients:
TZ= ntpq -c 'rv 0' associd=0 status=0415 leap_none, sync_uhf_radio, 1 event, clock_sync, version=“ntpd 4.2.8p11@1.3790-o Wed Mar 7 12:03:29 UTC 2018 (3)”, processor=“x86_64”, system=“Linux/3.16.7-53-desktop”, leap=00, stratum=1, precision=-23, rootdelay=0.000, rootdisp=1.300, refid=shm0, reftime=de818342.d6f5f044 Wed, Apr 18 2018 8:41:38.839, clock=de818344.f9a636be Wed, Apr 18 2018 8:41:40.975, peer=31638, tc=3, mintc=3, offset=0.000000, frequency=0.000, sys_jitter=0.000000, clk_jitter=0.668, clk_wander=0.000, leapsec=201701010000, expire=201812280000
Also the refclock status has a valid time, and all 8 filter entries are valid:
TZ= ntpq -c 'rv &1' associd=31638 status=961a conf, reach, sel_sys.peer, 1 event, sys_peer, srcadr=SHM(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00, stratum=0, precision=-20, rootdelay=0.000, rootdisp=0.000, refid=shm0, reftime=de818342.89648c02 Wed, Apr 18 2018 8:41:38.536, rec=de818342.d6f5f044 Wed, Apr 18 2018 8:41:38.839, reach=377, unreach=0, hmode=3, pmode=4, hpoll=3, ppoll=3, headway=0, flash=00 ok, keyid=0, ttl=0, offset=-0.986, delay=0.000, dispersion=0.124, jitter=0.305, filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00, filtoffset= -0.99 -0.91 -0.84 -0.77 -0.71 -0.64 -0.58 -0.52, filtdisp= 0.01 0.13 0.25 0.37 0.49 0.61 0.73 0.85
Accordingly, the clockvars show a current time for timecode=
:
TZ= ntpq -c 'cv &1' associd=31638 status=00f1 15 events, clk_no_reply, device=“SHM/Shared memory interface”, timecode=“2018-04-18T08:41:38.535676240Z”, poll=32, noreply=16, badformat=0, baddata=0, stratum=0, refid=shm0, flags=0
Status After 'mbgsvcd' Has Just Been Killed
When mbgsvcd or any other program that fed the shared memor driver has been killed then there are no more timestamps coming in from that refclock, and thus the reach
value starts to degrade since for each unsuccessful polling a '0' bit is shifted in from the right:
TZ= ntpq -p remote refid st t when poll reach delay offset jitter ============================================================================== *SHM(0) .shm0. 0 l 21 8 374 0.000 -0.133 0.594
The system status still shows full synchronizatuion:
TZ= ntpq -c 'rv 0' associd=0 status=0415 leap_none, sync_uhf_radio, 1 event, clock_sync, version=“ntpd 4.2.8p11@1.3790-o Wed Mar 7 12:03:29 UTC 2018 (3)”, processor=“x86_64”, system=“Linux/3.16.7-53-desktop”, leap=00, stratum=1, precision=-23, rootdelay=0.000, rootdisp=1.300, refid=shm0, reftime=de81844a.d6f49cfa Wed, Apr 18 2018 8:46:02.839, clock=de81845f.3cef9cb1 Wed, Apr 18 2018 8:46:23.238, peer=31638, tc=3, mintc=3, offset=-0.132818, frequency=-10.513, sys_jitter=0.000000, clk_jitter=1.096, clk_wander=0.508, tai=37, leapsec=201701010000, expire=201812280000
Also the filter entries are still available in the clock status, but the clock status also sows a degraded reach
value:
TZ= ntpq -c 'rv &1' associd=31638 status=961a conf, reach, sel_sys.peer, 1 event, sys_peer, srcadr=SHM(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00, stratum=0, precision=-20, rootdelay=0.000, rootdisp=0.000, refid=shm0, reftime=de818443.a8d2304e Wed, Apr 18 2018 8:45:55.659, rec=de81844a.d6f49cfa Wed, Apr 18 2018 8:46:02.839, reach=374, unreach=0, hmode=3, pmode=4, hpoll=3, ppoll=3, headway=0, flash=00 ok, keyid=0, ttl=0, offset=-0.133, delay=0.000, dispersion=0.121, jitter=0.594, filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00, filtoffset= -0.13 -0.16 -0.23 -0.33 -0.48 -0.65 -0.97 -1.29, filtdisp= 0.00 0.13 0.25 0.38 0.50 0.62 0.75 0.89
The clockvars show an increasing number for noreply
:
TZ= ntpq -c 'cv &1' associd=31638 status=00f1 15 events, clk_no_reply, device=“SHM/Shared memory interface”, timecode=“2018-04-18T08:45:55.659324407Z”, poll=67, noreply=18, badformat=0, baddata=0, stratum=0, refid=shm0, flags=0
Status After 'mbgsvcd' Has Been Dead For Some Time
If no other program has been feeding the SHM driver for some time the reach
value finally decreases to 0 again. There's no more asterisk '*'
at the beginning of the SHM line, so the SHM clock isn't the system peer anymore:
TZ= ntpq -p remote refid st t when poll reach delay offset jitter ============================================================================== SHM(0) .shm0. 0 l 157 8 0 0.000 -0.133 0.000
Also the system status says no_sys_peer
, and peer=0
to indicate that the system peer has become unreachable.
However, the refid
still matches the refid of the former system peer, the leap
value stays at 00
and is not reverted to 03
, and also the stratum
value stays at 1
and is not reverted to 16. Instead, the reftime
timestamp still reflects the time of last synchronization with the former system peer, and the rootdisp
value increases over time, so the clients which see the increasing root dispersion can see that this instance of ntpd has started freewheeling:
TZ= ntpq -c 'rv 0' associd=0 status=0018 leap_none, sync_unspec, 1 event, no_sys_peer, version=“ntpd 4.2.8p11@1.3790-o Wed Mar 7 12:03:29 UTC 2018 (3)”, processor=“x86_64”, system=“Linux/3.16.7-53-desktop”, leap=00, stratum=1, precision=-23, rootdelay=0.000, rootdisp=3.340, refid=shm0, reftime=de81844a.d6f49cfa Wed, Apr 18 2018 8:46:02.839, clock=de8184e7.ce4a8109 Wed, Apr 18 2018 8:48:39.805, peer=0, tc=3, mintc=3, offset=-0.132818, frequency=-10.513, sys_jitter=0.000000, clk_jitter=1.096, clk_wander=0.508, tai=37, leapsec=201701010000, expire=201812280000
The clock status shows that times when the SHM driver has been updated and successfully polled the last time. However, the reach
value displayed here is also 0, and the filter has now been cleared:
TZ= ntpq -c 'rv &1' associd=31638 status=8013 conf, sel_reject, 1 event, unreachable, srcadr=SHM(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00, stratum=0, precision=-20, rootdelay=0.000, rootdisp=0.000, refid=shm0, reftime=de818443.a8d2304e Wed, Apr 18 2018 8:45:55.659, rec=de81844a.d6f49cfa Wed, Apr 18 2018 8:46:02.839, reach=000, unreach=0, hmode=3, pmode=4, hpoll=3, ppoll=3, headway=0, flash=1000 peer_unreach, keyid=0, ttl=0, offset=-0.133, delay=0.000, dispersion=15937.500, jitter=0.000, filtdelay= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00, filtoffset= 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00, filtdisp= 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
The clockvars still show the timestamp according to the time the SHM driver has been updated the last time, and the noreply
value continues to increase:
TZ= ntpq -c 'cv &1' associd=31638 status=00f1 15 events, clk_no_reply, device=“SHM/Shared memory interface”, timecode=“2018-04-18T08:45:55.659324407Z”, poll=84, noreply=35, badformat=0, baddata=0, stratum=0, refid=shm0, flags=0
Please note that the fact that timecode=
now shows a valid timecode indicates that the SHM driver has received some data before, in contrast to the state right after startup, when the timecode
string was still empty.
— Martin Burnicki martin.burnicki@meinberg.de 2018-04-18