NTP Debugging With Shared Memory Refclock

This is a preview version

The most important commands to be used to check the status of the NTP daemon, ntpd, include:

ntpq -p is the peers command which is used to print a table / billboard of the configured time sources, and the status of each time source. Each entry in the list can be addressed by an association ID (a handle) which is a random number that changes after each start of the daemon.

ntpq -c rv is a short form of ntpq -c "rv 0" which executes the readvars command for the special association ID 0. This refers to ntpd's system variables and status, including the stratum number and leap bits sent in reply to client requests.

ntpq -c "rv &1" is the readvars command for a particular association. In this example, &1 refers to the association ID of the time source listed in the first line of the ntpq -p billboard, &2 would refer to the second line, if available, and so on. It prints more detailed status information for the particular time source, which can be a refclock or an upstream NTP server.

ntpq -c "cv &1" executes the clockvars command for the specific time source, &1 in the example here. This command can only be used with associations that refer to a refclock, not with associations of upstream NTP servers, and the displayed information depends on the type of the refclock.

Some commands provide an output where the displayed time stamps may be in local time, according to the user's time zone settings, while other commands show timestamps always as UTC. Since NTP normally works with UTC only, it may be helpful to run the commands above with TZ= prepended, which sets the local time zone for the following command to UTC.

In the sections below we have a look at the status of the ntpd daemon which has been configured with the shared memory driver (driver 28) as single refclock. This setup requires an additional daemon which periodically feeds timestamps from a hardware clock into the shared memory driver, e.g. the mbgsvcd daemon shipped with the Meinberg driver software for Linux.

If ntpd has just been (re-)started, but there's yet no process like mbgsvcd that feeds some data into the shared memory area, an entry for the SHM refclock is listed but the reach column is 0 since no valid data has yet become available:

TZ= ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 SHM(0)          .shm0.           0 l    -    8    0    0.000    0.000   0.000

Let's have a look at ntpd's system variables:

TZ= ntpq -c 'rv 0'
associd=0 status=c016 leap_alarm, sync_unspec, 1 event, restart,
version=“ntpd 4.2.8p11@1.3790-o Wed Mar  7 12:03:29 UTC 2018 (3)”,
processor=“x86_64”, system=“Linux/3.16.7-53-desktop”, leap=11,
stratum=16, precision=-23, rootdelay=0.000, rootdisp=0.435, refid=INIT,
reftime=00000000.00000000  Thu, Feb  7 2036  6:28:16.000,
clock=de818267.c3d8bbe7  Wed, Apr 18 2018  8:37:59.765, peer=0, tc=3,
mintc=3, offset=0.000000, frequency=0.000, sys_jitter=0.000000,
clk_jitter=0.000, clk_wander=0.000, leapsec=201701010000,
expire=201812280000

The output above shows leap=11 and stratum=16 which tells potential clients that this instance of ntpd has yet never been synchronized to any time source. Also, peer=0 indicates there is yet no system peer, and the refid=INIT tells ntpd is still initializing, but this is just informational.

We can examine the status of the only time source that has been configured:

TZ= ntpq -c 'rv &1'
associd=31638 status=801b conf, sel_reject, 1 event, clock_event,
srcadr=SHM(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=11,
stratum=0, precision=-1, rootdelay=0.000, rootdisp=0.000, refid=shm0,
reftime=00000000.00000000  Thu, Feb  7 2036  6:28:16.000,
rec=00000000.00000000  Thu, Feb  7 2036  6:28:16.000, reach=000,
unreach=0, hmode=3, pmode=4, hpoll=3, ppoll=3, headway=0,
flash=1200 peer_stratum, peer_unreach, keyid=0, ttl=0, offset=0.000,
delay=0.000, dispersion=15937.500, jitter=0.000,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtdisp=   16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0

reftime=00000000.00000000 indicates that no timestamps have been received, yet, and the value 16000.0 for each of the 8 columns in the filter dispersion tells that yet there are no entries in the filter.

At last we can have a look ath the clockvars:

TZ= ntpq -c 'cv &1'
associd=31638 status=0041 4 events, clk_no_reply,
device=“SHM/Shared memory interface”, timecode=, poll=4, noreply=4,
badformat=0, baddata=0, stratum=0, refid=shm0, flags=0

In the output above we see timecode=, so no timestamp has been fed into the driver by another daemon.

A few seconds after mbgsvcd has been started and began feeding reference timestamps into ntpd's shared memory driver:

TZ= ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*SHM(0)          .shm0.           0 l    1    8    3    0.000   -0.140   0.082

The reach column is now 3, so the shared memory driver could be polled successfully twice, in 8 second intervals. Also, there is an asterisk '*' at the beginning of the SHM line, so the SHM clock has been selected as system peer.

The system status is now:

TZ= ntpq -c 'rv 0'
associd=0 status=c418 leap_alarm, sync_uhf_radio, 1 event, no_sys_peer,
version=“ntpd 4.2.8p11@1.3790-o Wed Mar  7 12:03:29 UTC 2018 (3)”,
processor=“x86_64”, system=“Linux/3.16.7-53-desktop”, leap=11, stratum=1,
precision=-23, rootdelay=0.000, rootdisp=3937.676, refid=shm0,
reftime=de8182da.d6f533a0  Wed, Apr 18 2018  8:39:54.839,
clock=de8182db.26ebc750  Wed, Apr 18 2018  8:39:55.152, peer=31638, tc=3,
mintc=3, offset=0.000000, frequency=0.000, sys_jitter=0.000000,
clk_jitter=0.053, clk_wander=0.000, leapsec=201701010000,
expire=201812280000

The stratum is already set to 1, and refid=shm0 and peer=31638 are already refering to the SHM refclock, but leap=11 still indicates that ntpd is not yet fully synchronized.

The status of the SHM refclock also reflects this:

TZ= ntpq -c 'rv &1'
associd=31638 status=961a conf, reach, sel_sys.peer, 1 event, sys_peer,
srcadr=SHM(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
stratum=0, precision=-20, rootdelay=0.000, rootdisp=0.000, refid=shm0,
reftime=de8182da.7ce2cab7  Wed, Apr 18 2018  8:39:54.487,
rec=de8182da.d6f533a0  Wed, Apr 18 2018  8:39:54.839, reach=003,
unreach=0, hmode=3, pmode=4, hpoll=3, ppoll=3, headway=0, flash=00 ok,
keyid=0, ttl=0, offset=-0.140, delay=0.000, dispersion=3937.536,
jitter=0.082,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=   -0.14   -0.06    0.00    0.00    0.00    0.00    0.00    0.00,
filtdisp=      0.01    0.13 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0

Now the reftime=de8182da.7ce2cab7 indicates the last timestamp that has been fed into the driver, and 2 of the 8 filter columns don't have a dispersion 16000.0, and thus are valid entries, which matches the indicated status reach=003.

Also the clockvars show the last recent timestamp fed into the SHM driver:

TZ= ntpq -c 'cv &1'
associd=31638 status=00f1 15 events, clk_no_reply,
device=“SHM/Shared memory interface”,
timecode=“2018-04-18T08:39:54.487678766Z”, poll=19, noreply=16,
badformat=0, baddata=0, stratum=0, refid=shm0, flags=0

After ntpd has already been running for a while and timestamps are continuously fed into the SHM driver, the reach value should be continuously at '377':

TZ= ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*SHM(0)          .shm0.           0 l    2    8  377    0.000   -0.986   0.305

Now the system status has also leap=00, which means the ntpd instance claims to be fully synchronized, and will be accepted by its clients:

TZ= ntpq -c 'rv 0'
associd=0 status=0415 leap_none, sync_uhf_radio, 1 event, clock_sync,
version=“ntpd 4.2.8p11@1.3790-o Wed Mar  7 12:03:29 UTC 2018 (3)”,
processor=“x86_64”, system=“Linux/3.16.7-53-desktop”, leap=00, stratum=1,
precision=-23, rootdelay=0.000, rootdisp=1.300, refid=shm0,
reftime=de818342.d6f5f044  Wed, Apr 18 2018  8:41:38.839,
clock=de818344.f9a636be  Wed, Apr 18 2018  8:41:40.975, peer=31638, tc=3,
mintc=3, offset=0.000000, frequency=0.000, sys_jitter=0.000000,
clk_jitter=0.668, clk_wander=0.000, leapsec=201701010000,
expire=201812280000

Also the refclock status has a valid time, and all 8 filter entries are valid:

TZ= ntpq -c 'rv &1'
associd=31638 status=961a conf, reach, sel_sys.peer, 1 event, sys_peer,
srcadr=SHM(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
stratum=0, precision=-20, rootdelay=0.000, rootdisp=0.000, refid=shm0,
reftime=de818342.89648c02  Wed, Apr 18 2018  8:41:38.536,
rec=de818342.d6f5f044  Wed, Apr 18 2018  8:41:38.839, reach=377,
unreach=0, hmode=3, pmode=4, hpoll=3, ppoll=3, headway=0, flash=00 ok,
keyid=0, ttl=0, offset=-0.986, delay=0.000, dispersion=0.124,
jitter=0.305,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=   -0.99   -0.91   -0.84   -0.77   -0.71   -0.64   -0.58   -0.52,
filtdisp=      0.01    0.13    0.25    0.37    0.49    0.61    0.73    0.85

Accordingly, the clockvars show a current time for timecode=:

TZ= ntpq -c 'cv &1'
associd=31638 status=00f1 15 events, clk_no_reply,
device=“SHM/Shared memory interface”,
timecode=“2018-04-18T08:41:38.535676240Z”, poll=32, noreply=16,
badformat=0, baddata=0, stratum=0, refid=shm0, flags=0

When mbgsvcd or any other program that fed the shared memor driver has been killed then there are no more timestamps coming in from that refclock, and thus the reach value starts to degrade since for each unsuccessful polling a '0' bit is shifted in from the right:

TZ= ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*SHM(0)          .shm0.           0 l   21    8  374    0.000   -0.133   0.594

The system status still shows full synchronizatuion:

TZ= ntpq -c 'rv 0'
associd=0 status=0415 leap_none, sync_uhf_radio, 1 event, clock_sync,
version=“ntpd 4.2.8p11@1.3790-o Wed Mar  7 12:03:29 UTC 2018 (3)”,
processor=“x86_64”, system=“Linux/3.16.7-53-desktop”, leap=00, stratum=1,
precision=-23, rootdelay=0.000, rootdisp=1.300, refid=shm0,
reftime=de81844a.d6f49cfa  Wed, Apr 18 2018  8:46:02.839,
clock=de81845f.3cef9cb1  Wed, Apr 18 2018  8:46:23.238, peer=31638, tc=3,
mintc=3, offset=-0.132818, frequency=-10.513, sys_jitter=0.000000,
clk_jitter=1.096, clk_wander=0.508, tai=37, leapsec=201701010000,
expire=201812280000

Also the filter entries are still available in the clock status, but the clock status also sows a degraded reach value:

TZ= ntpq -c 'rv &1'
associd=31638 status=961a conf, reach, sel_sys.peer, 1 event, sys_peer,
srcadr=SHM(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
stratum=0, precision=-20, rootdelay=0.000, rootdisp=0.000, refid=shm0,
reftime=de818443.a8d2304e  Wed, Apr 18 2018  8:45:55.659,
rec=de81844a.d6f49cfa  Wed, Apr 18 2018  8:46:02.839, reach=374,
unreach=0, hmode=3, pmode=4, hpoll=3, ppoll=3, headway=0, flash=00 ok,
keyid=0, ttl=0, offset=-0.133, delay=0.000, dispersion=0.121,
jitter=0.594,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=   -0.13   -0.16   -0.23   -0.33   -0.48   -0.65   -0.97   -1.29,
filtdisp=      0.00    0.13    0.25    0.38    0.50    0.62    0.75    0.89

The clockvars show an increasing number for noreply:

TZ= ntpq -c 'cv &1'
associd=31638 status=00f1 15 events, clk_no_reply,
device=“SHM/Shared memory interface”,
timecode=“2018-04-18T08:45:55.659324407Z”, poll=67, noreply=18,
badformat=0, baddata=0, stratum=0, refid=shm0, flags=0

If no other program has been feeding the SHM driver for some time the reach value finally decreases to 0 again. There's no more asterisk '*' at the beginning of the SHM line, so the SHM clock isn't the system peer anymore:

TZ= ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 SHM(0)          .shm0.           0 l  157    8    0    0.000   -0.133   0.000

Also the system status says no_sys_peer, and peer=0 to indicate that the system peer has become unreachable. However, the refid still matches the refid of the former system peer, the leap value stays at 00 and is not reverted to 03, and also the stratum value stays at 1 and is not reverted to 16. Instead, the reftime timestamp still reflects the time of last synchronization with the former system peer, and the rootdisp value increases over time, so the clients which see the increasing root dispersion can see that this instance of ntpd has started freewheeling:

TZ= ntpq -c 'rv 0'
associd=0 status=0018 leap_none, sync_unspec, 1 event, no_sys_peer,
version=“ntpd 4.2.8p11@1.3790-o Wed Mar  7 12:03:29 UTC 2018 (3)”,
processor=“x86_64”, system=“Linux/3.16.7-53-desktop”, leap=00, stratum=1,
precision=-23, rootdelay=0.000, rootdisp=3.340, refid=shm0,
reftime=de81844a.d6f49cfa  Wed, Apr 18 2018  8:46:02.839,
clock=de8184e7.ce4a8109  Wed, Apr 18 2018  8:48:39.805, peer=0, tc=3,
mintc=3, offset=-0.132818, frequency=-10.513, sys_jitter=0.000000,
clk_jitter=1.096, clk_wander=0.508, tai=37, leapsec=201701010000,
expire=201812280000

The clock status shows that times when the SHM driver has been updated and successfully polled the last time. However, the reach value displayed here is also 0, and the filter has now been cleared:

TZ= ntpq -c 'rv &1'
associd=31638 status=8013 conf, sel_reject, 1 event, unreachable,
srcadr=SHM(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
stratum=0, precision=-20, rootdelay=0.000, rootdisp=0.000, refid=shm0,
reftime=de818443.a8d2304e  Wed, Apr 18 2018  8:45:55.659,
rec=de81844a.d6f49cfa  Wed, Apr 18 2018  8:46:02.839, reach=000,
unreach=0, hmode=3, pmode=4, hpoll=3, ppoll=3, headway=0,
flash=1000 peer_unreach, keyid=0, ttl=0, offset=-0.133, delay=0.000,
dispersion=15937.500, jitter=0.000,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtdisp=   16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0

The clockvars still show the timestamp according to the time the SHM driver has been updated the last time, and the noreply value continues to increase:

TZ= ntpq -c 'cv &1'
associd=31638 status=00f1 15 events, clk_no_reply,
device=“SHM/Shared memory interface”,
timecode=“2018-04-18T08:45:55.659324407Z”, poll=84, noreply=35,
badformat=0, baddata=0, stratum=0, refid=shm0, flags=0

Please note that the fact that timecode= now shows a valid timecode indicates that the SHM driver has received some data before, in contrast to the state right after startup, when the timecode string was still empty.

— Martin Burnicki martin.burnicki@meinberg.de 2018-04-18

NTP Debugging With Shared Memory Refclock

General NTP Debugging Commands

Examining ntpd's Behavior In Different States

Status Right After 'ntpd' Startup

Status Right After 'mbgsvcd' Has Also Been Started

Status After 'mbgsvcd' Has Been Running For A While

Status After 'mbgsvcd' Has Just Been Killed

Status After 'mbgsvcd' Has Been Dead For Some Time