Fainted: Reinitializing the (Non-)Authoritative File Replication Service (FRS, NTFRS) Database on Windows 2000 Servers
I post this for people who have had the same experience so they do not panic like I did.
The Symptoms
Yesterday one of my Windows Domain Controllers became inaccessible. Users were not able to login to their workstations and I even couldn't log into the server using my admin credentials. The server would not recognize the admin account so I guessed Active Directory must have been down. I had to shutdown the box completely and impolitely using the reset button. After that I did the usual routine checks in order to verify that things were running smoothly. But things weren't at all ok. After the restart, the File Replication Service could not play back its JET database logs and it started an non-authoritative restore (although it didn't say that). In the event log I found this:
Log |
Time |
Event Source |
EVENTID |
TEXT |
TEXT (English translation, my DC runs a German version of Windows) |
System |
15:30:17 |
EventLog |
Das System wurde zuvor am 16.03.2005 um 15:23:57 unerwartet heruntergefahren. |
The previous system shutdown at 3:23:57 PM on 03/16/2005 was unexpected. |
|
File Replication Service |
15:31:20 |
NtFrs |
Der Dateireplikationsdienst wird gestartet. |
The File Replication Service is starting. |
|
Application (!) |
15:31:39 |
ESENT |
ntfrs (1040) Datenbank 6.00.3940.0025 startet. |
ntfrs (1040) The database engine 6.00.3940.0025 started. |
|
Application |
15:31:40 |
ESENT |
ntfrs (1040) Die Datenbank initiiert Schritte zur Wiederherstellung. |
ntfrs (1040) The database engine is initiating recovery steps. |
|
Application |
15:31:41 |
ESENT |
ntfrs (1040) Die Datenbank spielt die Protokolldatei c:\ winnt\ ntfrs\ jet\ log\ edb00072.log ab. |
ntfrs (1040) The database engine is replaying log file c:\ winnt\ ntfrs\ jet\ log\ edb00072.log. |
|
[dito for edb00073.log and edb.log] |
|||||
Application |
15:31:42 |
ESENT |
ntfrs (1040) Die Wiederherstellung der Datenbank ist mit dem Fehler -566 fehlgeschlagen. |
ntfrs (1040) Database recovery/restore failed with unexpected error -566. |
|
Application |
15:31:43 |
ESENT |
ntfrs (1040) Datenbank 6.00.3940.0025 startet. |
ntfrs (1040) The database engine 6.00.3940.0025 started. |
|
Application |
15:31:53 |
ESENT |
ntfrs (1040) Die Datenbank wird beendet. |
ntfrs (1040) The database engine stopped. |
|
Application |
15:31:53 |
ESENT |
ntfrs (1040) Datenbank 6.00.3940.0025 startet. |
ntfrs (1040) The database engine 6.00.3940.0025 started. |
|
File Replication Service |
15:32:03 |
NtFrs |
Der Dateireplikationsdienst hat die vorhandenen Dateien in
d:\ remoteinstall nach d:\
remoteinstall\ NtFrs_ PreExisting___ See_ EventLog verschoben. |
The File Replication Service moved the preexisting files in
d:\ remoteinstall to d:\
remoteinstall\ NtFrs_ PreExisting___ See_ EventLog. |
|
File Replication Service |
15:32:04 |
NtFrs |
Der Dateireplikationsdienst hat diesen Computer dem folgenden
Replikatsatz hinzugefügt: |
The File Replication Service successfully added this computer
to the following replica set: "ROOT|REMOTEINSTALL" |
|
File Replication Service |
15:32:04 |
NtFrs |
Der Dateireplikationsdienst hat dem Replikatsatz die folgenden
Verbindungen hinzugefügt: |
The File Replication Service successfully added the connections
shown below to the replica set: "ROOT|REMOTEINSTALL" |
|
[ the above three entries occur for every replica on the server except SYSVOL] |
|||||
File Replication Service |
15:32:17 |
NtFrs |
Der Dateireplikationsdienst initialisiert das Systemvolume
mit Daten eines anderen Domänencontrollers. Der Computer OREN kann kein Domänencontroller
werden, solange dieser Prozess nicht abgeschlossen ist. Das Systemvolume
wird anschließend als SYSVOL freigegeben. |
File Replication Service is initializing the system volume
with data from another domain controller. computer OREN cannot become a domain
controller until this process is complete. The system volume will then be
shared as SYSVOL. |
|
[repeat the triad of 13520, 13553 and 13554 for SYSVOL] |
|||||
Application |
15:32:20 |
ESENT |
grovel (936) Die Datenbank spielt die Protokolldatei d:\ SIS Common Store\ edb.log ab. |
grovel (936) The database engine is replaying log file d:\ SIS Common Store\ edb.log. |
|
Application |
15:32:21 |
ESENT |
grovel (936) Die Datenbank hat erfolgreich die Schritte zur Wiederherstellung abgeschlossen. |
grovel (936) The database engine has successfully completed recovery steps. |
I checked the replica directories and with horror noticed that all files were
gone from all replicas. PANIC! Well, they weren't completely gone but they had
been moved into a subdirectory called NtFrs_PreExisting___See_EventLog
.
Similar havoc occurred when I upgraded one Windows 2000 domain controller
from SP2 to SP3 while the others still had SP2. During that incident I learned
about NTFRS, its JET Database and, most importantly, the reinitialization procedures
involving BurFlags
=D2
and =D4
.
Had I not known about these things, I would have panicked even more. Next thing
I did was shutdown the NTFRS service, which later turned out to be a mistake. I
needed to fix the problem quickly. When NTFRS isn't running, SYSVOL isn't replicated,
Active Directory is disabled and the server can not act as a domain controller
for its site.
A First Diagnosis
So what had happened? I'm not entirely sure, but this is what it looks like: NTFRS keeps track of replica changes in its own database. It's a JET database which is also known as ESENT (Extensible Storage Engine for NT). Any ESENT database tracks modifications in a journal. From the event log it becomes obvious that ESENT wasn't able to replay the journal and caused NTFRS to start from scratch. What exactly "from scratch" meant was not clear to me at that point. Would NTFRS have to replicate the whole 2G of data from another domain controller? Had that been the case, it would've taken several days or even weeks considering that the domain controllers are connected by a 64k line. Furthermore, it could have incurred a long downtime for that DC depending on the priority that is given to SYSVOL replication during that process.
Microsoft knowledge base article "FRS:
Using the BurFlags Registry Key to Reinitialize File Replication Service Replica
Sets" mentions two modes: non-authoritative (D2) and authoritative (D4).
In either mode, the existing files on the replicas are moved to a "NtFrs_PreExisting...
" folder
and replication with any upstream partner starts from scratch. But the articles
did not point out that a D2 does not transfer the files over the link. Instead,
D2 compares the file versions between the local and the upstream replica. The
files for which the upstream replica has the same version are moved back from "NtFrs_PreExisting...
" into
their original place. Only files that were modified on the upstream partner are
downloaded to the local replica. Files that are newer on the local replica or
that don't exist on the upstream replica are left in "NtFrs_PreExisting...". This
process is at least an order of magnitude faster than a full replication, which
is what an authoritative restore (D4) must be.
Side note: Something always bugs me when reading knowledge base articles. They explain to you in great detail how to modify a registry value ("Click Start, click execute, type cmd, type regedit, click on HKEY_LOCAL_MACHINE ...") despite the obvious fact that a person skillful enough to deal with NTFRS issues should be able to modify a registry key without a detailed explanation of where to click and what to type.
The Wrong Treatment
It seems that when I stopped NTFRS, the non-authoritative restore had already
begun so by stopping the service I had basically screwed things up even more.
But there is an easy way to manually trigger a non-authoritative restore. Following
the procedure described in the aforementioned KB
article, I had to stop the NTFRS service, set the global BurFlags
registry
value to hex D2
and restart NTFRS. That's what I did. NTFRS was smart
enough to restore SYSVOL first and once that was done a couple of minutes later,
the server could act as domain controller again.
The Root Cause
Why did the server crash in the first place? The short answer is: I don't know. It could have been a journal wrap. It could have been something else. As I said earlier, a journal is a way to ensure a complex on-disk data structure's consistency in situations were non-atomic writes might be interrupted by a software or hardware failure. Theoretically, journals of file systems and relational databases should contain at most several seconds of pending transactions. I may be wrong about this because it seems that ESENT journals can become quite big. In Service Pack 4 of Windows 2000, the NTFRS database's journal area is set to 512M. Why so much? I don't know. How could even 512M be insufficient sometimes? It beats me, I'll give you my best guess. Each time a file on a replica is modified, an entry is added to the journal. Once that modification is propagated to the other replicas, the corresponding journal entry is flushed. Microsoft says that the journal can wrap if too many modifications happen while the NTFRS service is not running. Microsoft also says that in the case of a journal wrap, the replication ignores files (probably the files that were in the journal when it wrapped).
In my case, two of the replicated shares in my domain are scheduled to replicate only during off-peak hours; at night and during the weekends, that is. As a matter of fact, just before the server crashed, I was working with files on one of those replicas. There must be a connection. My last action on that share was extracting an self-extracting hotfix package for Windows 2000 by running it with the /X switch. After that it seemed that nothing more would happen, so I assumed the self-extractor had just terminated. Then one of my users told me that they were not able to log in anymore. You know what happened next. When I returned to my workstation, it showed a message box saying "Extraction failed" or something similar. Maybe the self-extracting hotfix package did something unusual to the file system on the replicate which cause NTFRS to go berserk. Should I try to reproduce the problem? I don't think so.
The Right Treatment
What should I have done? I should have read the event log message 13520 more carefully. I should not have stopped NTFRS because it had already started the non-authoritative restore. I learned another lesson that is not documented anywhere - it's abut interrupting the restore and how to restart it properly.
For The Impatient Patient
Try not to interrupt restoration process. If you absolutely have shut NTFRS down or reboot the server, make sure you follow this procedure:
- Shut NTFRS down.
- Move all files and directories from the
NtFrs_PreExisting___See_EventLog
back to their original place and remove the emptyNtFrs_PreExisting___See_EventLog
folder. - Set the global
BurFlags
registry value toD2
. Alternatively, you can decrease downtime by checking whether the restoration process has already finished for some of your replicated shares. IfNtFrs_PreExisting___See_EventLog
is empty on some of the replicated shares, it means that the restoration has already finished for that particular share and it does not necessarily be restarted again. Double check using the event log. It might be that the restore hasn't even started yet, in which caseNtFrs_PreExisting___See_EventLog
might be empty or not exist at all! You can trigger replication for individual replicas by setting the replica-specificBurFlags
registry value toD2
. If unsure, use the globalBurFlags
. - Reboot or restart NTFRS.
The reason for step 3 is that BurFlags is reset to 0 immediately after NTFRS starts. You need to set it back to D2 every time you intend to start it with an (non-)authoritative restore.
The reason for moving the files back from NtFrs_PreExisting___See_EventLog
in
step 2 is that when the restore starts, the contents of NtFrs_PreExisting___See_EventLog
will
be deleted! It doesn't make sense but that's what happens. If you skip step 2,
you will lose everything that wasn't yet restored when you interrupted the
non-authoritative restore.