Fainted: Reinitializing the (Non-)Authoritative File Replication Service (FRS, NTFRS) Database on Windows 2000 Servers

Submitted by Hannes Schmidt on Thu, 03/17/2005 - 16:11.

I post this for people who have had the same experience so they do not panic like I did.

The Symptoms

Yesterday one of my Windows Domain Controllers became inaccessible. Users were not able to login to their workstations and I even couldn't log into the server using my admin credentials. The server would not recognize the admin account so I guessed Active Directory must have been down. I had to shutdown the box completely and impolitely using the reset button. After that I did the usual routine checks in order to verify that things were running smoothly. But things weren't at all ok. After the restart, the File Replication Service could not play back its JET database logs and it started an non-authoritative restore (although it didn't say that). In the event log I found this:

Log

Time

Event Source

EVENTID

TEXT

TEXT (English translation, my DC runs a German version of Windows)

System

15:30:17

EventLog

6008

Das System wurde zuvor am 16.03.2005 um 15:23:57 unerwartet heruntergefahren.

The previous system shutdown at 3:23:57 PM on 03/16/2005 was unexpected.

File Replication Service

15:31:20

NtFrs

13501

Der Dateireplikationsdienst wird gestartet.

The File Replication Service is starting.

Application (!)

15:31:39

ESENT

100

ntfrs (1040) Datenbank 6.00.3940.0025 startet.

ntfrs (1040) The database engine 6.00.3940.0025 started.

Application

15:31:40

ESENT

300

ntfrs (1040) Die Datenbank initiiert Schritte zur Wiederherstellung.

ntfrs (1040) The database engine is initiating recovery steps.

Application

15:31:41

ESENT

301

ntfrs (1040) Die Datenbank spielt die Protokolldatei c:\ winnt\ ntfrs\ jet\ log\ edb00072.log ab.

ntfrs (1040) The database engine is replaying log file c:\ winnt\ ntfrs\ jet\ log\ edb00072.log.

[dito for edb00073.log and edb.log]

Application

15:31:42

ESENT

454

ntfrs (1040) Die Wiederherstellung der Datenbank ist mit dem Fehler -566 fehlgeschlagen.

ntfrs (1040) Database recovery/restore failed with unexpected error -566.

Application

15:31:43

ESENT

100

ntfrs (1040) Datenbank 6.00.3940.0025 startet.

ntfrs (1040) The database engine 6.00.3940.0025 started.

Application

15:31:53

ESENT

101

ntfrs (1040) Die Datenbank wird beendet.

ntfrs (1040) The database engine stopped.

Application

15:31:53

ESENT

100

ntfrs (1040) Datenbank 6.00.3940.0025 startet.

ntfrs (1040) The database engine 6.00.3940.0025 started.

File Replication Service

15:32:03

NtFrs

13520

Der Dateireplikationsdienst hat die vorhandenen Dateien in d:\ remoteinstall nach d:\ remoteinstall\ NtFrs_ PreExisting___ See_ EventLog verschoben.

Der Dateireplikationsdienst kann die Dateien in d:\ remoteinstall\ NtFrs_ PreExisting___ See_EventLog jederzeit löschen. Sie können die Dateien vor dem Löschen sichern, indem Sie sie von d:\ remoteinstall\ NtFrs_ PreExisting___ See_ EventLog kopieren. Das Kopieren der Dateien nach d:\ remoteinstall kann zu Namenskonflikten führen, wenn die Dateien auf einem anderen replizierenden Computer bereits vorhanden sind.

In manchen Fällen kopiert der Dateireplikationsdienst eine Datei von d:\ remoteinstall\ NtFrs_ PreExisting___ See_EventLog nach d:\ remoteinstall, anstatt die Datei von einem anderen Computer zu replizieren.

Sie können jederzeit Speicher freisetzen, indem Sie die Dateien in d:\ remoteinstall\ NtFrs_ PreExisting___ See_ EventLog löschen.

The File Replication Service moved the preexisting files in d:\ remoteinstall to d:\ remoteinstall\ NtFrs_ PreExisting___ See_ EventLog.

The File Replication Service may delete the files in d:\ remoteinstall\ NtFrs_ PreExisting___ See_ EventLog at any time. Files can be saved from deletion by copying them out of d:\ remoteinstall\ NtFrs_ PreExisting___ See_ EventLog. Copying the files into d:\ remoteinstall may lead to name conflicts if the files already exist on some other replicating partner.

In some cases, the File Replication Service may copy a file from d:\ remoteinstall\ NtFrs_ PreExisting___ See_ EventLog into d:\ remoteinstall instead of replicating the file from some other replicating partner.

Space can be recovered at any time by deleting the files in d:\ remoteinstall\ NtFrs_ PreExisting___ See_ EventLog.

File Replication Service

15:32:04

NtFrs

13553

Der Dateireplikationsdienst hat diesen Computer dem folgenden Replikatsatz hinzugefügt:
"ROOT|REMOTEINSTALL"

Ereignisinformationen:
DNS-Name des Computers: "oren.xxx.yyy.de"
Mitgliedname des Replikatsatzes: "{ABA4FE98- 8429- 45F8- 9EA1- 0AC9F3FAC45A}"
Stammpfad des Replikatsatzes: "d:\ remoteinstall"
Replikatstagingverzeichnis: "e:\ frs-staging"
Pfad des Replikat arbeits verzeichnisses: "c:\ winnt\ ntfrs\ jet"

The File Replication Service successfully added this computer to the following replica set: "ROOT|REMOTEINSTALL"

Information related to this event is shown below: Computer DNS name is "oren.xxx.yyy.de" Replica set member name is "{ABA4FE98- 8429- 45F8- 9EA1- 0AC9F3FAC45A}" Replica set root path is "d:\ remoteinstall" Replica staging directory path is "e:\ frs-staging" Replica working directory path is "c:\ winnt\ ntfrs\ jet" 

File Replication Service

15:32:04

NtFrs

13554

Der Dateireplikationsdienst hat dem Replikatsatz die folgenden Verbindungen hinzugefügt:
"ROOT|REMOTEINSTALL"

Ausgehend nach "oleg.xxx.yyy.de"
Eingehend von "oleg.xxx.yyy.de"

The File Replication Service successfully added the connections shown below to the replica set: "ROOT|REMOTEINSTALL"

Outbound to "oleg.xxx.yyy.de"
Inbound from "oleg.xxx.yyy.de"

[ the above three entries occur for every replica on the server except SYSVOL]

File Replication Service

15:32:17

NtFrs

13565

Der Dateireplikationsdienst initialisiert das Systemvolume mit Daten eines anderen Domänencontrollers. Der Computer OREN kann kein Domänencontroller werden, solange dieser Prozess nicht abgeschlossen ist. Das Systemvolume wird anschließend als SYSVOL freigegeben.

Geben Sie in der Eingabeaufforderung Folgendes ein, um die SYSVOL-Freigabe zu überprüfen:
net share

Die SYSVOL-Freigabe erscheint, wenn der Dateireplikationsdienst den Initialisierungsprozess abgeschlossen hat.

Die Initialisierung des Systemvolumes kann einige Zeit in Anspruch nehmen. Der Zeitaufwand ist von der Datenmenge im Systemvolume, der Verfügbarkeit anderer Domänencontroller, und dem Replikationsintervall zwischen anderen Domänencontrollern abhängig.

File Replication Service is initializing the system volume with data from another domain controller. computer OREN cannot become a domain controller until this process is complete. The system volume will then be shared as SYSVOL.

To check for the SYSVOL share, at the command prompt, type:
net share

When File Replication Service completes the initialization process, the SYSVOL share will appear.

The initialization of the system volume can take some time. The time is dependent on the amount of data in the system volume, the availability of other domain controllers, and the replication interval between domain controllers. 

[repeat the triad of 13520, 13553 and 13554 for SYSVOL]

Application

15:32:20

ESENT

301

grovel (936) Die Datenbank spielt die Protokolldatei d:\ SIS Common Store\ edb.log ab.

grovel (936) The database engine is replaying log file d:\ SIS Common Store\ edb.log.

Application

15:32:21

ESENT

302

grovel (936) Die Datenbank hat erfolgreich die Schritte zur Wiederherstellung abgeschlossen.

grovel (936) The database engine has successfully completed recovery steps.

I checked the replica directories and with horror noticed that all files were gone from all replicas. PANIC! Well, they weren't completely gone but they had been moved into a subdirectory called NtFrs_PreExisting___See_EventLog. Similar havoc occurred when I upgraded one Windows 2000 domain controller from SP2 to SP3 while the others still had SP2. During that incident I learned about NTFRS, its JET Database and, most importantly, the reinitialization procedures involving BurFlags=D2 and =D4. Had I not known about these things, I would have panicked even more. Next thing I did was shutdown the NTFRS service, which later turned out to be a mistake. I needed to fix the problem quickly. When NTFRS isn't running, SYSVOL isn't replicated, Active Directory is disabled and the server can not act as a domain controller for its site.

A First Diagnosis

So what had happened? I'm not entirely sure, but this is what it looks like: NTFRS keeps track of replica changes in its own database. It's a JET database which is also known as ESENT (Extensible Storage Engine for NT). Any ESENT database tracks modifications in a journal. From the event log it becomes obvious that ESENT wasn't able to replay the journal and caused NTFRS to start from scratch. What exactly "from scratch" meant was not clear to me at that point. Would NTFRS have to replicate the whole 2G of data from another domain controller? Had that been the case, it would've taken several days or even weeks considering that the domain controllers are connected by a 64k line. Furthermore, it could have incurred a long downtime for that DC depending on the priority that is given to SYSVOL replication during that process.

Microsoft knowledge base article "FRS: Using the BurFlags Registry Key to Reinitialize File Replication Service Replica Sets" mentions two modes: non-authoritative (D2) and authoritative (D4). In either mode, the existing files on the replicas are moved to a "NtFrs_PreExisting..." folder and replication with any upstream partner starts from scratch. But the articles did not point out that a D2 does not transfer the files over the link. Instead, D2 compares the file versions between the local and the upstream replica. The files for which the upstream replica has the same version are moved back from "NtFrs_PreExisting..." into their original place. Only files that were modified on the upstream partner are downloaded to the local replica. Files that are newer on the local replica or that don't exist on the upstream replica are left in "NtFrs_PreExisting...". This process is at least an order of magnitude faster than a full replication, which is what an authoritative restore (D4) must be.

Side note: Something always bugs me when reading knowledge base articles. They explain to you in great detail how to modify a registry value ("Click Start, click execute, type cmd, type regedit, click on HKEY_LOCAL_MACHINE ...") despite the obvious fact that a person skillful enough to deal with NTFRS issues should be able to modify a registry key without a detailed explanation of where to click and what to type.

The Wrong Treatment

It seems that when I stopped NTFRS, the non-authoritative restore had already begun so by stopping the service I had basically screwed things up even more. But there is an easy way to manually trigger a non-authoritative restore. Following the procedure described in the aforementioned KB article, I had to stop the NTFRS service, set the global BurFlags registry value to hex D2 and restart NTFRS. That's what I did. NTFRS was smart enough to restore SYSVOL first and once that was done a couple of minutes later, the server could act as domain controller again.

The Root Cause

Why did the server crash in the first place? The short answer is: I don't know. It could have been a journal wrap. It could have been something else. As I said earlier, a journal is a way to ensure a complex on-disk data structure's consistency in situations were non-atomic writes might be interrupted by a software or hardware failure. Theoretically, journals of file systems and relational databases should contain at most several seconds of pending transactions. I may be wrong about this because it seems that ESENT journals can become quite big. In Service Pack 4 of Windows 2000, the NTFRS database's journal area is set to 512M. Why so much? I don't know. How could even 512M be insufficient sometimes? It beats me, I'll give you my best guess. Each time a file on a replica is modified, an entry is added to the journal. Once that modification is propagated to the other replicas, the corresponding journal entry is flushed. Microsoft says that the journal can wrap if too many modifications happen while the NTFRS service is not running. Microsoft also says that in the case of a journal wrap, the replication ignores files (probably the files that were in the journal when it wrapped).

In my case, two of the replicated shares in my domain are scheduled to replicate only during off-peak hours; at night and during the weekends, that is. As a matter of fact, just before the server crashed, I was working with files on one of those replicas. There must be a connection. My last action on that share was extracting an self-extracting hotfix package for Windows 2000 by running it with the /X switch. After that it seemed that nothing more would happen, so I assumed the self-extractor had just terminated. Then one of my users told me that they were not able to log in anymore. You know what happened next. When I returned to my workstation, it showed a message box saying "Extraction failed" or something similar. Maybe the self-extracting hotfix package did something unusual to the file system on the replicate which cause NTFRS to go berserk. Should I try to reproduce the problem? I don't think so.

The Right Treatment

What should I have done? I should have read the event log message 13520 more carefully. I should not have stopped NTFRS because it had already started the non-authoritative restore. I learned another lesson that is not documented anywhere - it's abut interrupting the restore and how to restart it properly.

For The Impatient Patient

Try not to interrupt restoration process. If you absolutely have shut NTFRS down or reboot the server, make sure you follow this procedure:

  1. Shut NTFRS down.
  2. Move all files and directories from the NtFrs_PreExisting___See_EventLog back to their original place and remove the empty NtFrs_PreExisting___See_EventLog folder.
  3. Set the global BurFlags registry value to D2. Alternatively, you can decrease downtime by checking whether the restoration process has already finished for some of your replicated shares. If NtFrs_PreExisting___See_EventLog is empty on some of the replicated shares, it means that the restoration has already finished for that particular share and it does not necessarily be restarted again. Double check using the event log. It might be that the restore hasn't even started yet, in which case NtFrs_PreExisting___See_EventLog might be empty or not exist at all! You can trigger replication for individual replicas by setting the replica-specific BurFlags registry value to D2. If unsure, use the global BurFlags.
  4. Reboot or restart NTFRS.

The reason for step 3 is that BurFlags is reset to 0 immediately after NTFRS starts. You need to set it back to D2 every time you intend to start it with an (non-)authoritative restore.

The reason for moving the files back from NtFrs_PreExisting___See_EventLog in step 2 is that when the restore starts, the contents of NtFrs_PreExisting___See_EventLog will be deleted! It doesn't make sense but that's what happens. If you skip step 2, you will lose everything that wasn't yet restored when you interrupted the non-authoritative restore.

( categories: Windows | Administrator )
Submitted by Anonymous on Tue, 12/05/2006 - 13:43.

We are from Catalonia and we have a problem in a server in CasaBlanca Maroco. After we flied to Moroco we found this page, and you solve owr Problem.

Thanks a lot!!!

Pere Lluis Sanz & Josep Prada
From Sabadell
Catalonia

Submitted by Anonymous (not verified) on Sun, 08/28/2005 - 21:57.

Id like to thank the authur of this article,

you helped me fix my NTFRS on a SBS 2003 Premium with the above info

I had to perform a D4 Authoritative BurFlag :)

regards

John K