
The USGenWeb Archives TOC Tips #3

What's in a Name

Two of the most important functions of a File Manager are to give each contributed document a descriptive document name, and to give it a file name before posting.  This discussion covers file naming, and why it is so important.

Every file that is stored in the USGenWeb Archives has a unique Uniform Resource Identifier, or URI.  (URI is the more general term that superseded URL, or Uniform Resource Locator.  In practice the two terms are often used interchangeably.  For more info, see: http://www.w3.org/Addressing/ ).  File Managers are responsible for two items in the URI: the folder name and the file name.

As an example, a document contributed from Burbank, California might be named:

Bill of Sale for Carrot Patch, Elmer Fudd to Bugs Bunny, 1 Apr 1931

A file name for that document could be:  fudd01.txt

The URI for this Bill of Sale example would be:

http://ftp.rootsweb.com/pub/usgenweb/ca/losangeles/deeds/fudd01.txt

The folder name "deeds" was picked from the list of acceptable subdirectory names specified in Guidelines for State File Managers/Archivists:

http://www.rootsweb.com/~usgenweb/guide2.htm

The file name, fudd01.txt, was chosen so that it would be unique within the Deeds folder for Los Angeles County, California.  If that file name has already been used, or is accidentally reused later, the original document will be overwritten and lost.
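As a rough illustration of how the folder and file name combine into a URI, and of the overwrite risk, here is a minimal Python sketch.  The base address is taken from the example URI above; the function names build_uri and is_name_taken, and the sample folder listing, are hypothetical and not part of any Archives tool.

```python
from pathlib import PurePosixPath

# Base address copied from the example URI in the text.
BASE = "http://ftp.rootsweb.com/pub/usgenweb"

def build_uri(state, county, category, file_name):
    """Join the folder parts and the file name into a full URI."""
    path = PurePosixPath(state, county, category, file_name)
    return f"{BASE}/{path}"

def is_name_taken(file_name, existing_names):
    """Return True if file_name would overwrite a file already in the folder.

    The comparison is exact, so names differing only in case do not collide.
    """
    return file_name in set(existing_names)

# Hypothetical listing of the ca/losangeles/deeds folder.
deeds_folder = ["bunny01.txt", "fudd01.txt"]

print(build_uri("ca", "losangeles", "deeds", "fudd01.txt"))
print(is_name_taken("fudd01.txt", deeds_folder))   # True: posting would overwrite
print(is_name_taken("fudd02.txt", deeds_folder))   # False: safe to post
```

Checking the folder listing before every upload is the simplest way to avoid losing an existing document.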

While discussing this file name, let's have a quick review of the header that goes on each file.

Every document that is posted in the USGenWeb Archives must have the following items plainly shown (the above carrot patch sale used here as an example):

The State and County            Los Angeles County, CA
The Category                    Deeds
The Document Name               Bill of Sale for Carrot Patch,
                                Elmer Fudd to Bugs Bunny, 1 Apr 1931
The Name of the Contributor     Optional if the contributor so requests
E-mail of the Contributor       Optional if the contributor so requests
The USGenWeb Archives Notice    Either the long or short version

The Notice may be placed at the bottom of the document.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Users of the USGenWeb Archives access archived documents in one of several ways:

Search Engine.  The two USGenWeb Archive Search Engines provide sufficient details about a document so a user will know if it possibly contains the information they are looking for.  Likewise, Google and other commercial search engines provide an excerpt which describes the document content.  The file name is not important for searches.

Tables of Content (TOC).  The TOCs display a document name which describes the content.  Some TOCs also display the file name, but it is not used as the primary source for document content.

Recent and Daily Uploads Reports.  Many states and some counties display a list of files uploaded during the previous week.  The Daily Uploads report mailing list has about 1600 subscribers.  These lists display only the URI, so the file name is the only indication of what is in a document.

FTP Directories.  Prior to the popularity of the World Wide Web and the browsers which made it possible, many people surfed the Internet using the FTP directories.  The USGenWeb Archives were initially set up for access via FTP, and there are still people who use this as their primary means of accessing files.  Navigating through the FTP directories reveals the state/county/folder name, but the users are left with only the file name to determine if the information they are looking for might be in a document.  Every file name within a directory must be unique.  The same file name may be used in another directory, since the URI will use the different state/county/category.

A reminder - ASCII Text files are the only files that are to be placed in the ftp directories.  All other files go in the html directory.  Be sure that these files are pure ASCII Text so that they may be viewed by all current and future operating systems.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Several considerations go into choosing a good file name.

Content. The file name MUST describe the content.  In our carrot patch sale example, sale01.txt, deed01.txt and dd01041931.txt would all be inadequate.  A combination of document type and sequence number, e.g., deed01, deed02, deed03, should NEVER be used to name files, as it gives the user no idea of what is in the document.

Describing the content via the file name can result in anything from a very short name at one end of the scale to an elaborate name at the other.  Examples:

fudd01.txt    -    saleofcarrotpatchfromelmerfuddtobugsbunnyon1april1931.txt

Between these two extremes, there are many acceptable file names:
fuddtobugs01.txt       F300-B50001.txt       fuddelmer-bunnybugs01.txt

The primary consideration for naming a file is the surname(s) in the document.  Our users are searching for the surnames of their ancestors, not some non-descriptive term such as obit001.  File names, if not using Soundex, should put the last name first for easier browsing, e.g., brown-j and not j-brown for John Brown.  This ensures that all of the files for a given surname are listed together, rather than being scattered throughout a directory.
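The effect of putting the surname first can be seen in a short Python sketch.  The helper surname_first is hypothetical, and it assumes a simple "given name, surname" order with a single-word surname; real names will need more care.

```python
def surname_first(full_name):
    """Turn 'John Brown' into 'brown-j' (surname, hyphen, first initial)."""
    parts = full_name.lower().split()
    given, surname = parts[0], parts[-1]
    return f"{surname}-{given[0]}"

names = ["John Brown", "Elmer Fudd", "Alice Brown", "Bugs Bunny"]

# A directory listing is alphabetical, so sorting shows how files would appear.
for stem in sorted(surname_first(n) for n in names):
    print(stem)
# brown-a
# brown-j
# bunny-b
# fudd-e
```

Both Brown files end up next to each other, which would not happen with names such as a-brown and j-brown.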

There are times when using a surname as the basis for the file name is inappropriate.  A collection of deeds for Los Angeles County in the year 1931 would contain many surnames, in which case deeds1931.txt would be a good file name.

Some documents are about places, such as cemeteries.  The URIs and the FTP directories let users know where these places are located and what type of document each is, e.g., va/tazewell/cemeteries.  The file name must help the users know which cemetery.  For example, the file for the Central Lutheran Church Cemetery could be named centluth.txt or centrallutheran.txt.  There would be no need to include the word cemetery in the file name, as the document has already been identified as a cemetery by the URI or FTP directory name.

By the same token, use of the words will, deed, obit, etc. in the file name is not necessary, as it is redundant.  Also, there is no need to precede a file name with a letter or other abbreviation to identify the category the document is in.  As an example, d-fudd01.txt for a Fudd deed, or o-fudd01.txt for a Fudd obituary, is not necessary and should not be used.

Longevity. There is no reason to doubt that the USGenWeb Archives will still be going strong 100 years from now, with over 99 gigajillion bytes stored.  With any luck, Linda Lewis will still be the Project Coordinator.  That means that we must use a file naming system that accommodates long term use.

Over the period of a century, Elmer Fudd, Elmer Jr. and Elmer III might sell quite a few carrot patches to the Bunny family.  So, if you try to use combinations of the first and last names as the basis for a file name, you will eventually run out of possible combinations, especially for families with a lot of offspring, such as the Bunny family.  Some sort of alphanumeric file naming scheme is required for the long haul.  The file name fudd01.txt permits any number of transactions between the Fudds and the Bunnys, as does fudd-bunny01.txt.
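One possible way to keep such a scheme going is to pick the next unused sequence number for a given name stem.  The Python sketch below is only an illustration; next_file_name and the sample folder listing are hypothetical, and the two-digit padding simply follows the fudd01.txt pattern used here.

```python
import re

def next_file_name(stem, existing_names, ext=".txt"):
    """Return stem + the next unused sequence number + ext.

    Numbers are padded to at least two digits, matching fudd01.txt.
    """
    pattern = re.compile(re.escape(stem) + r"(\d+)" + re.escape(ext))
    used = [int(m.group(1)) for name in existing_names
            if (m := pattern.fullmatch(name))]
    return f"{stem}{max(used, default=0) + 1:02d}{ext}"

# Hypothetical contents of a deeds folder.
folder = ["fudd01.txt", "fudd02.txt", "fudd-bunny01.txt"]

print(next_file_name("fudd", folder))        # fudd03.txt
print(next_file_name("fudd-bunny", folder))  # fudd-bunny02.txt
print(next_file_name("bunny", folder))       # bunny01.txt
```

Because the number is taken from what is already in the folder, an existing file name is never reused.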

File name length. For those of us with computers using a Microsoft operating system, all pre-Windows 95 systems limited file names to an 8.3 file naming convention.  In the "8" part, you could use a combination of letters and numbers to name a file.  If you used more than eight characters, the file name was truncated to the first six characters plus a tilde and a sequence number.  So if you had used efuddtobbunny01.txt, the file name would have been truncated to efuddt~1.txt.  All of the newer operating systems permit long file names up to 255 characters in length.  While this may tempt your file naming virtuosity, you should keep file names as short as possible while adequately describing the document contents.  Keep them short and simple.
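The truncation described above can be approximated in a few lines of Python.  This is a simplified sketch of the rule as stated in the text (first six characters, a tilde, and a sequence number); the function name short_name_8_3 is hypothetical, and the actual operating system behavior has additional rules that are not modeled here.

```python
def short_name_8_3(file_name, sequence=1):
    """Approximate the 8.3 alias described in the text for a long file name."""
    base, dot, ext = file_name.rpartition(".")
    if not dot:                      # no extension present
        base, ext = file_name, ""
    if len(base) <= 8 and len(ext) <= 3:
        return file_name             # already fits the 8.3 pattern
    alias = f"{base[:6]}~{sequence}"
    return f"{alias}.{ext[:3]}" if ext else alias

print(short_name_8_3("efuddtobbunny01.txt"))  # efuddt~1.txt
print(short_name_8_3("fudd01.txt"))           # fudd01.txt
```

Note that the sequence number in efuddtobbunny01.txt is lost in the short form, which is one more reason to keep the descriptive part of the name brief.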

If you are using an operating system that restricts your file names to the 8.3 convention, and you need a longer file name to adequately describe the document, there is a workaround.  Use the file name you need in the TOC, and upload the file as you normally would.  Then use your ftp program to rename the file on the remote site.

Letter case and characters. Lower case letters should be used for all folder and file names.  The file name Fudd01.txt is a different file from fudd01.txt.  Likewise, fudd01.txt is a different file from fudd01.TXT.  Use all lower case letters.

Underscores and spaces should not be used in file names.  If you need to separate portions of the file name, use a hyphen, such as fudd-bugs15.txt.

The RootsWeb server will reject most other characters in file names.
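These character rules are easy to check before uploading.  The following Python sketch is one way to do it; the pattern and the name is_valid_file_name are illustrative assumptions, not the server's actual acceptance test, and the sketch assumes every file has a lower case extension.

```python
import re

# Lower case letters and digits, optionally separated by single hyphens,
# followed by a dot and a lower case extension.
VALID_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*\.[a-z0-9]+")

def is_valid_file_name(file_name):
    """Return True if file_name uses only lower case letters, digits and hyphens."""
    return VALID_NAME.fullmatch(file_name) is not None

for name in ["fudd-bugs15.txt", "Fudd01.txt", "fudd01.TXT",
             "fudd_bugs15.txt", "fudd bugs15.txt"]:
    print(name, is_valid_file_name(name))
# fudd-bugs15.txt True
# Fudd01.txt False
# fudd01.TXT False
# fudd_bugs15.txt False
# fudd bugs15.txt False
```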

Text or Soundex. No discussion of file names would be complete without mentioning the use of Soundex names.  When the Archives first started, one of the things that was placed on a "Wish List" with the supporting RootsWeb staff was a search engine which could easily locate archived files by surname.  It was thought that using Soundex codes would provide the basis for doing that.  As we have seen with our recently installed search engine, Soundex codes are not required to locate surnames in documents.

Soundex converts a surname into an alphanumeric code, so that Fudd becomes F300, Grindstaff becomes G653, Russell becomes R240, and Washington becomes W252.  This is a handy feature for abbreviating a surname, and with the addition of a sequence number, a unique file name can be easily generated.  Soundex codes are used by most organizations which house huge databases of surname files, such as the National Archives.  The problem is that many people do not know the Soundex code for their own surname, and most do not know the Soundex codes for the many surnames in their ancestry.  Additionally, Soundex codes are not definitive.  The R240 Soundex code for Russell is also the Soundex code for at least 35 other surnames.  So the use of Soundex in a file name provides a quick, easy and standardized means for developing file names, but produces a somewhat ambiguous guide to the file contents.
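For readers who want to see how a Soundex code is derived, here is a minimal Python sketch of the commonly used American Soundex rules: keep the first letter, code the remaining consonants as digits, collapse repeated codes, and pad with zeros to four characters.  The function name soundex is chosen for illustration, and the sketch assumes plain unaccented letters.

```python
# Letter groups that share a digit in American Soundex.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(surname):
    """Return the four-character Soundex code for a surname."""
    letters = [ch for ch in surname.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                  # h and w do not separate equal codes
        code = CODES.get(ch, "")      # vowels and y have no code
        if code and code != previous:
            result += code
        previous = code               # a vowel resets the previous code
    return (result + "000")[:4]

print(soundex("Fudd"))        # F300
print(soundex("Russell"))     # R240
print(soundex("Washington"))  # W252
```

A file name stem could then be built from the lower case code plus a sequence number, at the cost of the ambiguity noted above.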

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Summary.
Pick file names which identify who or what is in a document.
Do not use ambiguous file names, e.g., obit01, will01.
Use an alphanumeric file naming scheme which will endure.
Keep file names short and descriptive of the content.
Use only lower case letters, numerals, and the hyphen character.
_____________________________________________________________________

References:
Guidelines for State File Managers/Archivists,
http://www.rootsweb.com/~usgenweb/guide2.htm

Archives-L message from Linda Lewis dated 17 Jul 2001, Subject: Incorrect File Names

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Questions?  Comments?  Please let us hear from you on ARCHIVES-L.




