Insight II



C       Setting Up the Brookhaven PDB

For Use with Insight II Software


PDB Organization

Getting the Brookhaven PDB database

Refer to the Biosym Products System Guide, chapter 4 or 5 (depending on your installation media) for information on how to access the Brookhaven databank.

Coping with more than 8000 PDB files

The growth in the size of the Brookhaven Protein Data Bank (PDB) has accelerated enormously over the last year. As of mid-1998 the PDB, if copied to a hard disk without compression, requires about four gigabytes of disk space. As explained in Appendix I, Using Compressed Files, Biosym's software can now read compressed PDB files (and other types of compressed files, such as archive, psv, and grid files). The files must be compressed either with the standard UNIX compress command or with the gzip program distributed by the Free Software Foundation. We recommend that you use gzip to compress your PDB files because its compression ratio is higher.

The Standard Brookhaven Directory Structure

Each CD from Brookhaven organizes the PDB entry files under a directory named distr (short for distribution). The distr directory contains many subdirectories, each of which has a two-character name. Each such subdirectory contains all PDB files having those two characters as the middle two of their four-character PDB identifiers. For example, the PDB entry 1crn can be found in the file distr/cr/pdb1crn.ent.
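The naming rule above can be written as a small helper. This is only an illustrative sketch in Bourne-shell syntax (the commands elsewhere in this appendix use the C shell); the function name pdb_entry_path is hypothetical and is not part of the Biosym distribution.

```shell
# Hypothetical helper: map a four-character PDB identifier to its
# relative path in the standard Brookhaven layout.
pdb_entry_path() (
    # File names are lowercase, as in the pdb1crn.ent example.
    id=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$id" in
        ????) ;;                       # exactly four characters
        *) echo "not a four-character PDB identifier: $1" >&2
           exit 1 ;;
    esac
    # Characters 2-3 of the identifier name the subdirectory.
    mid=$(printf '%s' "$id" | cut -c2-3)
    printf 'distr/%s/pdb%s.ent\n' "$mid" "$id"
)

pdb_entry_path 1crn
```

For the identifier 1crn this prints distr/cr/pdb1crn.ent, matching the example above.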

Using a Representative Selection of PDB Files

For users who do not need access to the full PDB, an alternative is to use only a representative sample of the PDB. Note that this strategy is not adequate for most users of the Homology product.

A good representative selection can be found in the list described in:

Hobohm U, Sander C (1994) Enlarged representative set of protein structures. Protein Sci 3:522-524

The latest version of this list can be obtained via anonymous ftp from:

ftp.embl-heidelberg.de (192.54.41.33)

directory: /pub/databases/protein_extras/pdb_select

Another is the sample of PDB files collected by the PDB Users' Group under the direction of Professor Jane Richardson. It is part of the Protein Science CDROM, which is available from Cambridge University Press, 40 West 20th St., New York NY 10011-4211 USA. Order forms appear in the journal Protein Science. This list of selected filenames is also available directly from Brookhaven's anonymous ftp server at ftp.pdb.bnl.gov (internet address 130.199.144.1) in the user_group directory.


Installing the PDB for use with Insight II Software

Installing PDB Files

Biosym/MSI's software now supports the standard Brookhaven directory structure described in the preceding section. Most users should find this new method simpler and more convenient. To use this new method of organizing your PDB files, you must set the environment variable INSIGHT_PDB to point to a directory that contains one or more subdirectory trees that conform to the standard Brookhaven directory structure. You can set this variable at installation time (refer to the separate Biosym Products System Guide), in your local .cshrc file, or from within the Insight II program through the Session/Env_var command. The distr subdirectories can occur at any depth below INSIGHT_PDB, and there is no limit (other than available disk space) to the number of distr subdirectory trees you can have.

The method used before release 95.0, which required all PDB files to be in a single directory, is still supported by Biosym/MSI's software (see The Shell Script Utility: link_pdb on page C-319), but dealing with more than 8000 files in a single directory can be overwhelming. Furthermore, copying or linking the files from the CDs into a single directory can be complicated.

If you wish, you can even use both organizing schemes, with some of your PDB files in the INSIGHT_PDB directory itself and others in two-character subdirectories under distr subdirectories.

Example of PDB setup for use with Biosym/MSI's software

The following example describes a typical procedure for installing the PDB for use by Biosym's software. It assumes that only one CDROM drive is available, that it is mounted on /cdrom, and that disk space is limited. To minimize the use of disk space, only two of the three Brookhaven CDs are copied onto the hard disk, and those files are compressed using the gzip program. The third Brookhaven CD is left in the CDROM drive, and its distr subdirectory tree will be made to appear under INSIGHT_PDB by means of a soft link.

In this example, the directory /usr/local/pdb will be used as the INSIGHT_PDB directory. Here is the procedure:

1.   Create the directory to which your environment variable INSIGHT_PDB will point:

> mkdir /usr/local/pdb

> setenv INSIGHT_PDB /usr/local/pdb

2.   Create one subdirectory under $INSIGHT_PDB for each distr tree to be created:

> cd $INSIGHT_PDB

3.   Put the first Brookhaven CD into the CDROM drive. Copy the files under the distr subdirectory onto the hard disk:

> cp -r /cdrom/distr $INSIGHT_PDB/pdb1

4.   Compress the files just copied:

> gzip -r $INSIGHT_PDB/pdb1

5.   Repeat steps 3 and 4 for each Brookhaven CD to be copied, using a different subdirectory name (for example, pdb2) each time.

Note: you may use compress instead, if gzip is not installed at your site. The gzip software can be obtained from the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA, or downloaded via ftp from the directory /pub/gnu/ at prep.ai.mit.edu.

6.   Add this definition of INSIGHT_PDB to your .cshrc file:

> setenv INSIGHT_PDB /usr/local/pdb

Note that this can also be done from within Insight II through the Session/Env_Var command.

The -r (recursive) option used with the cp and gzip commands in this example causes those commands to operate on the specified directory and on all files and subdirectories within it. These commands may take minutes or tens of minutes, depending on the speed of your hardware, when used on thousands of PDB files.
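The copy and compress steps above can be collected into one helper, sketched here in Bourne-shell syntax rather than the C shell used in the steps. The function name install_pdb_cd is hypothetical. Two details are assumptions that the steps above do not state: the mkdir that creates the per-CD subdirectory, and the chmod that makes the copies writable before compression. The demonstration runs on a throwaway tree that stands in for a mounted CD.

```shell
# Hypothetical helper: copy one CD's distr tree under the PDB root and
# compress it. Arguments: CD mount point, PDB root, subdirectory name.
install_pdb_cd() (
    cd_root=$1 pdb_root=$2 name=$3
    mkdir -p "$pdb_root/$name" || exit 1
    # Copying into an existing directory yields $pdb_root/$name/distr.
    cp -r "$cd_root/distr" "$pdb_root/$name/" || exit 1
    # Files copied from a CD may be read-only (assumption); make the
    # copies writable so that gzip can replace them.
    chmod -R u+w "$pdb_root/$name" || exit 1
    gzip -r "$pdb_root/$name"
)

# Demonstration on a throwaway tree standing in for a mounted CD.
demo=$(mktemp -d)
mkdir -p "$demo/cdrom/distr/cr"
echo "HEADER    PLANT SEED PROTEIN" > "$demo/cdrom/distr/cr/pdb1crn.ent"
install_pdb_cd "$demo/cdrom" "$demo/pdb" pdb1
find "$demo/pdb" -type f -print

# For the CD that stays in the drive, a soft link of the assumed form
#   mkdir "$INSIGHT_PDB/pdb3"; ln -s /cdrom/distr "$INSIGHT_PDB/pdb3/distr"
# would make its distr tree appear under INSIGHT_PDB.
```

With a real CD, the equivalent call would be install_pdb_cd /cdrom /usr/local/pdb pdb1, repeated with pdb2 for the second CD.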

The gzip program is recommended for compression of PDB files because of its superior compression algorithm and because it can recursively compress all files in a directory tree with a single command. It compresses PDB files by a factor of about 3.8 on average. See Appendix I, Using Compressed Files for instructions on how to obtain gzip.

If you do not have the gzip program, you can compress your PDB files with the standard UNIX compress command. Unfortunately it cannot compress recursively, so you must combine it with the UNIX find command to compress all files in a subdirectory tree:

> find $INSIGHT_PDB -name "*.ent" -print -exec \
compress '{}' ';'

Keep the INSIGHT_PDB Directory Clean

The directory pointed to by the environment variable INSIGHT_PDB should not contain files other than PDB files or subdirectories other than those containing distr subdirectory trees. This is important because Biosym's software uses a UNIX find command to find all of the distr subdirectories. If you have superfluous files and subdirectories under INSIGHT_PDB, the find command wastes time searching them for distr subdirectories. This results in an unnecessary delay the first time you try to read in a PDB file.

The UNIX find command is not executed for each retrieval of a PDB file, but only for the first retrieval during a given execution of a program. The locations of the distr directories are then stored internally for subsequent retrievals of PDB files. They are discarded, and another find command is executed, only if the environment variable INSIGHT_PDB is changed while the software is running.
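To check by hand which distr directories lie below INSIGHT_PDB, and how much clutter surrounds them, a command along the following lines can be used. The exact find invocation used by Biosym's software is not documented here, so this is an assumed equivalent rather than the real one; the function name list_distr_dirs is hypothetical, and the sketch uses Bourne-shell syntax.

```shell
# Hypothetical helper: list every distr directory below a PDB root,
# without descending into the distr trees themselves.
list_distr_dirs() {
    find "$1" -type d -name distr -prune -print
}

# Demonstration on a throwaway tree; notes/old plays the role of a
# superfluous subdirectory that find must still search.
root=$(mktemp -d)
mkdir -p "$root/pdb1/distr/cr" "$root/pdb2/distr/hh" "$root/notes/old"
list_distr_dirs "$root" | sort
```

Timing this command on your own INSIGHT_PDB directory gives a rough idea of the delay to expect on the first PDB retrieval.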

IBM Workstations Warning: Do Not Use Soft Links to distr Subdirectories

The find command discussed in the preceding section does not traverse soft links on an IBM workstation. This is not a problem if you copy and compress your PDB files as in the instructions beginning at step 3 of the installation instructions above.

You should not, however, create a link from a distr subdirectory of INSIGHT_PDB to the CDROM drive. The recommended procedure, if you have enough disk space, is simply to copy and compress all of your PDB files onto the hard disk. If you do not have enough disk space for this and must use the CDROM drive for part of your PDB database, then you should mount the CDROM drive directly on a subdirectory of INSIGHT_PDB.


Reading a PDB File into Insight II

Using the Molecule/Get Command

The Molecule/Get command can be used to read in a PDB file, specified by its file name, as in previous releases. As described above, the PDB files need not all be in one directory, but can instead be in subdirectory trees that conform to the standard Brookhaven PDB directory structure. All of these subdirectory trees must be under the directory pointed to by the environment variable INSIGHT_PDB. If your PDB files are organized in this way, their names do not appear in the File Name value-aid when the PDB Directory parameter is set to System. The Molecule/Get command can still find any PDB file, however, if you simply type its name into the File Name parameter.

For example, to read in crambin, enter pdb1crn.ent for the File Name parameter. You can also enter the four-character PDB identifier code. For example, to read in crambin you can enter 1CRN for the File Name parameter.

Using the File/Import Command

The File/Import command allows you to access PDB files not only by filename, but also by four indexed attributes: resolution, author, source, and compound. This command requires specially processed index files, which are generated by separate utility programs. The index files supplied with Biosym's 95.0 release (pdb_author.index, pdb_source.index, pdb_resolu.index, and pdb_compound.index, all in the $BIOSYM/data/insight directory) were generated from the April 1995 release 71 of the PDB. These index files can be used with later versions of the PDB, but they do not contain entries for files added since April 1995. You should therefore generate new index files when you install a new release of the PDB, as explained in Creating New Index Files for the File/Import Command on page C-316.

To read a PDB file using File/Import, select the Import command from the File pulldown and select PDB from the File Format list. When the boolean parameter Search is ON, the parameter Search_Criteria is active and the names of various PDB index files appear in the value-aid. For example, to search for a file submitted by W. B. Church, select pdb_author.index from the Index Files value-aid, then select CHURCH W.B. from the Pattern List value-aid. The name of the file then appears in the Files value-aid. Select the filename and then Execute the command.

To read a PDB file when you know its filename, turn the Search parameter OFF. A Files value-aid appears, listing the PDB files in your current directory. You can select one of these, or, to specify a PDB file that is in the system directories (under $INSIGHT_PDB), just type its name into the File_Name parameter and Execute the command. Like the Molecule/Get command, File/Import can also find the file even if you type only its four-character PDB identifier.

Creating New Index Files for the File/Import Command

The UNIX C-shell script pdb_index_gen, located in $BIOSYM/gifts/insight, regenerates the Biosym-format PDB index files used for searching PDB files with the File/Import command. Four index files derived from the April '97 release of PDB are installed in the 98.0 distribution tree, in the $BIOSYM/data/insight directory.

To run this script, type at the UNIX prompt:


> $BIOSYM/gifts/insight/pdb_index_gen /cdrom1

where /cdrom1 points to a directory that contains one or more subdirectory trees that conform to the standard Brookhaven directory structure.

This command as typed generates four index files named:

pdb_resolu.index

pdb_author.index

pdb_compound.index

pdb_source.index

using respectively the following Brookhaven index files: resolu.idx, author.idx, compound.idx, and source.idx. These newly created files are read in when you start up Insight II from this local directory.

If you would like to update the 98.0 release tree with these new files, you need to copy them to the release tree. To do this, type:


> cp pdb_resolu.index $BIOSYM/data/insight

and so forth for each file. Note that you need to have write permission in the $BIOSYM/data/insight directory. Check with your system administrator if you do not.
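Copying all four files can be done in one loop. The following is a minimal sketch in Bourne-shell syntax; the function name install_index_files is hypothetical, and the demonstration uses empty placeholder files in a temporary directory instead of the real release tree.

```shell
# Hypothetical helper: copy the four generated index files from a source
# directory into a destination directory, stopping at the first failure.
install_index_files() (
    src=$1 dest=$2
    for kind in resolu author compound source; do
        cp "$src/pdb_$kind.index" "$dest/" || exit 1
    done
)

# Demonstration with placeholder files in a throwaway directory.
work=$(mktemp -d)
mkdir "$work/new" "$work/release"
for kind in resolu author compound source; do
    : > "$work/new/pdb_$kind.index"
done
install_index_files "$work/new" "$work/release"
ls "$work/release"
```

Against a real installation, the call would be install_index_files . "$BIOSYM/data/insight", run from the directory in which pdb_index_gen wrote its output.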

Should you wish to regenerate only one specific file out of the four, you can do so by giving the -f argument to the script, as in:


> $BIOSYM/gifts/insight/pdb_index_gen /cdrom1 -f resolu

As typed, this command regenerates only the pdb_resolu.index file. The specifier -f accepts the following options:

resolu for the resolution index file

author for the author index file

compound for the compound index file

source for the source index file.


Protein Loop Search (Biopolymer and Homology Modules)

Setting Up and Using the Protein Loop Search Commands

You can search a subset of the PDB for protein loops that match structural criteria using either the Protein/SearchLoop command in the Biopolymer module or the Loops/Search command in the Homology module. In both cases, the search is done using a precomputed database of alpha carbon distance matrices from a selected list of PDB files. The file of distance matrices is a binary data file named pdb_ca_distance.dat.

We have generated this file for you and installed it in the $BIOSYM/data/biopolymer directory of the current release distribution tree. This file was calculated from a subset of files in the April 1998 release of the PDB. The list of files used is in the file $BIOSYM/data/biopolymer/pdb_entry.dat. This list of entries is a subset of the representative selection of PDB files created by Hobohm and Sander (see Using a Representative Selection of PDB Files on page C-311). It comprises all files in the "25% list" from Hobohm and Sander's June 1998 version of their representative selection. To use the loop search commands with this database, you need to make sure that all of the PDB files listed in $BIOSYM/data/biopolymer/pdb_entry.dat are accessible within or beneath the directory pointed to by the environment variable INSIGHT_PDB, as explained in Installing the PDB for use with Insight II Software on page C-312.

You can skip the remainder of this section if you do not wish to replace the database of alpha carbon distance matrices that we have supplied.

If you wish to regenerate pdb_ca_distance.dat using a different list of PDB entries, here are the steps to follow:

1.   Create a selected list of high-resolution PDB files.

Create a text file containing the list of entries that you wish to use for protein loop searches. A sample file resides in $BIOSYM/data/biopolymer/pdb_entry.dat. It simply lists the names of the PDB files, one filename per line. If you wish, you can list only the four-character PDB identifiers instead of the complete filenames.

2.   Make sure your PDB files are accessible to pdb_find_distance.

The program pdb_find_distance creates the distance matrix file, and it requires that all the PDB files to be used for loop searching be accessible within or beneath the directory pointed to by the environment variable INSIGHT_PDB, as explained in Installing PDB Files on page C-312.

3.   At the UNIX prompt, run the pdb_find_distance utility program by typing:

> pdb_find_distance filename

where filename is the name of the file (created in step 1) containing your list of PDB filenames. The program prompts you for the name of the file if you do not type it on the command line.

The program writes on the screen the name of each PDB file it has processed. When finished reading the files, it creates the binary file pdb_ca_distance.dat. This file occupies about 18.9 MB of disk space when generated using the entries in $BIOSYM/data/biopolymer/pdb_entry.dat.

4.   Place pdb_ca_distance.dat into the proper directory.

To do this, execute this UNIX command:

> mv pdb_ca_distance.dat $BIOSYM/data/biopolymer

Be sure you have the necessary privileges to create files in this directory. You may need to log in as root to do this. If you do not have permission to create files in this directory, or if you only want to use a local version of pdb_ca_distance.dat temporarily, you can make your local version accessible to the Biopolymer and Homology modules by setting the environment variable BIOPOLYMER_DATA to the directory that contains your local version of pdb_ca_distance.dat.
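Before running pdb_find_distance (step 3), it can be useful to confirm that every entry in your list file is actually present below INSIGHT_PDB, as step 2 requires. The following Bourne-shell sketch does this; the function name missing_entries is hypothetical, and the check for .gz and .Z copies assumes that pdb_find_distance reads compressed files in the same way as the rest of the software. It runs one find per entry, so it is slow on long lists.

```shell
# Hypothetical helper: print each entry in a list file that has no
# matching PDB file (plain, .gz or .Z) below the PDB root.
missing_entries() (
    list=$1 root=$2
    while read -r entry; do
        [ -n "$entry" ] || continue
        name=$(printf '%s' "$entry" | tr '[:upper:]' '[:lower:]')
        # Accept either a full file name or a four-character identifier.
        case "$name" in
            pdb*.ent) file=$name ;;
            *)        file=pdb$name.ent ;;
        esac
        hit=$(find "$root" \( -name "$file" -o -name "$file.gz" \
                  -o -name "$file.Z" \) -print | head -n 1)
        [ -n "$hit" ] || printf '%s\n' "$entry"
    done < "$list"
)

# Demonstration on a throwaway tree: 1crn is present, 2HHB is not.
work=$(mktemp -d)
mkdir -p "$work/pdb/pdb1/distr/cr"
: > "$work/pdb/pdb1/distr/cr/pdb1crn.ent.gz"
printf '%s\n' pdb1crn.ent 2HHB > "$work/entries.dat"
missing_entries "$work/entries.dat" "$work/pdb"
```

An empty result means that every listed entry was found.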


Sequence Database Searching (Homology Module)

Seq_extract and the Sequence Database File pdb.seq

The sequence database searching capability of the Homology product uses the FASTA algorithm to search for homologous sequences. One sequence database file, pdb.seq, is provided for you in the Homology gifts directory ($BIOSYM/gifts/homology). It contains the amino acid sequences of all proteins in the October 1998 release of the PDB. It is already installed for use by the commands in the Databases pulldown of the Homology module.

If you want to create a new pdb.seq from a different release of the PDB, follow the instructions for seq_extract in Appendix D of the Homology or Profiles-3D manual.

In addition to pdb.seq, a variety of other protein and nucleic acid sequence database files can be installed for use by the database searching commands in the Homology and Profiles-3D products. You must obtain these sequence database files from the sources listed in the Biosym Products System Guide, under the section in Chapter 3 titled Database Resources. If you download compressed sequence database files from the Internet, be careful to avoid files compressed on VMS computer systems. Although these files end in .Z extensions, they are not in the same format as the .Z files created by the UNIX compress command and cannot be used with Biosym's software.

To install additional sequence database files, follow the procedures explained in the Insight II Products System Guide, under the section entitled Install Sequence Database Searching Software for Homology and Profiles-3D.


3D Profiles Database (Profiles-3D Module)

Create_Profiles and the Database of 3D Profiles

Biosym includes a precalculated database of 3D profiles for use with the Find Structures command of the Profiles-3D product. These files are stored in the directory $BIOSYM/data/profiles_3d/database. These were derived from the representative selection of PDB files created by Hobohm and Sander (June 1998, 25% list). A complete set of profiles derived from the October 1998 PDB release is in $BIOSYM/gifts/profiles_3d/. You can update the database of 3D profiles using the Create_Profiles command of the Profiles-3D product. If you want to use a database of profiles in a location other than $BIOSYM/data/profiles_3d/database, you must set the environment variable PROFILE_DATA to the pathname of the directory that contains the profiles.


The Shell Script Utility: link_pdb

The purpose of the script link_pdb is to make PDB files on the CDROM drive accessible to any software that requires that all the PDB files be in one directory. Biosym's software no longer requires this, and we recommend that you use the standard Brookhaven directory structure instead.

The UNIX C-shell script link_pdb creates links from your current working directory to Brookhaven PDB data files on your CDROM drive. It can also copy the files themselves from the many subdirectories on the CD into a single directory on your hard disk.

A UNIX link can be used as if it were a real file. In fact, a link is a pointer to a file in a different location. Link_pdb creates links in a single directory (your current working directory) that point to PDB files in many different subdirectories (such as the ones on the Brookhaven CD). The advantage of doing this is that it saves considerable space on your hard disk and eliminates the effort involved in copying the files from the CDROM drive. The disadvantage is that reading the files is slower. As of April 1995, the complete PDB fills three CDs. If you want to access all of these through links, you will need three CDROM drives. Alternatively, if you only have one CDROM drive, you can copy all the files from two CDs, and then put the remaining CD into the CDROM drive and create links for it. You can use link_pdb with its -copy option to copy the files from the first two CDs.
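The linking idea can be illustrated with a few lines of Bourne shell. This is not the link_pdb script itself, only a sketch of the technique it is described as using; the function name link_entries is hypothetical. The mount point should be given as an absolute path so that the links resolve from any directory.

```shell
# Hypothetical sketch of the linking idea: create one symbolic link per
# entry file in the current directory, pointing into a CD's distr tree.
link_entries() {
    find "$1/distr" -type f -name 'pdb*.ent' -print |
    while read -r target; do
        # -f replaces an existing link of the same name.
        ln -sf "$target" "$(basename "$target")"
    done
}

# Demonstration on a throwaway tree standing in for a mounted CD.
demo=$(mktemp -d)
mkdir -p "$demo/cdrom/distr/cr" "$demo/flat"
echo "HEADER" > "$demo/cdrom/distr/cr/pdb1crn.ent"
( cd "$demo/flat" && link_entries "$demo/cdrom" && ls -l )
```

After this runs, the single directory flat contains a link named pdb1crn.ent that can be read as if it were the file itself.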

To use link_pdb, you must first go to the directory in which you want to create the links. Then just type link_pdb at the UNIX prompt, followed by the name of the directory on which your CDROM drive is mounted. For example, if your CDROM is mounted on /cdrom, you would type:


> link_pdb /cdrom

The links remain on your hard disk even if you remove the CD from the CDROM drive. If you try to read a PDB file when the CD is not in the drive, Insight II gives you an error message indicating that it cannot open the file. When you put the CD back in, the files are again readable.

If you have three CDROM drives mounted on /cdrom1, /cdrom2, and /cdrom3, you can put the Brookhaven CDs into them and make links for all of them at once:


> link_pdb /cdrom1 /cdrom2 /cdrom3

If you have enough disk space, you can also use this command to copy the files themselves onto your hard disk, rather than making links. This is useful if you want to copy the first two CDs and make links to the third. To do this you must specify the option -copy on the command line. The full procedure to copy two CDs and link the third is:

1.   cd to the directory that will contain your PDB files and links. This is the directory that your environment variable INSIGHT_PDB points to.

2.   Put the first Brookhaven CD into the CDROM drive.

3.   Type this command at the UNIX command prompt:

> link_pdb -copy /cdrom

4.   Put the second Brookhaven CD into the CDROM drive.

5.   Type this command at the UNIX command prompt:

> link_pdb -copy /cdrom

6.   Put the third Brookhaven CD into the CDROM drive.

7.   Type this command at the UNIX command prompt:

> link_pdb /cdrom

If you do not have enough space on your hard disk to copy the files from the first two CDs, use the same procedure described above but omit the -copy options from the first two link_pdb command lines. You then have links for all the PDB files pointing to the CDROM drive. Since only one CD can be in the CDROM drive at any one time, those files on the other two CDs appear to be present but are unreadable. If you try to access one of them via Molecule/Get, Insight II gives you an error message indicating that the file cannot be read. You must then change CDs and try to read the file again. This approach may require frequent changing of the CDs, and is therefore inconvenient, but it does give you access to all of the latest PDB files. Note that if you set up your PDB directory in this way, you cannot run the pdb_find_distance utility, unless all of the files in pdb_entry.dat are first copied into your INSIGHT_PDB directory (see Setting Up and Using the Protein Loop Search Commands on page C-317).

Link_pdb can also be used to update your links or copies each time you get a new release of the PDB. To do this, just run link_pdb in the directory that contains your old links or copies. Link_pdb always replaces old copies with new ones. It replaces an old link with a new one only if the old link does not point to a valid location in the new PDB release. If an old link points to a file that has become obsolete (and therefore is not present in the latest PDB release), link_pdb does not delete the old link. Obsolete files therefore appear to be present but are unreadable. To be sure to eliminate these obsolete links, you must first delete all of your old links before creating new ones for your new PDB release. The easiest way to do this is to delete your PDB directory and its contents with the recursive remove command:


> rm -r pdb_directory

You must then recreate your pdb directory, cd into it, and run link_pdb to create the new links. Be sure the environment variable INSIGHT_PDB points to the directory containing the links.
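If you would rather inspect your old links before deleting the whole directory, links whose targets no longer exist can be listed with find. This Bourne-shell sketch is an alternative suggested here, not a feature of link_pdb; the function name dangling_links is hypothetical. Run it only while the relevant CD is mounted, because every link into an absent CD also appears broken.

```shell
# Hypothetical helper: list symbolic links in a directory tree whose
# targets no longer exist (for example, links to obsolete entries).
dangling_links() {
    # test -e follows the link, so it fails for a broken one.
    find "$1" -type l ! -exec test -e {} \; -print
}

# Demonstration: one valid link and one link to a missing file.
demo=$(mktemp -d)
: > "$demo/real.ent"
ln -s "$demo/real.ent" "$demo/pdb1crn.ent"
ln -s "$demo/gone.ent" "$demo/pdb9xyz.ent"
dangling_links "$demo"
```

Only the broken link pdb9xyz.ent is reported; the listed links can then be removed individually after review.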




Last updated December 17, 1998 at 04:29PM PST.
Copyright © 1998, Molecular Simulations Inc. All rights reserved.