Using the MMTSB Toolset for Cluster Analysis in AMBER

You'll need the MMTSB toolkit for this tutorial. This is not included in AMBER but is freely available via GitHub. We have discussed how to install it on Linux (Centos 7) in our earlier blog post: mmtsb-toolset-installation.

The following instructions are also provided in AMBER tutorials (https://ambermd.org/tutorials/basic/tutorial3/section6.htm, which includes scripts and commands for clustering. However, we've provided the quick tour of the steps we took to execute it in our study.

In this case, the K-means clustering algorithm is applied, which traverses the pathway in search of clusters of identical structures. It then generates centroids for each cluster and calculates the root mean square deviation (RMSD) of each structure in the trajectory with respect to each cluster.

1. Prepare the directories

  • mkdir clustering
  • cd clustering
  • mkdir PDBfit

2. Use cpptraj to extract multiple PDB's from the trajectory in the clustering folder

#first save this script: "extract_pdb.cpptraj"

trajin ../md100ns-autoimage.nc

strip :WAT # strip waters form the trajectory to avoid the segmentation core dump error

strip :Na+,Cl- # strip ions form the trajectory 

rms first mass @CA,C,N #aligning the trajectory on the protein backbone atoms

trajout PDBfit/pro.pdb multi pdb #write the multiple frames in the pdb format in the PDBfit folder.

# now perform the following steps in the clustering folder

  • source $AMBERHOME/amber.sh
  • $AMBERHOME/bin/cpptraj ../comp_solv.prmtop <extract_pdb.cpptraj

#this command extracts all the frames in the PDBfit folder.

3.  Adjust numbering of PDB files: save the following script in clustering folder as "fix_numbering_pdbs.sh". This script can be run optionally if you want to renumber the pdb files in the PDBfit folder e.g.,: pro.pdb.1 to pro.pdb.00001

#! /bin/csh
set ff = *.pdb.1
set fnam = $ff:r
set numfil = `ls -1 *.pdb.???? | wc -l`
if( $numfil != 0)then
  foreach fnam (*.pdb.????)
     set fr=$fnam:r
     set fnum=$fnam:e
     mv $fnam $fr.0$fnum
     echo $fnam $fr.0$fnum
  end
endif
set numfil = `ls -1 *.pdb.??? | wc -l`
if( $numfil != 0)then
  foreach fnam (*.pdb.???)
     set fr=$fnam:r
     set fnum=$fnam:e
     mv $fnam $fr.00$fnum
     echo $fnam $fr.00$fnum
  end
endif
set numfil = `ls -1 *.pdb.?? | wc -l`
if( $numfil != 0)then
  foreach fnam (*.pdb.??)
     set fr=$fnam:r
     set fnum=$fnam:e
     mv $fnam $fr.000$fnum
     echo $fnam $fr.000$fnum
  end
endif
set numfil = `ls -1 *.pdb.? | wc -l`
if( $numfil != 0)then
  foreach fnam (*.pdb.?)
     set fr=$fnam:r
     set fnum=$fnam:e
     mv $fnam $fr.0000$fnum
     echo $fnam $fr.0000$fnum
  end
endif

#run the following commands:
  • chmod +x fix_numbering_pdbs.sh
  • sh fix_numbering_pdbs.sh

Sometimes it gives a message like  "set: No match". It is fine, you may proceed to the next step.


4Perform clustering
  • cd PDBfit
  • ls -1 . > ../clustfils  #ls -1 means single column
  • vi ../clustfils
  • kclust -mode rmsd -centroid -cdist -heavy -lsqfit -radius 6 -maxerr 1 -iterate  ../clustfils > ../Centroid   #more options can be seen in /home/user/Downloads/toolset-master/perl/cluster.pl
  • cd ../

5. Process the Clusters

The centroids themselves have little physical significance because they are essentially mathematical constructs based on cluster members. The structure with the smallest RMSD to each centroid, on the other hand, is meaningful and much easier to comprehend. We may examine the structure that is closest to the centroid and represents each of the clusters obtained in this way.

To handle the Centroid file, we'll use the following awk script: "

extract_centroids.awk"

BEGIN{b0=2;}

{centind=index($1,"#Centroid");
   c=$2;
   getline;centind=index($0,"#Centroid");
   FIL0 = sprintf("centroid%2.2d.member.dat",c)
   while(centind != 1){
     print $1,$3 > FIL0 ;
     getline;centind=index($0,"#Centroid");
   }

   numcent=0;    print $2,$1,NR-b0; 
   c=$2;
   getline; endrec = index("End",$0);
   while( endrec != 1 ){
     FIL = sprintf("centroid%2.2d.pdb",c);
     print > FIL;
     getline; endrec = index($0,"#End");
   }
   b0=NR+2;
 }

Run the following command:
  • awk -f extract_centroids.awk centroid | tee centroid.stats
it will give the number of clusters formed along with the population or number of states/conformations in that cluster. Here for example we get 3 clusters.
1 #Centroid 8523
2 #Centroid 1477

P.S.: To obtain more clusters you may change the values of radius in the kclust command.

View the pdb files in VMD, but remember that they are similar to an average structure and do not represent a physically significant configuration; they are simply the cluster's centroid.

You can find the structure closest to the cluster centre (centroid) to show a more physically meaningful centroid structure. On the second column, sort the member file numerically (rms distance from the centroid).

>sort -n -k2 centroid01.member.dat | head -5

example output after sorting:
3210 1.106036
3545 1.113473
3205 1.155165
3214 1.157611
3209 1.159552

As a result, structure 3210 has the lowest RMSD to centroid1. As a result, we'll generate a copy of this to create our best member for this centroid. Our ID and pdb file are offset by one. Hence structure ID 3210 refers to pdb file pro.pdb.3211:

>cp PDBfit/pro.pdb.3211 centroid01_bestmember.pdb

Now examine this best member pdb for features of relevance, such as the H-bond analysis results. Repeat above for the other clusters.

6 Plot the RMS distances from the centroids

Each of the centroids' member files can be read into xmgrace or xmgr and plotted. Cleaning up the plot to make it easier to read will take some time. This can be accomplished by loading each member data set in the sequence listed below:

(the order of most populated):
centroid01.member.dat
centroid02.member.dat






This clustering study reveals some interesting trends. Take note of the movement from one cluster to the next, where red is the only structure that connects to black. x-axis denotes the frames.

By altering the clustering parameters and the range of structures across which the clustering is performed, one can perform a variety of alternative clusterings. Once an interesting part of the trajectory has been identified, clustering can be performed on that region by constructing a list of pdb files (clustfils) within that region.

Second, the radius for clustering can be changed; smaller radii result in more (smaller) clusters, which are typically unmanageable and insignificantly distinct from one another. However, without some trial and error, it is impossible to determine what a productive and informative radius value should be.

Finally, one can investigate the clustering of pdb files aligned in various ways. Our alignment was based on the first structure's protein residues. One could try aligning to a specific area of the structure, or to a different structure, such as the lowest energy or even one of the first clustering analysis' nearest members.

For any queries, kindly drop your message in the comments section below.

Please also provide your valuable feedback and stay tuned for more such blogs!!

Happy Computing!!




Comments

Popular posts from this blog

Calculation of RMSD values between two ligand poses using web servers/tools

AMBER TUTORIAL-2: How to simulate a protein- ligand system: basic steps for md simulation in amber (AMBER 16)

Python script to plot 3D FEL plots