4.4 Principal component analysis (PCA)

Files for this tutorial (tutorial-4.4.tar.gz)

In this tutorial, we perform principal component analysis (PCA). PCA is useful to understand the dynamical behaviour of biomolecules by projecting the trajectory on small dimensions in terms of principal components (PCs). To perform PCA, we first use the eigmat_analysis tool, followed by the prjcrd_analysis tool, both of which are available in the GENESIS analysis tool sets. The former computes the eigenvalue and eigenvectors of the variance-covarience matrix, and the latter carries out a projection of the snapshot structures onto PC vectors.

In this tutorial, we use the output files obtained in tutorial 1.1 and tutorial 4.3. (Please work on these tutorials first if you have not done it yet.) Untar tutorial-4.4.tar.gz in the same directory as tutorial-1.1 and tutorial-4.3:

$ ls 
tutorial-1.1  tutorial-4.3
$ tar -zxf tutorial-4.4.tar.gz
$ ls
tutorial-1.1  tutorial-4.3  tutorial-4.4
$ cd tutorial-4.4
$ ls
run.sh  run1.inp  run2.inp

In tutorial-1.1, we carried out an MD simulation of a small protein, BPTI, in solution. The MD simulation was performed for 500,000 steps with crdout_period = 500, resulting in a trajectory file containing 1,000 snapshots. In tutorial-4.3, we analyzed RMSF of all residues in BPTI. Here, we calculate the PCs of BPTI and project all snapshots of the trajectory on to a plane of two PCs.

The control file, run1.inp, is for eigmat_analysis:

pcafile = ../tutorial-4.3/run2.pca # PCA file

valfile = run1.val   # VAL file
vecfile = run1.vec   # VEC file
cntfile = run1.cnt   # CNT file

We obtain valfile, vecfile, and cntfile, which contain eigenvalues, eigenvectors, and contribution of each mode to the overall dynamics, respectively. The following figure shows the plot of the accumulated contribution of each PC mode (4th column in cntfile).


The control file, run2.inp, for prjcrd_analysis is as follows:

reffile     = ../tutorial-1.1/1_setup/ionize.pdb    # PDB file
pdb_avefile = ../tutorial-4.3/run1_ave.pdb  # PDB file (Average coordinates)
pdb_aftfile = ../tutorial-4.3/run1_aft.pdb  # PDB file (Fitted Average coordinates)
valfile     = run1.val # VAL file
vecfile     = run1.vec # VEC file

prjfile     = run2.prj # PRJ file

In [INPUT] section, reffile is the reference structure used in tutorial-1.1, and pdb_avefile and pdbaftfile are the average coordinates of the analysis and fitting atoms, respectively, obtained in tutorial-4.3. valfile and vecfile are the output of eigmat_analysis.

[TRAJECTORY], [SELECTION], and [FITTING] sections, which follow the above, are the same as before. See tutorial-4.1 for details.

check_only     = NO       # (YES/NO)
vcv_matrix     = Global   # (GLOBAL/LOCAL)
num_pca        = 2        # # of principal components
analysis_atom  = 1        # analysis target atom group

In [OPTION] section, we specify the number of PCs to project the trajectory. The PCs are selected in a decreasing order of their contributions.

The analysis is executed by run.sh script:

$ cat run.sh

# set the number of OpenMP threads

# perform analysis with eigmat_analysis and prjcrd_analysis
$ eigmat_analysis run1.inp | tee run1.out
$ prjcrd_analysis run2.inp | tee run2.out

Now, let’s run the script:

$ chmod +x run.sh
$ ./run.sh
$ ls
run.sh    run1.cnt  run1.inp  run1.out  run1.val  
run1.vec  run2.inp  run2.out  run2.prj

We obtain run2.prj which contains the projection of each structure onto the calculated principal component axes. The i-th principal component μi of each snapshot is defined by the dot product:

Latex formula

where Latex formula  is the i-th eigenvector, q is the coordinates of each snapshot in the trajectory, and Latex formula is the
averaged coordinates. The following figure is the plot of 2nd and 3rd columns of run2.prj:

$ gnuplot 
gnuplot> set xlabel "PC1"
gnuplot> set ylabel "PC2"
gnuplot> set xrange [-1:1]
gnuplot> set yrange [-1:1]
gnuplot> set size square
gnuplot> plot "run2.prj" using 2:3



Written by Kiyoshi Yagi@RIKEN Theoretical molecular science laboratory
July, 17, 2016