Methods and Theory

General overview of the methodology
The DNA model
Ensemble simulation using Monte Carlo
Generating DNA half-chains with Monte Carlo
Monte Carlo in DNA chain generation
Random Numbers and Gaussian Distributions
Joining half-chains
The "independent calculations" concept
Analyzing the results
Statistical analysis of the distribution functions
Calculation of the J-factor

1) General Overview of the Methodology

The DNA model

It is assumed that macroscopic structure of double stranded DNA can be modeled accurately as a single-chain, worm-like polymer with the position and direction of a single base pair defined by a coordinate reference frame (Levene and Crothers, 1983). The X and Y axes of this reference frame lie in the plane of the base pair with the Y axis lying along the long axis of the dinucleotide base pair. The Z-axis is normal to the plane of the base pair and parallel to the helix axis.

Three angular parameters (tilt, roll and twist) define the position and direction of one base pair relative to the next. The tilt angle (theta) is defined as a rotation about the X axis, the roll angle (phi) is about the Y axis and the twist angle (tau) is rotation about the Z axis. For instance, defining these angles between two base pairs in a stretch of perfectly straight B-form DNA would be: tilt=0, roll=0 and twist=36 (assuming a helical repeat of exactly 10 bp).

Thus, a single DNA chain of N base pairs in length can be described by 3*(N-1) parameters defining the tilt, roll and twist angles between each base pair, assuming the first base pair has a known fixed position.

Ensemble simulation using Monte Carlo

Each of the three angles for each base pair may be allowed to undergo thermal motions. These thermal motions are modeled as a gaussian distribution about the average values: To simulate an ensemble of DNA chains, monte carlo methods are employed to generate many single chains. This approach to simulating DNA cyclization rates by Monte Carlo works in the following manner. First, a series of "half-chains" are generated using Monte Carlo to determine the tilt, roll and twist between each base pair in the DNA chain using the C program generate_chains. Second, these half-chains are joined to form a complete "whole-chain" and analyzed for meeting the defined criteria of a cyclized DNA (see below for how these criteria are defined) using the C program analyze_chains.

2) Generating DNA half-chains with Monte Carlo

This program takes as input a "sequence.in" file, a random number seed, the number of half-chains to generate and the number of chain partitions to simulate.

The sequence.in file (which can be generated by the CYCLIZE PERL subroutines) contains six values per base pair step describing the chain's tilt, tilt_flex, roll, roll_flex, twist and twist_flex. Each of these values is in degree units. The flexibility parameters describes the range of the thermally induced gaussian fluctuations for each of the base pair's tilt, roll and twist angles.

The program reads this information, and chops the chain into two partitions (ie: a 156 nt DNA becomes two partitions from 1-78 bps and 79-156 bps). The Monte Carlo routine "generate_chain" (found in cyclize_lib.c) is then called a "number of half-chain" times for each partition. The half-chains generated for the partitions are then stored in binary form in the files "chain_1.dat" and "chain_2.dat".

Finally, a report is generated called 'generate_chains.header", which contains gobs of information on how the chain generation was performed.

Monte Carlo in DNA chain generation

CYCLIZE assumes that a DNA chain can be modeled dynamically by assigning six free parameters to each "step" between base pairs (bp). The first three parameters are tilt, roll and twist, which define the average Euler angles the next bp makes WRT the previous bp.

tilt: = angle about the X-axis
roll: = angle about the Y-axis
twist: = angle about the Z-axis (perpendicular to the plane of the base pair)

For instance, if you set tilt=0, roll=0 and twist=36, this is defining that the next bp is, on average, planar with the previous bp (tilt=0 and roll=0) but has a 36 degree twist about the Z-axis (ie: B-form DNA has an approx 36 degree twist from one base pair to the next).

Random Numbers and Gaussian Distributions

Two random number generators (RNG) have been coded into CYCLIZE. The first is from the kind folks at Numerical Recipies called "ran2()" and the second is from the kind folks from the Jorgensen lab at Yale University (JM Briggs and WL Jorgensen) called "ranu()". Both functions can be found in the source code file "random.c". CPU utilization benchmarks show that both routines have nearly the same speed, so either is a good choice. CYCLIZE by default uses the Jorgensen ranu function because, why not, if it is good enough for them it is good enough for us!

Three functions that return random numbers with a gaussian distribution (a normal distribution, in stats speak) have been coded into CYCLIZE and all can be found in the source code file named "math_routines.c". The first function is called gauss and calculates a normal distribution by averaging 12 random numbers from 0-1, this routine is VERY slow, due to the large number of calls to the RNGs. The second is called gauss_box_muller and is from the Numerical Recipies people using the Box-Muller algorithm (see page 289 of "Numerical Recipies in C", 2nd ed.). The third routine is called gauss_dev2 was found on the internet somewhere. The gauss_box_muller routines makes only two calles to the RNG on average (is this always true?) and is the fastest of the routines and is thus what is used by default for CYCLIZE.

3) Joining half-chains

The "independent calculations" concept

Setting the number of independent calculations to 100 increases the reproducibility of the resultant J factor with only a minor increase in CPU time due to the extra chain generations.

4) Analyzing the results

Statistical analysis of the distribution functions

Below is a quick description of the methods used in fitting the distribution functions calculated by the analyze_chains program. See the C source file fit.c for most of the functions decribed in this section.

Each distribution function is stored seperately every 10% of the calculation, allowing for the calculation of the "summed" distribution function and the standard deviation of each point in this summed distribution function. All discussions below are concerning this "summed" distribution function.

Radial Distribution

The radial distribution function is stored as an integer array in which each position (or "bin" in stats speak) of the array represents an additional 1 angstrom end-to-end distance for the chain. Thus, the 6th bin of the distribution array will count the number of chains with an end-to-end distance between 5.0-6.0 angstroms.

After the simulation is finished, each bin of the distribution array is converted to a "molarity" distribution function, rather than a "counting" distribution function. This is accomplished by dividing the "fractional moles of chains" by the "shell volume". The best way to explain this is to demonstrate the actual formula.

Variable definitions

f           = fractional # of chains in the current bin
chains_bin  = number of chains in the current bin
nR_analyzed = number of chains stored in all bins
moles_chain = fractional moles of a chain in the current bin
volume_bin  = volume of the current bin
Na          = Avagadro's number
R1          = smallest radius allowed in the current bin
R2          = largest radius allowed in the current bin
W_bin       = "molarity" of the current bin

Equations used:

f          = chains_bin / nR_analyzed
moles_bin  = f / Na
volume_bin = 4/3 * PI * [R2^3 - R1^3]
W_bin      = moles_bin / volume_bin

The final radial "molarity" distribution function is fit by a even powered quadratic function: Y=A + BX^2 + CX^4 using Single Value Decomposition. Only the first 30% of the function is used in the fitting, since we ultimately only want to know what the Y intercept is, there is no need to fit points far from X=0.

Axial Distribution

The axial distribution function is fit by a even powered quadratic function: Y=A + BX^2 + CX^4 using Single Value Decomposition.

Torsional Distribution

The torsional distribution function is fit by a gaussian function: Y=A * exp[ - ( (x-B)/(1.414*C) )^2 ] using the Leveberg-Marquardt methodology.

Calculation of the J-factor

Jon Lapham
Yale University
Department of Chemistry - Crothers Lab