Mimicking Google's Pregel: An Introduction to Apache Hama

Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework modeled after Google's Pregel. It is intended for large-scale scientific computing, in particular matrix and graph computations. The BSP model was proposed in 1990 by Valiant (winner of the 2010 Turing Award); see Wikipedia for details. In 2009 Google published the paper "Pregel: A System for Large-Scale Graph Processing", implementing the BSP model in a distributed setting.

Installing Hama

Environment:
OS: Ubuntu 12.04 64-bit
Java: jdk1.6.0_30
Hadoop: hadoop-1.0.4

Before installing Hama, make sure Hadoop is already installed on the system; here I use the then-latest hadoop-1.0.4.

Step 1: Download and unpack

Hama download: http://mirror.bit.edu.cn/apache/hama/0.6.0/ (I use the Beijing Institute of Technology Apache mirror.) Unpack the archive into the installation directory. I like to install both Hadoop and Hama under my home directory, which keeps the whole system tidy.

tar -xvzf hama-0.6.0.tar.gz

Step 2: Edit the configuration files

Go to the $HAMA_HOME/conf directory.

Edit hama-env.sh and set the JAVA_HOME variable.

Edit hama-site.xml. My hama-site.xml looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>bsp.master.address</name>
    <value>LenovoE46a:40000</value>
    <description>The address of the bsp master server. Either the literal
    string "local" or a host:port for distributed mode.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>LenovoE46a:9000</value>
    <description>The name of the default file system. Either the literal
    string "local" or a host:port for HDFS.</description>
  </property>
  <property>
    <name>hama.zookeeper.quorum</name>
    <value>LenovoE46a</value>
    <description>Comma separated list of servers in the ZooKeeper Quorum.
    For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
    By default this is set to localhost for local and pseudo-distributed
    modes of operation. For a fully-distributed setup, this should be set
    to a full list of ZooKeeper quorum servers. If HAMA_MANAGES_ZK is set
    in hama-env.sh this is the list of servers which we will start/stop
    zookeeper on.</description>
  </property>
  <property>
    <name>hama.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>

A quick explanation: bsp.master.address is the address of the BSP master. fs.default.name is the address of the Hadoop namenode. hama.zookeeper.quorum and hama.zookeeper.property.clientPort configure ZooKeeper; point them at the ZooKeeper quorum servers. For a single-machine pseudo-distributed setup that is simply the local host.

Step 3: Start Hama

Start Hadoop first:

% $HADOOP_HOME/bin/start-all.sh

Then start Hama:

% $HAMA_HOME/bin/start-bspd.sh

Check the running processes with jps to verify that everything started successfully.

Step 4: Run an example program

Here we use the PageRank example. First upload the input data to HDFS. The format is one site per line, with its outgoing links separated by tabs:

Site1\tSite2\tSite3
Site2\tSite3
Site3

Then run Hama, where /tmp/input/input.txt is the input file and /tmp/pagerank-output is the output directory:

bin/hama jar ../hama-0.6.0-examples.jar pagerank /tmp/input/input.txt /tmp/pagerank-output

Success!

Source: http://www.cnblogs.com/DingaGa/archive/2012/12/16/2820331.html
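For concreteness, here is a minimal shell sketch of Step 4 end to end. It assumes Hama 0.6.0 is unpacked under the home directory and that the hadoop command is on the PATH; the file names and HDFS paths simply follow the ones used above, and the part-00000 output name is an assumption about how the example writes its results:

# Build the tab-separated adjacency list locally (printf expands \t to real tabs).
printf 'Site1\tSite2\tSite3\nSite2\tSite3\nSite3\n' > input.txt

# Upload it to HDFS where the example expects it.
hadoop fs -mkdir /tmp/input
hadoop fs -put input.txt /tmp/input/input.txt

# Run the PageRank example; results land in /tmp/pagerank-output on HDFS.
cd ~/hama-0.6.0
bin/hama jar hama-0.6.0-examples.jar pagerank /tmp/input/input.txt /tmp/pagerank-output

# Inspect the output (assuming the usual part-file naming).
hadoop fs -cat /tmp/pagerank-output/part-00000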
Source: http://www.nsc.liu.se/~pla/blog/2012/09/26/vaspkpar/

Testing the K-point Parallelization in VASP
Sep 26th, 2012

VASP 5.3.2 finally introduced official support for k-point parallelization. What can we expect from this new feature in terms of performance? In general, you only need many k-points in relatively small cells, so up front we would expect k-point parallelization to improve time-to-solution for small cells with hundreds or thousands of k-points. We do have a subset of users at NSC running big batches of these jobs, so this may be a real advantage in the prototyping stage of simulations, when the jobs are set up. In terms of actual job throughput for production calculations, however, k-point parallelization should not help much, as peak efficiency is reached already with 8-16 cores on a single node.

So let's put this theory to the test. Previously, I benchmarked the 8-atom FeH system with 400 k-points for this scenario. The maximum throughput was achieved with two 8-core jobs running on the same node, and the best time-to-solution was 3 minutes (20 jobs/h) with 16 cores on one compute node. What can k-point parallelization do here?

KPAR is the new parameter which controls the number of k-point parallelized groups. KPAR=1 means no k-point parallelization, i.e. the default behavior of VASP. For each bar in the chart, the NPAR value has been individually optimized (and is thereby different for each number of cores). Previously, this calculation did not scale at all beyond one compute node (blue bars), but with KPAR=8 (purple bars), we get close to linear (1.8x) speed-up going from 1 to 2 nodes, cutting the time-to-solution in half.

As suspected, in terms of efficiency, the current k-point parallelization is no more efficient than the old scheme when running on a single node, which means that peak throughput remains the same at roughly 24 jobs/h per compute node. This is a little surprising, given that there should be overhead associated with running two jobs simultaneously on a node, compared to using k-point parallelization. What must be remembered, though, is that it is considerably easier to handle the file and job management for several sequential KPAR runs than to juggle several jobs per node with many directories, so in this sense KPAR seems like a great addition with respect to workflow optimization.

Posted by Peter Larsson, Sep 26th, 2012

Source: http://www.nsc.liu.se/~pla/blog/2012/11/26/vaspkpar2/

K-point Parallelization in VASP, Part 2
Nov 26th, 2012

Previously, I tested the k-point parallelization scheme in VASP 5.3 for a small system with hundreds of k-points. The outcome was acceptable, but less than stellar. Paul Kent (who implemented the scheme in VASP) suggested that it would be more instructive to benchmark medium to large hybrid calculations with just a few k-points, since this was the original use case, and consequently where you would see the most benefit. To investigate this, I ran a 63-atom MgO cell with the HSE06 functional and 4 k-points over 4 to 24 nodes.

A suitable number of bands here is 192, so the maximum number of nodes we could expect to use with standard parallelization is 12, since 12 nodes x 16 cores/node = 192 cores. And we do see that KPAR=1 flattens out at 1.8 jobs/h on 12 nodes. But with k-point parallelization, the calculation can be split into "independent" groups, each running on 192 cores.
This enables us, for example, to run the job on 24 nodes using KPAR=2, which in this case translates into a doubling of speed (4.0 jobs/h) compared to the best-case scenario without k-point parallelization. So there is indeed a real benefit for hybrid calculations on cells that are small enough to need a few k-points. And remember that in order for the k-point parallelization to work correctly with hybrids, you should set NPAR = total number of cores / KPAR.
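To make that rule concrete, here is a minimal sketch of how such a hybrid job might be set up, assuming the 24-node, 16-cores-per-node case from the benchmark. LHFCALC and HFSCREEN are the standard INCAR tags for an HSE06-type hybrid, but the script layout and the launcher invocation are placeholders, not the author's actual setup:

# 24 nodes x 16 cores/node = 384 cores, split into 2 k-point groups of 192 cores each.
CORES=384
KPAR=2
NPAR=$((CORES / KPAR))    # NPAR = total number of cores / KPAR = 192

cat >> INCAR <<EOF
LHFCALC = .TRUE.   ! HSE06-type hybrid functional
HFSCREEN = 0.2
KPAR = $KPAR       ! number of k-point parallelized groups
NPAR = $NPAR       ! band parallelization within each group
EOF

mpirun -np $CORES vasp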
Source: http://cms.mpi.univie.ac.at/vasp/guide/node138.html

The optimum setting of NPAR and LPLANE depends very much on the type of machine you are running on. Here are a few guidelines:

SGI Power Challenge: Usually one is running on a relatively small number of nodes, so load balancing is no problem. The communication bandwidth is also reasonably good on SGI Power Challenge machines. Best performance is often achieved with

LPLANE = .TRUE.
NPAR = 1
NSIM = 1

Increasing NPAR usually worsens performance. For NPAR=1 we have in fact observed superlinear scaling w.r.t. the number of nodes in many cases. This is because the cache on the SGI Power Challenge machines is relatively large (4 Mbytes); as the number of nodes increases, the real space projectors (or reciprocal projectors) can be kept in the cache, and cache misses therefore decrease significantly.

SGI Origin: The SGI Origin behaves quite differently from the SGI Power Challenge, mainly because the memory bandwidth is a factor of three higher than on the SGI Power Challenge. The following setting seems to be optimal when running on 4-16 nodes:

LPLANE = .TRUE.
NPAR = 4
NSIM = 4

Contrary to the SGI Power Challenge, superlinear scaling could not be observed, evidently because data locality and cache reuse are only of minor importance on the Origin 2000.

T3D, T3E: On many T3D and T3E platforms one is forced to use a huge number of nodes. In that case load balancing problems and problems with the communication bandwidth are likely to be experienced. In addition, the cache is fairly small on T3E and T3D machines, so it is impossible to keep the real space projectors in the cache with any setting. Therefore, we recommend setting NPAR on these machines to roughly the square root of the number of nodes (explicit timing can be helpful to find the optimum value). The use of LPLANE = .TRUE. is only recommended if the number of nodes is significantly smaller than NGX, NGY and NGZ. In summary, the following setting is recommended:

LPLANE = .FALSE.
NPAR = sqrt(number of nodes)
NSIM = 1

Source: http://www.nsc.liu.se/~pla/blog/2012/02/22/nparnsim/

Optimizing NPAR and NSIM
Feb 22nd, 2012

Our VASP users at NSC sometimes ask how to set NSIM and NPAR for good parallel performance in VASP. I wrote about NPAR before, but what about the NSIM parameter? The VASP manual says that NSIM=4 by default. It means that 4 bands are optimized in parallel in the RMM-DIIS algorithm, which allows VASP to exploit matrix-matrix BLAS operations instead of matrix-vector operations. But the NSIM/NPAR parameters should be adjusted to the actual underlying hardware (network, type of processor, caches, etc.).

Here are some results for the 24-atom PbSO4 cell running on a single Kappa compute node. Each bar in the chart represents the average of three runs. It looks like NPAR and NSIM are largely independent factors, with NPAR being the more important one. Varying NPAR can give a performance boost of up to ca 50%, while varying NSIM gives about 10%. The internal variability between runs is less than 1% in this case, so the differences are real. We can conclude that NPAR=1 is optimal for a single-node job, as expected, and that NSIM=2 might be beneficial instead of keeping the default NSIM.

A more realistic example is a 128-atom Li2FeSiO4 supercell. This one we run on 4 nodes (32 cores) on Matter. It is a highly symmetric case with 512 bands. Like before, 3 runs were made for each data point. We find the best performance for NPAR=2/4, in line with previous results.
But here, the default NSIM=4 setting seems to produce the worst performance, and the influence of NSIM is larger (up to +20% speed). The optimal choice seems to be NSIM=16. It is tempting to conjecture that NSIM should be increased even more for larger jobs.

To investigate the upper limit of VASP jobs, let us look at the NiSi supercell with 504 atoms. It takes about 23 minutes to run a full SCF cycle on 32 Matter nodes. The outcome is not so encouraging, however: NSIM=16 does not deliver an increase in performance, and the influence of smaller NSIM values is dwarfed by other measurement errors. In this case, a likely culprit is the variation in MPI performance when running over many Infiniband switches. So for large jobs, NSIM seems to make less difference, and you can leave it at the default value.

In conclusion:
Use NPAR = 1 and NSIM = 2 for single-node jobs.
Use NPAR = nodes/2 (or nodes) and NSIM = 2 for medium jobs. If you have enough bands per core and want to optimize, you can try NSIM=8/16 and see if it helps.
Use NPAR = sqrt(nodes) and NSIM = 4 for large jobs.

Source: http://www.nsc.liu.se/~pla/blog/2013/03/25/vaspabisko/

Test setup

Here, we will be looking at the Li2FeSiO4 supercell test case with 128 atoms. I am running a standard spin-polarized DFT calculation (no hybrid), which I run to self-consistency with ALGO=fast. I adjusted the number of bands to 480 to better match the number of cores per node.

Naive single node test

A first (naive) test is to characterize the parallel scaling within a single compute node, without doing anything special such as process binding. Basically, this is what you get when you ask the queue system for 12, 16, 24, 36, or 48 cores on 1 node with exclusive running rights (no other jobs on the same node), and you just launch VASP with srun $VASP in the job script. After tuning the NPAR values, we get nice intra-node scaling. In fact, it is much better than expected, but we will see in the next section that this is an illusion. The optimal choice of NPAR turned out to be:

12 cores: NPAR=1
16 cores: NPAR=1
24 cores: NPAR=3
36 cores: NPAR=3
48 cores: NPAR=6

This was also surprising, since I had expected NPAR=8 to be optimal. With that setting, there would be MPI process groups of 6 ranks, which exactly fit in a NUMA zone. Unexpectedly, NPAR=6 seems optimal when using all 48 cores, and either NPAR=1 or NPAR=3 for the other cases. This does not fit the original hypothesis, but a weakness in our analysis is that we don't actually know where the processes end up in this scenario, since there is no binding. The only way you can get a symmetric communication pattern with NPAR=6 is to place ranks in a round-robin scheme around each NUMA zone or socket. Perhaps this is what the Linux kernel is doing? An alternative hypothesis is that the unconventional choice of NPAR creates a load imbalance that may actually be beneficial, because it allows for better utilization of the second core in each module. To explore this, I decided to test different binding scenarios.

The importance of process binding

To bind MPI processes to a physical core and prevent the operating system from moving them around inside the compute node, you need to give extra flags to either srun or your MPI launching command such as mpirun. On Abisko, we use srun, where binding is controlled through SLURM by setting, e.g. in the job script:

srun --cpu_bind=rank ...

This binds MPI rank #1 to core #1, and so on in a sequential manner.
It is also possible to explicitly specify where each rank should go. The following example binds 24 ranks to alternating cores, so that there is one rank running per Interlagos module:

srun --cpu_bind=map_cpu=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46 ...

In this scheme, neighboring ranks are close to each other: i.e. rank #1 and #2 are in the same NUMA zone. The aim is to maximize NUMA locality.

The third type of binding I tried was to distribute the ranks in a round-robin scheme in steps of 6 cores. The aim here is to minimize NUMA locality, since neighboring ranks are far apart from each other: i.e. rank #1 and #2 are in different NUMA zones.

srun --cpu_bind=map_cpu=0,6,12,18,24,30,36,42,2,8,14,20,26,32,38,44,4,10,16,22,28,34,40,46 ...

Below are the results when comparing the speed of running with 48 cores and 24 cores with different kinds of process binding. The 48-core runs are with NPAR=6 and the 24-core runs with NPAR=3. It turns out that you can get all of the performance, and even more, by running with 24 cores in the "fat" mode. The trick is, however, that we need to enable the process binding ourselves. It does not happen by default when you run with half the number of cores per node (the "None" section in the graph). We can further observe that straight sequential process binding actually worsens performance in the 48-core scenario. Only in the round-robin NUMA scheme ("RR-NUMA") can we reproduce the performance of the unbound case. This leads me to believe that running with no binding gets you into a similar situation with broken NUMA locality, which explains why NPAR=3/6 is optimal, and not NPAR=4.

The most surprising finding, however, is that the top speed was achieved not with the "alternate" binding scheme, which emphasizes NUMA memory locality, but rather with the round-robin scheme, which breaks the memory locality of NPAR groups. The difference in speed is small (about 3%), but statistically significant. There are few scenarios where this kind of interleaving over NUMA zones is beneficial, so I suspect that it is not actually a NUMA issue, but rather related to memory caches. The L3 cache is shared between all cores in a NUMA zone, so perhaps the L3 cache is being thrashed when all the ranks in an NPAR group are accessing it? It would be interesting to try to measure this effect with hardware counters.

NSIM

Finally, I also made some tests with varying NSIM. NSIM=4 is the default setting in VASP, and it usually gives good performance in many different scenarios. NSIM=4 works on Abisko too, but I gained about 7% by using NSIM=8 or 12. An odd finding was that NSIM=16 completely crippled the performance, doubling the wall time compared to NSIM=4. I have no good explanation, but it seems one should be careful with too high NSIM values on Abisko.

Conclusion and overall speed

In terms of absolute speed, we can compare with Triolith, where one node with 16 cores can run this example in 380s (9.5 jobs/h) with 512 bands, using the optimal settings of NPAR=2 and NSIM=1. So the overall conclusion is that one Abisko node is roughly 20% faster than one Triolith node. You can easily become disappointed by this when comparing the performance per core, which is 2.5x higher on Triolith, but I think it is not a fair comparison.
In reality, the performance difference per FPU is more like 1.3x, and if you compensate for the fact that the Triolith processors actually run at a much higher frequency than the listed 2.2 GHz, the true difference in efficiency per core-GHz is closer to 1.2x. Hopefully, I can make some multi-node tests later and determine whether running with 24 cores per node and round-robin binding is the best choice there as well.

Posted by Peter Larsson, Mar 25th, 2013
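Pulling the Abisko findings together, here is a hedged sketch of a SLURM job script for the 24-cores-per-node "fat" mode with the round-robin NUMA binding quoted above. NPAR=3 and NSIM=8 are simply the values that did well in these tests; the vasp binary name and the assumption that an INCAR already exists in the working directory are placeholders:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24   # "fat" mode: one rank per Interlagos module (half of 48 cores)
#SBATCH --exclusive

# Parallelization settings that performed best above (24 cores -> NPAR=3; NSIM=8 gave ~7%).
echo "NPAR = 3" >> INCAR
echo "NSIM = 8" >> INCAR

# Round-robin binding over NUMA zones in steps of 6 cores (the "RR-NUMA" map from the text).
srun --cpu_bind=map_cpu=0,6,12,18,24,30,36,42,2,8,14,20,26,32,38,44,4,10,16,22,28,34,40,46 vasp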
The "Similarity Principle" in Theoretical Computer Science

The "similarity principle" from the entry on theoretical computer science in the Encyclopedia of China (《中国大百科全书》), excerpted as follows:

So, is there a relatively unified standard for complexity? The similarity principle answers this question. According to this principle, the parallel time, space, and serial time needed to compute a problem class show no essential difference across all ideal models of computation. In mathematical language: the various models can not only simulate one another, but the parallel time, space, and serial time needed by the simulator are each bounded by a polynomial in the parallel time, space, and serial time needed by the simulated machine. So, up to a polynomial, complexity has an objective basis that does not depend on the model. For all the models mentioned above, and for the various types of computation, the similarity principle has been proved correct.

For serial models of computation one can define a virtual parallel time, called the tour. The tour is the number of phases in a computation, where a phase is a stage of the computation during which newly computed information written to the work space may not be read within the same stage. For a multi-tape Turing machine, for example, a phase is a period during which the work-tape heads do not change direction, so the tour corresponds to the total number of times the work-tape heads reverse direction. One can therefore say that a problem class has a fast parallel algorithm if and only if it has a serial algorithm with a small tour number (virtual parallel time). In addition, parallel time and space have certain symmetric properties; for example, it can be proved that they are polynomially related. So a problem class has a fast parallel-time algorithm if and only if it has a highly memory-efficient algorithm.

But I cannot understand what this means. Could you please explain it, real experts? Thank you! Who proved this principle? Was the original in Chinese or English? Where was it published? Where can one now find an accurate and detailed introduction in English? Please advise, real experts! Thank you!

—————————————————————
Appendix
—————————————————————

http://210.45.210.9:918/web/index.htm

lilun jisuanji kexue
理论计算机科学 (Volume: Mathematics)
theoretical computer science

Complexity theory and algorithm analysis

The logical continuation of computability theory is computational complexity theory. The famous Turing-Church thesis says that all reasonable models of computation are equivalent: whatever one model can compute, the others can compute as well, and what one cannot compute, none can. But for a problem class, knowing only whether it can be computed is far from enough; what matters in practice is how much time the computation takes, how much memory it needs, and so on. This gave rise to computational complexity theory and algorithm analysis.

The time, storage, and so on needed by a computation are counted as resources. Strictly speaking, the definition of each resource depends on a model of computation. Although the resources are defined differently in different models, there are three main ones:

1. Serial time (time for short): the total amount of computation, i.e. the total time needed to complete the computation when it is divided into primitive steps.
2. Space: the amount of room needed to store the intermediate results of the computation. For instance, if a blackboard is used as scratch paper during the computation, each unit of blackboard area can hold one symbol, and symbols can be erased and rewritten, then space corresponds to the blackboard area needed for the scratch work.
3. Parallel time: the time needed for a parallel computation, i.e. for several people or several machines to solve a problem cooperatively.

Complexity is always discussed relative to a particular problem class, which contains infinitely many individual problems, large and small. For matrix multiplication, say, multiplying 100 x 100 matrices is a relatively large problem, while multiplying 2 x 2 matrices is a small one, so the order n can serve as a measure of problem size. Most generally, the total input length n is taken as the measure of problem size. Given an algorithm, the time, space, and so on needed to compute a problem of size n can then be expressed as functions of n. Such a function is called the time or space complexity measure of the algorithm, or its complexity. Strictly speaking, it is the complexity of this particular problem class, in one particular model of computation, under one particular algorithm. As the problems to be solved grow larger, at what rate does the consumption of time, space, and the other resources grow, i.e. what is the order of growth of this function as n tends to infinity? This is the question complexity theory is concerned with.

Different algorithms for the same problem class can be better or worse. For example, to decide whether an undirected graph with n vertices contains a cycle, the early algorithms needed space S(n) = O(log² n), but later a more refined algorithm was designed whose space complexity is only O(log n). This shows that O(log² n) was merely the space complexity of the early algorithm, not the intrinsic complexity of the problem itself; in other words, O(log² n) is only an upper bound on the space complexity of the cycle problem, and O(log n) is a new, better upper bound. For the cycle problem, any algorithm needs work space at least proportional to log n; that is, for every algorithm the space complexity is S(n) = Ω(log n). Hence Ω(log n) is a lower bound on the space complexity of the cycle problem. The intrinsic complexity of the problem lies between the upper and lower bounds. In this example the two coincide, so one may say the space complexity of the cycle problem is proportional to log n, written S(n) = Θ(log n).

Similarly, multiplying two n-digit binary integers on a multi-tape Turing machine takes O(n²) with the ordinary algorithm, but improved algorithms achieve O(n^1.5), and the best algorithm now known achieves O(n log n log log n). With a storage modification machine as the model, O(n) can be achieved. One sees that the intrinsic computational complexity of a problem also varies with the model of computation.

The more important and distinctive models of computation include multi-tape Turing machines; multi-tape, multi-dimensional, multi-head Turing machines; hardware modification machines (HMM); random access machines (RAM); parallel random access machines (PRAM); vector machines; VLSI; and so on. There is, however, no single unified model suitable for all problems. So, is there a relatively unified standard for complexity? The similarity principle answers this question. According to this principle, the parallel time, space, and serial time needed to compute a problem class show no essential difference across all ideal models of computation. In mathematical language: the various models can not only simulate one another, but the parallel time, space, and serial time needed by the simulator are each bounded by a polynomial in the parallel time, space, and serial time needed by the simulated machine. So, up to a polynomial, complexity has an objective basis that does not depend on the model. For all the models mentioned above, and for the various types of computation, the similarity principle has been proved correct.

For serial models of computation one can define a virtual parallel time, called the tour: the number of phases in a computation, where a phase is a stage during which newly computed information written to the work space may not be read within the same stage. For a multi-tape Turing machine, a phase is a period during which the work-tape heads do not change direction, so the tour corresponds to the total number of head reversals. One can therefore say that a problem class has a fast parallel algorithm if and only if it has a serial algorithm with a small tour number (virtual parallel time). In addition, parallel time and space have certain symmetric properties; for example, it can be proved that they are polynomially related. So a problem class has a fast parallel-time algorithm if and only if it has a highly memory-efficient algorithm.

Since, up to polynomial relation, complexity does not depend on the model, one can write P for the class of all problem classes with polynomial-time algorithms; NP for all those with nondeterministic polynomial-time algorithms; NC for all those solvable simultaneously in polylogarithmic parallel time and polynomial space; and so on. It is generally held that only the problems in P are realistically computable, but many practical problems belong to NP and yet no deterministic polynomial-time algorithm has been found for them. Many people therefore conjecture that P ≠ NP. In 1971, Cook found a problem class in NP, the satisfiability problem for Boolean formulas in conjunctive normal form (SAT), and proved that the computation of any problem class in NP can be reduced to the computation of SAT. SAT is therefore called an NP-complete problem (see combinatorial optimization); NP = P if and only if there is a deterministic polynomial-time algorithm for SAT. Later, Karp and others reduced SAT to many other combinatorial problems, obtaining many further NP-complete problems. NP-complete problems now touch almost every branch of mathematics, and a dedicated theory has formed around them. Whether NP equals P remains unsolved; some conjecture that the question is independent of the existing axiom systems of mathematics. Completeness research is not limited to NP-completeness: every complexity class has its complete problems.

On the foundation of basic complexity theory, the study of the complexity of concrete classes of mathematical problems constitutes the main content of algorithm analysis, which has made great strides since the 1960s. For example: the n-dimensional fast Fourier transform and the multiplication of polynomials of degree n need only O(n log n) arithmetic operations; multiplying n-digit numbers on a multi-tape Turing machine needs only O(n log n log log n) time; deciding whether an n-digit number is prime takes time O(...); in the bounded-degree case, the graph isomorphism decision problem has a polynomial-time algorithm; deciding whether an integer-coefficient linear program has a rational solution has a polynomial-time algorithm; multiplying n x n matrices needs only 4.7 n^2.8 arithmetic operations, later improved to O(n^…), and it has been shown that this order can be improved indefinitely; and so on.
In contrast with upper bounds, the study of lower bounds has run into many difficulties, especially for problems of practical significance. For example, every NP-complete problem is conjectured to require at least exponential time, yet after many years not even a nonlinear lower bound has been proved.

Besides computational complexity there is also descriptive complexity, introduced by A. N. Kolmogorov and Chaitin. The information content I(w) of a 0-1 string w is defined as the length of the shortest program that outputs w. Since the various models can simulate one another, this definition is independent of the model up to a constant. Because any mathematical axiom system contains only a finite amount of information, no theorem of the form I(w) ≥ c can be proved within the system, where c is a constant depending only on the axiom system. Yet it is easy to see that all but finitely many w satisfy I(w) ≥ c (though not a single instance can be proved). This result simplifies and strengthens Gödel's incompleteness theorem.

Hong Jiawei (洪加威)
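The encyclopedia states the similarity principle only in words. As a reading aid, here is one way to write down its content in LaTeX; the notation (T_par for parallel time, S for space, T_ser for serial time) is my own paraphrase, not the encyclopedia's:

\[
\exists\,\text{polynomial } p\ \forall n:\quad
T_{\mathrm{par}}^{M_1}(n)\le p\!\left(T_{\mathrm{par}}^{M_2}(n)\right),\quad
S^{M_1}(n)\le p\!\left(S^{M_2}(n)\right),\quad
T_{\mathrm{ser}}^{M_1}(n)\le p\!\left(T_{\mathrm{ser}}^{M_2}(n)\right),
\]

whenever the ideal model \(M_1\) simulates the ideal model \(M_2\). The symmetry between parallel time and space asserted in the text then reads

\[
T_{\mathrm{par}}(n)\le p\!\left(S(n)\right)
\quad\text{and}\quad
S(n)\le p\!\left(T_{\mathrm{par}}(n)\right),
\]

which is exactly the statement that a problem class has a fast parallel-time algorithm if and only if it has a highly memory-efficient (small-space) algorithm.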
The rapid development of computer technology and the continual improvement of algorithms and the corresponding software have made it possible to simulate relatively large biomacromolecular systems and multimeric molecules on the ordinary PCs we have at hand, and this trend will only grow stronger. In particular, the arrival of multi-core CPUs and the increasing large-scale parallelization of molecular dynamics make it feasible for people doing biological research to run targeted simulations of proteins of interest on their personal computers and integrate them closely with wet-lab experiments, which is a very interesting and fun prospect.

Even so, many biomacromolecular systems are still very large. Take one of my systems: the adenovirus hexon trimer protein, which I want to subject to temperature-controlled molecular dynamics simulations in order to dynamically analyze its overall epitope composition, its heat-denaturation mechanism, and the immunogenic consequences of heat denaturation. The system contains about 940 × 3 = 2820 amino acid residues; with a cubic water box added, it totals about 200,000 atoms. On a quad-core QX9650 CPU under 64-bit Linux, Gromacs takes about 27 days per 10 ns of simulation, which is painfully slow; for an overall design of about 500 ns, this level of computing power is unbearable.

GPU-accelerated Gromacs brings very encouraging news: officially, Nvidia's CUDA technology can speed up MD simulation by more than ten times over a single CPU. Below is my complete walkthrough of running molecular dynamics simulations on an Nvidia GTX460 2G, shared here for everyone.

Day 1:

The Nvidia GTX460 with its large 2 GB of video memory should, according to the gmx website (www.gromacs.org/gpu), handle systems of around 200,000 atoms; my system happens to be about 190,000.

Hardware: Dell T3400 workstation, X38 motherboard, QX9650 CPU, 4 GB ECC RAM, GTX460 with 2 GB of video memory
Software: Ubuntu 9.10 64-bit, CUDA 3.1, OpenMM 2.0, FFTW 3.2.2, CMake, Gromacs 4.5.1

Following the official instructions, I first installed the CPU version of Gromacs 4.5.1 separately (built with CMake), then downloaded the precompiled mdrun-gpu beta2 for Ubuntu 9.10 64-bit, set up the environment variables, and ran it. It failed with:

Fatal error: reading tpx file (md.tpr) version 73 with version 71 program
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors

Roughly, this means the precompiled mdrun-gpu cannot run tpr files generated by the 4.5.1 grompp program.

Day 2:

I solved the problem above by building from source: the precompiled mdrun-gpu has a different internal version number from the 4.5.1 tools, hence the incompatibility. Following the instructions, the 4.5.1 mdrun-gpu compiled successfully:

——————————————————————————————
export OPENMM_ROOT_DIR=path_to_custom_openmm_installation
cmake -DGMX_OPENMM=ON
make mdrun
make install-mdrun
——————————————————————————————

But a new problem appeared; running it gave the error:

mdrun-gpu: error while loading shared libraries: libopenmm_api_wrapper.so: cannot open shared object file: No such file or directory

Very strange! The environment variables were all set correctly, yet no libopenmm_api_wrapper.so file could be found anywhere in the openmm directory.

Day 3:

I switched the operating system to RHEL 5.5 and, using the same installation procedure, the problem above went away. I do not understand why, but I expect there is a way to fix it on Ubuntu too (setting that aside for now).

Day 4:

A summary of the problems I ran into, and their solutions:

1. Version mismatch:
——————————————————————————————
Fatal error: reading tpx file (md.tpr) version 73 with version 71 program
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
——————————————————————————————
This says the versions are incompatible.

2. OpenMM does not support multiple temperature-coupling groups:
——————————————————————————————
Fatal error: OpenMM does not support multiple temperature coupling groups.
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
——————————————————————————————

3. Old mdp settings cannot be reused unchanged:
——————————————————————————————
Fatal error: OpenMM uses a single cutoff for both Coulomb and VdW interactions. Please set rcoulomb equal to rvdw.
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
——————————————————————————————

4. GPU-accelerated gmx currently supports only the AMBER and CHARMM force fields:
——————————————————————————————
Fatal error: The combination rules of the used force-field do not match the one supported by OpenMM: sigma_ij = (sigma_i + sigma_j)/2, eps_ij = sqrt(eps_i * eps_j). Switch to a force-field that uses these rules in order to simulate this system using OpenMM.
——————————————————————————————

5. GPU-accelerated gmx does not support (some) G96 interactions; again, really a force-field problem:
——————————————————————————————
Fatal error: OpenMM does not support (some) of the provided interaction type(s) (G96Angle)
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
——————————————————————————————

6. Building gromacs 4.5.1 with cmake on Ubuntu 9.10 runs into the missing libopenmm_api_wrapper.so problem; switching to RHEL 5.5 solves it:
——————————————————————————————
error while loading shared libraries: libopenmm_api_wrapper.so: cannot open shared object file: No such file or directory
——————————————————————————————

Day 5:

mdrun-gpu finally runs! The mdp file is the one from the benchmarks on the official site, but there are still some warnings:

It is also possible to optimize the transforms for the current problem by performing some calculations at the start of the run.
This is not done by default since it takes a couple of minutes, but for large runs it will save time. Turn it on by specifying

optimize_fft = yes

WARNING: OpenMM does not support leap-frog, will use velocity-verlet integrator.
WARNING: OpenMM supports only Andersen thermostat with the md/md-vv/md-vv-avek integrators.

Pre-simulation ~15s memtest in progress...done, no errors detected

starting mdrun 'Protein in water'
1000000 steps,   2000.0 ps.

               NODE (s)   Real (s)      (%)
       Time:     33.080     99.577     33.2
               (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
Performance:      0.000      0.074     47.609      0.504

gcq#330: "Go back to the rock from under which you came" (Fiona Apple)
————————————————————————————————

I can't quite make sense of this Performance line. Going by ns/day alone, the performance is five times that of the quad-core CPU, but in actual runs it is only about twice as fast. And the MFlops entry is an astonishing 0.074, while the CPU does 12 GFlops. Overall, GPU-accelerated GMX gives at least 2x the performance of a traditional quad-core CPU; for implicit-solvent models the official site claims more than 10x. More updates to come......

Day 6:

For the 190,000-atom system I set up a 10 ns run. Performance shows 5 ns/day, so in theory it should finish in two days, but in fact it will not finish until the 28th (it started at 1 pm on October 18), which clearly contradicts the reported performance:

5000000 steps,  10000.0 ps.
step 417300, will finish Thu Oct 28 10:39:59 2010    (started Oct 18; projected to finish Oct 28)

Received the TERM signal, stopping at the next step
step 417378, will finish Thu Oct 28 10:39:46 2010

Post-simulation ~15s memtest in progress...done, no errors detected

               NODE (s)   Real (s)      (%)
       Time:  13633.960  71173.931     19.2
                          3h47:13
               (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
Performance:      0.000      0.003      5.290      4.537

gcq#47: "I Am Testing Your Grey Matter" (Red Hot Chili Peppers)

Day 7:

Same system, same settings; below is the performance of a run on the quad-core QX9650 CPU. It is slower than the GTX460, but only by about three extra days:

——————————————————————————————
Back Off! I just backed up md.trr to ./#md.trr.1#
Back Off! I just backed up md.edr to ./#md.edr.1#
WARNING: This run will generate roughly 3924 Mb of data

starting mdrun 'Good gRace! Old Maple Actually Chews Slate in water'
5000000 steps,  10000.0 ps.
step 0
NOTE: Turning on dynamic load balancing
step 500, will finish Mon Nov  1 09:57:48 2010  vol 0.74  imb F  2%    (started the morning of Oct 19; projected to finish Nov 1)

Received the TERM signal, stopping at the next NS step
step 550, will finish Mon Nov  1 10:08:34 2010

Average load imbalance: 2.4 %
Part of the total run time spent waiting due to load imbalance: 1.1 %
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Y 0 %

Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:    123.856    123.856    100.0
                             2:03
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    156.395     11.835      0.769     31.220

gcq#358: "Now it's filled with hundreds and hundreds of chemicals" (Midlake)

Day 8:

Running the benchmark from the official site: the GTX460 scores 102 ns/day, not as far behind the C2050 as I had imagined!

Pre-simulation ~15s memtest in progress...done, no errors detected

starting mdrun 'Protein'
-1 steps, infinite ps.
step 285000 performance: 102.1 ns/day

Received the TERM signal, stopping at the next step
step 285028 performance: 102.1 ns/day

Post-simulation ~15s memtest in progress...done, no errors detected

               NODE (s)   Real (s)      (%)
       Time:    481.290    482.224     99.8
                             8:01
               (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
Performance:      0.000      0.002    102.335      0.235

Summary:

This new generation of GPU-accelerated Gromacs molecular dynamics shows the bright future of GPUs in the MD field, but it is not yet mature. From the tests above, in implicit-solvent MD the GPU-accelerated performance is at least 10x that of a traditional quad-core CPU, but with explicit solvent the GPU advantage is far less pronounced. Most importantly, the current Gromacs 4.5.1 places many restrictions on GPU-accelerated MD: force-field support is limited, many features are unsupported, and reproducibility of the simulations is poor compared with CPU runs; still, given sufficiently long simulation times, it will produce statistically meaningful trajectories with reasonably good reproducibility. For more information see: www.gromacs.org/GPU
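Drawing on the day-4 error list, here is a minimal sketch of mdp settings that stay compatible with the OpenMM backend of mdrun-gpu 4.5.1. The numeric values are illustrative (chosen to match the 10 ns run above), not tuned recommendations, and the file name is arbitrary:

cat > md_gpu.mdp <<'EOF'
; Sketch of OpenMM-friendly settings; assumes an AMBER or CHARMM force field,
; since those are the only ones mdrun-gpu 4.5.1 accepts (day-4 error #4).
integrator  = md          ; the GPU build warns it will use velocity-verlet anyway
dt          = 0.002
nsteps      = 5000000     ; 10 ns, as in the day-6 run

; OpenMM uses a single cutoff for Coulomb and VdW: rcoulomb must equal rvdw (error #3).
rcoulomb    = 1.0
rvdw        = 1.0

; Only one temperature-coupling group is allowed (error #2): couple the whole system.
tc-grps     = System
tcoupl      = berendsen   ; mdrun-gpu warns it supports only the Andersen thermostat
tau_t       = 0.1
ref_t       = 300
EOF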
Building Gromacs 4.0.7 in parallel with QM/MM support

Platform: SUSE Linux Enterprise Desktop 10 SP3, gcc 4.1.2, mpich2-1.2.1p1, ifort 10.1, fftw 3.2.2

Unpack mpich2-1.2.1p1.tar.gz, enter the directory, and run:

./configure
make
make install

Then run:

touch /etc/mpd.conf
chmod 700 /etc/mpd.conf

and add the following to mpd.conf:

secretword=secretword   (for example, secretword=ltwd)

Unpack fftw 3.2.2, enter the directory, and install it under /soft/fftw:

./configure --enable-float --enable-threads
make
make install

Copy libmopac.a into /soft/fftw/lib and /lib.

Set the environment variables:

setenv CPPFLAGS -I/soft/fftw/include
setenv LDFLAGS -L/soft/fftw/lib

Unpack gromacs 4.0.7 and enter the directory:

./configure --prefix=/soft/gromacs --enable-mpi --enable-fortran --with-qmmm-mopac --enable-shared
make
make install

Set the environment variables:

setenv LIBS -lmopac
setenv LD_LIBRARY_PATH /soft/gromacs/lib
source /soft/gromacs/bin/completion.csh
set path=(/soft/gromacs/bin $path)

Other configure options:

CC         C compiler command; usually gcc
CFLAGS     C compiler flags at build time, typically -O3
LDFLAGS    linker flags, e.g. -L<lib dir> if you have libraries in a nonstandard directory <lib dir>
LIBS       libraries to pass to the linker, e.g. -llibrary; no quotes needed
CPPFLAGS   C/C++/Objective C preprocessor flags, e.g. -I<include dir> if you have headers in a nonstandard directory <include dir>
F77        Fortran 77 compiler command; usually gfortran or ifort
FFLAGS     Fortran 77 compiler flags at build time, typically -O3
CCAS       assembler compiler command (defaults to CC)
CCASFLAGS  assembler compiler flags (defaults to CFLAGS)
CPP        C preprocessor
CXX        C++ compiler command; usually g++
CXXFLAGS   C++ compiler flags
CXXCPP     C++ preprocessor
XMKMF      Path to xmkmf, Makefile generator for X Window System

Optional Features:
--disable-FEATURE          do not include FEATURE (same as --enable-FEATURE=no)
--enable-FEATURE           include FEATURE
--enable-shared            build shared libraries
--disable-float            use double instead of single precision
--enable-double            same effect as --disable-float
--enable-fortran           use fortran (default on sgi, ibm, sun, axp)
--enable-mpi               compile for parallel runs using MPI
--disable-threads          don't try to use multithreading
--enable-mpi-environment=VAR   only start parallel runs when VAR is set
--disable-ia32-3dnow       don't build 3DNow! assembly loops on ia32
--disable-ia32-sse         don't build SSE/SSE2 assembly loops on ia32
--disable-x86-64-sse       don't build SSE assembly loops on X86_64
--disable-ppc-altivec      don't build Altivec loops on PowerPC
--disable-ia64-asm         don't build assembly loops on ia64
--disable-cpu-optimization no detection or tuning flags for cpu version
--disable-software-sqrt    no software 1/sqrt (disabled on sgi, ibm, ia64)
--enable-prefetch-forces   prefetch forces in innerloops
--enable-all-static        make completely static binaries
--disable-dependency-tracking  speeds up one-time build
--enable-dependency-tracking   do not reject slow dependency extractors
--enable-static            build static libraries
--enable-fast-install      optimize for fast installation
--disable-libtool-lock     avoid locking (might break parallel builds)
--disable-largefile        omit support for large files

Optional Packages:
--with-PACKAGE             use PACKAGE
--without-PACKAGE          do not use PACKAGE (same as --with-PACKAGE=no)
--with-fft=                FFT library to use; fftw3 is the default, fftpack is built in
--with-external-blas       use system BLAS library (add to LIBS); automatic on OS X
--with-external-lapack     use system LAPACK library (add to LIBS); automatic on OS X
--without-qmmm-gaussian    Interface to mod.
Gaussian0x for QM-MM (see website)
--with-qmmm-gamess         use modified Gamess-UK for QM-MM (see website)
--with-qmmm-mopac          use modified Mopac 7 for QM-MM (see website)
--with-gnu-ld              assume the C compiler uses GNU ld
--with-pic                 try to use only PIC/non-PIC objects
--with-tags                include additional configurations
--with-dmalloc             use dmalloc, as in http://www.dmalloc.com/dmalloc.tar.gz
--with-x                   use the X Window System
--with-motif-includes=DIR  Motif include files are in DIR
--with-motif-libraries=DIR Motif libraries are in DIR
--without-gsl              do not link to the GNU scientific library; prevents certain analysis tools from being built
--with-xml                 link to the xml2 library (experimental)
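With the MOPAC interface compiled in (--with-qmmm-mopac), a QM/MM run is switched on through the mdp file. Below is a hedged sketch of what that section might look like, using the option names from the Gromacs QM/MM interface documentation as I recall them (QMMM, QMMM-grps, QMmethod, QMbasis, QMcharge, QMmult); the index group QMatoms and the method/charge values are purely illustrative, so check them against the documentation before use:

cat >> md.mdp <<'EOF'
; QM/MM through the modified Mopac 7 interface built above (--with-qmmm-mopac).
; "QMatoms" is a hypothetical index group defining the QM region.
QMMM       = yes
QMMM-grps  = QMatoms
QMmethod   = AM1       ; semi-empirical method handled by MOPAC
QMbasis    = STO-3G    ; basis entry expected by the interface
QMcharge   = 0         ; total charge of the QM region
QMmult     = 1         ; spin multiplicity of the QM region
EOF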