Mimicking Google's Pregel: An Introduction to Apache Hama

Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework modeled after Google's Pregel. It is intended for large-scale scientific computing, in particular matrix and graph computations. The BSP model was proposed in 1990 by Valiant (winner of the 2010 Turing Award); see Wikipedia for details. In 2009 Google published the paper "Pregel: A System for Large-Scale Graph Processing", implementing the BSP model in a distributed setting.

Installing Hama

Environment:
OS: Ubuntu 12.04 64-bit
Java: jdk1.6.0_30
Hadoop: hadoop-1.0.4

Before installing Hama, make sure Hadoop is already installed on the system; here I use the then-latest hadoop-1.0.4.

Step 1: Download and unpack

Hama download: http://mirror.bit.edu.cn/apache/hama/0.6.0/ (I use the Beijing Institute of Technology Apache mirror.) Unpack the archive into the installation directory. I like to install both Hadoop and Hama under my home directory, which keeps the whole system tidy.

tar -xvzf hama-0.6.0.tar.gz

Step 2: Edit the configuration files

Go to the $HAMA_HOME/conf directory.

Edit hama-env.sh and set the JAVA_HOME variable.

Edit hama-site.xml. My hama-site.xml looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>bsp.master.address</name>
    <value>LenovoE46a:40000</value>
    <description>The address of the bsp master server. Either the literal
    string "local" or a host:port for distributed mode.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>LenovoE46a:9000</value>
    <description>The name of the default file system. Either the literal
    string "local" or a host:port for HDFS.</description>
  </property>
  <property>
    <name>hama.zookeeper.quorum</name>
    <value>LenovoE46a</value>
    <description>Comma separated list of servers in the ZooKeeper Quorum.
    For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
    By default this is set to localhost for local and pseudo-distributed
    modes of operation. For a fully-distributed setup, this should be set
    to a full list of ZooKeeper quorum servers. If HAMA_MANAGES_ZK is set
    in hama-env.sh this is the list of servers which we will start/stop
    zookeeper on.</description>
  </property>
  <property>
    <name>hama.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>

A quick explanation: bsp.master.address is the address of the BSP master. fs.default.name is the address of the Hadoop namenode. hama.zookeeper.quorum and hama.zookeeper.property.clientPort configure ZooKeeper; point them at the ZooKeeper quorum servers. For a single-machine pseudo-distributed setup that is simply the local host.

Step 3: Start Hama

Start Hadoop first:

% $HADOOP_HOME/bin/start-all.sh

Then start Hama:

% $HAMA_HOME/bin/start-bspd.sh

Check the running processes with jps to verify that everything started successfully.

Step 4: Run an example program

Here we use the PageRank example. First upload the input data to HDFS. The format is one site per line, with its outgoing links separated by tabs:

Site1\tSite2\tSite3
Site2\tSite3
Site3

Then run Hama, where /tmp/input/input.txt is the input file and /tmp/pagerank-output is the output directory:

bin/hama jar ../hama-0.6.0-examples.jar pagerank /tmp/input/input.txt /tmp/pagerank-output

Success!

Source: http://www.cnblogs.com/DingaGa/archive/2012/12/16/2820331.html
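For concreteness, here is a minimal shell sketch of Step 4 end to end. It assumes Hama 0.6.0 is unpacked under the home directory and that the hadoop command is on the PATH; the file names and HDFS paths simply follow the ones used above, and the part-00000 output name is an assumption about how the example writes its results:

# Build the tab-separated adjacency list locally (printf expands \t to real tabs).
printf 'Site1\tSite2\tSite3\nSite2\tSite3\nSite3\n' > input.txt

# Upload it to HDFS where the example expects it.
hadoop fs -mkdir /tmp/input
hadoop fs -put input.txt /tmp/input/input.txt

# Run the PageRank example; results land in /tmp/pagerank-output on HDFS.
cd ~/hama-0.6.0
bin/hama jar hama-0.6.0-examples.jar pagerank /tmp/input/input.txt /tmp/pagerank-output

# Inspect the output (assuming the usual part-file naming).
hadoop fs -cat /tmp/pagerank-output/part-00000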
Source: http://www.nsc.liu.se/~pla/blog/2012/09/26/vaspkpar/

Testing the K-point Parallelization in VASP
Sep 26th, 2012

VASP 5.3.2 finally introduced official support for k-point parallelization. What can we expect from this new feature in terms of performance? In general, you only need many k-points in relatively small cells, so up front we would expect k-point parallelization to improve time-to-solution for small cells with hundreds or thousands of k-points. We do have a subset of users at NSC running big batches of these jobs, so this may be a real advantage in the prototyping stage of simulations, when the jobs are set up. In terms of actual job throughput for production calculations, however, k-point parallelization should not help much, as peak efficiency is reached already with 8-16 cores on a single node.

So let's put this theory to the test. Previously, I benchmarked the 8-atom FeH system with 400 k-points for this scenario. The maximum throughput was achieved with two 8-core jobs running on the same node, and the best time-to-solution was 3 minutes (20 jobs/h) with 16 cores on one compute node. What can k-point parallelization do here?

KPAR is the new parameter which controls the number of k-point parallelized groups. KPAR=1 means no k-point parallelization, i.e. the default behavior of VASP. For each bar in the chart, the NPAR value has been individually optimized (and is thereby different for each number of cores). Previously, this calculation did not scale at all beyond one compute node (blue bars), but with KPAR=8 (purple bars), we get close to linear (1.8x) speed-up going from 1 to 2 nodes, cutting the time-to-solution in half.

As suspected, in terms of efficiency, the current k-point parallelization is no more efficient than the old scheme when running on a single node, which means that peak throughput remains the same at roughly 24 jobs/h per compute node. This is a little surprising, given that there should be overhead associated with running two jobs simultaneously on a node, compared to using k-point parallelization. What must be remembered, though, is that it is considerably easier to handle the file and job management for several sequential KPAR runs than to juggle several jobs per node with many directories, so in this sense KPAR seems like a great addition with respect to workflow optimization.

Posted by Peter Larsson, Sep 26th, 2012

Source: http://www.nsc.liu.se/~pla/blog/2012/11/26/vaspkpar2/

K-point Parallelization in VASP, Part 2
Nov 26th, 2012

Previously, I tested the k-point parallelization scheme in VASP 5.3 for a small system with hundreds of k-points. The outcome was acceptable, but less than stellar. Paul Kent (who implemented the scheme in VASP) suggested that it would be more instructive to benchmark medium to large hybrid calculations with just a few k-points, since this was the original use case, and consequently where you would see the most benefit. To investigate this, I ran a 63-atom MgO cell with the HSE06 functional and 4 k-points over 4 to 24 nodes.

A suitable number of bands here is 192, so the maximum number of nodes we could expect to use with standard parallelization is 12, since 12 nodes x 16 cores/node = 192 cores. And we do see that KPAR=1 flattens out at 1.8 jobs/h on 12 nodes. But with k-point parallelization, the calculation can be split into "independent" groups, each running on 192 cores.
This enables us, for example, to run the job on 24 nodes using KPAR=2, which in this case translates into a doubling of speed (4.0 jobs/h) compared to the best-case scenario without k-point parallelization. So there is indeed a real benefit for hybrid calculations on cells that are small enough to need a few k-points. And remember that in order for the k-point parallelization to work correctly with hybrids, you should set NPAR = total number of cores / KPAR.
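To make that rule concrete, here is a minimal sketch of how such a hybrid job might be set up, assuming the 24-node, 16-cores-per-node case from the benchmark. LHFCALC and HFSCREEN are the standard INCAR tags for an HSE06-type hybrid, but the script layout and the launcher invocation are placeholders, not the author's actual setup:

# 24 nodes x 16 cores/node = 384 cores, split into 2 k-point groups of 192 cores each.
CORES=384
KPAR=2
NPAR=$((CORES / KPAR))    # NPAR = total number of cores / KPAR = 192

cat >> INCAR <<EOF
LHFCALC = .TRUE.   ! HSE06-type hybrid functional
HFSCREEN = 0.2
KPAR = $KPAR       ! number of k-point parallelized groups
NPAR = $NPAR       ! band parallelization within each group
EOF

mpirun -np $CORES vasp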
Source: http://cms.mpi.univie.ac.at/vasp/guide/node138.html

The optimum setting of NPAR and LPLANE depends very much on the type of machine you are running on. Here are a few guidelines:

SGI Power Challenge: Usually one is running on a relatively small number of nodes, so load balancing is no problem. The communication bandwidth is also reasonably good on SGI Power Challenge machines. Best performance is often achieved with

LPLANE = .TRUE.
NPAR = 1
NSIM = 1

Increasing NPAR usually worsens performance. For NPAR=1 we have in fact observed superlinear scaling w.r.t. the number of nodes in many cases. This is because the cache on the SGI Power Challenge machines is relatively large (4 Mbytes); as the number of nodes increases, the real space projectors (or reciprocal projectors) can be kept in the cache, and cache misses therefore decrease significantly.

SGI Origin: The SGI Origin behaves quite differently from the SGI Power Challenge, mainly because the memory bandwidth is a factor of three higher than on the SGI Power Challenge. The following setting seems to be optimal when running on 4-16 nodes:

LPLANE = .TRUE.
NPAR = 4
NSIM = 4

Contrary to the SGI Power Challenge, superlinear scaling could not be observed, evidently because data locality and cache reuse are only of minor importance on the Origin 2000.

T3D, T3E: On many T3D and T3E platforms one is forced to use a huge number of nodes. In that case load balancing problems and problems with the communication bandwidth are likely to be experienced. In addition, the cache is fairly small on T3E and T3D machines, so it is impossible to keep the real space projectors in the cache with any setting. Therefore, we recommend setting NPAR on these machines to roughly the square root of the number of nodes (explicit timing can be helpful to find the optimum value). The use of LPLANE = .TRUE. is only recommended if the number of nodes is significantly smaller than NGX, NGY and NGZ. In summary, the following setting is recommended:

LPLANE = .FALSE.
NPAR = sqrt(number of nodes)
NSIM = 1

Source: http://www.nsc.liu.se/~pla/blog/2012/02/22/nparnsim/

Optimizing NPAR and NSIM
Feb 22nd, 2012

Our VASP users at NSC sometimes ask how to set NSIM and NPAR for good parallel performance in VASP. I wrote about NPAR before, but what about the NSIM parameter? The VASP manual says that NSIM=4 by default. It means that 4 bands are optimized in parallel in the RMM-DIIS algorithm, which allows VASP to exploit matrix-matrix BLAS operations instead of matrix-vector operations. But the NSIM/NPAR parameters should be adjusted to the actual underlying hardware (network, type of processor, caches, etc.).

Here are some results for the 24-atom PbSO4 cell running on a single Kappa compute node. Each bar in the chart represents the average of three runs. It looks like NPAR and NSIM are largely independent factors, with NPAR being the more important one. Varying NPAR can give a performance boost of up to ca 50%, while varying NSIM gives about 10%. The internal variability between runs is less than 1% in this case, so the differences are real. We can conclude that NPAR=1 is optimal for a single-node job, as expected, and that NSIM=2 might be beneficial instead of keeping the default NSIM.

A more realistic example is a 128-atom Li2FeSiO4 supercell. This one we run on 4 nodes (32 cores) on Matter. It is a highly symmetric case with 512 bands. Like before, 3 runs were made for each data point. We find the best performance for NPAR=2/4, in line with previous results.
But here, the default NSIM=4 setting seems to produce the worst performance, and the influence of NSIM is larger (up to +20% speed). The optimal choice seems to be NSIM=16. It is tempting to conjecture that NSIM should be increased even more for larger jobs.

To investigate the upper limit of VASP jobs, let us look at the NiSi supercell with 504 atoms. It takes about 23 minutes to run a full SCF cycle on 32 Matter nodes. The outcome is not so encouraging, however: NSIM=16 does not deliver an increase in performance, and the influence of smaller NSIM values is dwarfed by other measurement errors. In this case, a likely culprit is the variation in MPI performance when running over many Infiniband switches. So for large jobs, NSIM seems to make less difference, and you can leave it at the default value.

In conclusion:
Use NPAR = 1 and NSIM = 2 for single-node jobs.
Use NPAR = nodes/2 (or nodes) and NSIM = 2 for medium jobs. If you have enough bands per core and want to optimize, you can try NSIM=8/16 and see if it helps.
Use NPAR = sqrt(nodes) and NSIM = 4 for large jobs.

Source: http://www.nsc.liu.se/~pla/blog/2013/03/25/vaspabisko/

Test setup

Here, we will be looking at the Li2FeSiO4 supercell test case with 128 atoms. I am running a standard spin-polarized DFT calculation (no hybrid), which I run to self-consistency with ALGO=fast. I adjusted the number of bands to 480 to better match the number of cores per node.

Naive single node test

A first (naive) test is to characterize the parallel scaling within a single compute node, without doing anything special such as process binding. Basically, this is what you get when you ask the queue system for 12, 16, 24, 36, or 48 cores on 1 node with exclusive running rights (no other jobs on the same node), and you just launch VASP with srun $VASP in the job script. After tuning the NPAR values, we get nice intra-node scaling. In fact, it is much better than expected, but we will see in the next section that this is an illusion. The optimal choice of NPAR turned out to be:

12 cores: NPAR=1
16 cores: NPAR=1
24 cores: NPAR=3
36 cores: NPAR=3
48 cores: NPAR=6

This was also surprising, since I had expected NPAR=8 to be optimal. With that setting, there would be MPI process groups of 6 ranks, which exactly fit in a NUMA zone. Unexpectedly, NPAR=6 seems optimal when using all 48 cores, and either NPAR=1 or NPAR=3 for the other cases. This does not fit the original hypothesis, but a weakness in our analysis is that we don't actually know where the processes end up in this scenario, since there is no binding. The only way you can get a symmetric communication pattern with NPAR=6 is to place ranks in a round-robin scheme around each NUMA zone or socket. Perhaps this is what the Linux kernel is doing? An alternative hypothesis is that the unconventional choice of NPAR creates a load imbalance that may actually be beneficial, because it allows for better utilization of the second core in each module. To explore this, I decided to test different binding scenarios.

The importance of process binding

To bind MPI processes to a physical core and prevent the operating system from moving them around inside the compute node, you need to give extra flags to either srun or your MPI launching command such as mpirun. On Abisko, we use srun, where binding is controlled through SLURM by setting, e.g. in the job script:

srun --cpu_bind=rank ...

This binds MPI rank #1 to core #1, and so on in a sequential manner.
It is also possible to explicitly specify where each rank should go. The following example binds 24 ranks to alternating cores, so that there is one rank running per Interlagos module:

srun --cpu_bind=map_cpu=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46 ...

In this scheme, neighboring ranks are close to each other: i.e. rank #1 and #2 are in the same NUMA zone. The aim is to maximize NUMA locality.

The third type of binding I tried was to distribute the ranks in a round-robin scheme in steps of 6 cores. The aim here is to minimize NUMA locality, since neighboring ranks are far apart from each other: i.e. rank #1 and #2 are in different NUMA zones.

srun --cpu_bind=map_cpu=0,6,12,18,24,30,36,42,2,8,14,20,26,32,38,44,4,10,16,22,28,34,40,46 ...

Below are the results when comparing the speed of running with 48 cores and 24 cores with different kinds of process binding. The 48-core runs are with NPAR=6 and the 24-core runs with NPAR=3. It turns out that you can get all of the performance, and even more, by running with 24 cores in the "fat" mode. The trick is, however, that we need to enable the process binding ourselves. It does not happen by default when you run with half the number of cores per node (the "None" section in the graph). We can further observe that straight sequential process binding actually worsens performance in the 48-core scenario. Only in the round-robin NUMA scheme ("RR-NUMA") can we reproduce the performance of the unbound case. This leads me to believe that running with no binding gets you into a similar situation with broken NUMA locality, which explains why NPAR=3/6 is optimal, and not NPAR=4.

The most surprising finding, however, is that the top speed was achieved not with the "alternate" binding scheme, which emphasizes NUMA memory locality, but rather with the round-robin scheme, which breaks the memory locality of NPAR groups. The difference in speed is small (about 3%), but statistically significant. There are few scenarios where this kind of interleaving over NUMA zones is beneficial, so I suspect that it is not actually a NUMA issue, but rather related to memory caches. The L3 cache is shared between all cores in a NUMA zone, so perhaps the L3 cache is being thrashed when all the ranks in an NPAR group are accessing it? It would be interesting to try to measure this effect with hardware counters.

NSIM

Finally, I also made some tests with varying NSIM. NSIM=4 is the default setting in VASP, and it usually gives good performance in many different scenarios. NSIM=4 works on Abisko too, but I gained about 7% by using NSIM=8 or 12. An odd finding was that NSIM=16 completely crippled the performance, doubling the wall time compared to NSIM=4. I have no good explanation, but it seems one should be careful with too high NSIM values on Abisko.

Conclusion and overall speed

In terms of absolute speed, we can compare with Triolith, where one node with 16 cores can run this example in 380s (9.5 jobs/h) with 512 bands, using the optimal settings of NPAR=2 and NSIM=1. So the overall conclusion is that one Abisko node is roughly 20% faster than one Triolith node. You can easily become disappointed by this when comparing the performance per core, which is 2.5x higher on Triolith, but I think it is not a fair comparison.
In reality, the performance difference per FPU is more like 1.3x, and if you compensate for the fact that the Triolith processors actually run at a much higher frequency than the listed 2.2 GHz, the true difference in efficiency per core-GHz is closer to 1.2x. Hopefully, I can make some multi-node tests later and determine whether running with 24 cores per node and round-robin binding is the best choice there as well.

Posted by Peter Larsson, Mar 25th, 2013
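Pulling the Abisko findings together, here is a hedged sketch of a SLURM job script for the 24-cores-per-node "fat" mode with the round-robin NUMA binding quoted above. NPAR=3 and NSIM=8 are simply the values that did well in these tests; the vasp binary name and the assumption that an INCAR already exists in the working directory are placeholders:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24   # "fat" mode: one rank per Interlagos module (half of 48 cores)
#SBATCH --exclusive

# Parallelization settings that performed best above (24 cores -> NPAR=3; NSIM=8 gave ~7%).
echo "NPAR = 3" >> INCAR
echo "NSIM = 8" >> INCAR

# Round-robin binding over NUMA zones in steps of 6 cores (the "RR-NUMA" map from the text).
srun --cpu_bind=map_cpu=0,6,12,18,24,30,36,42,2,8,14,20,26,32,38,44,4,10,16,22,28,34,40,46 vasp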
The "Similarity Principle" in Theoretical Computer Science

The "similarity principle" from the entry on theoretical computer science in the Encyclopedia of China (《中国大百科全书》), excerpted as follows:

So, is there a relatively unified standard for complexity? The similarity principle answers this question. According to this principle, the parallel time, space, and serial time needed to compute a problem class show no essential difference across all ideal models of computation. In mathematical language: the various models can not only simulate one another, but the parallel time, space, and serial time needed by the simulator are each bounded by a polynomial in the parallel time, space, and serial time needed by the simulated machine. So, up to a polynomial, complexity has an objective basis that does not depend on the model. For all the models mentioned above, and for the various types of computation, the similarity principle has been proved correct.

For serial models of computation one can define a virtual parallel time, called the tour. The tour is the number of phases in a computation, where a phase is a stage of the computation during which newly computed information written to the work space may not be read within the same stage. For a multi-tape Turing machine, for example, a phase is a period during which the work-tape heads do not change direction, so the tour corresponds to the total number of times the work-tape heads reverse direction. One can therefore say that a problem class has a fast parallel algorithm if and only if it has a serial algorithm with a small tour number (virtual parallel time). In addition, parallel time and space have certain symmetric properties; for example, it can be proved that they are polynomially related. So a problem class has a fast parallel-time algorithm if and only if it has a highly memory-efficient algorithm.

But I cannot understand what this means. Could you please explain it, real experts? Thank you! Who proved this principle? Was the original in Chinese or English? Where was it published? Where can one now find an accurate and detailed introduction in English? Please advise, real experts! Thank you!

—————————————————————
Appendix
—————————————————————

http://210.45.210.9:918/web/index.htm

lilun jisuanji kexue
理论计算机科学 (Volume: Mathematics)
theoretical computer science

Complexity theory and algorithm analysis

The logical continuation of computability theory is computational complexity theory. The famous Turing-Church thesis says that all reasonable models of computation are equivalent: whatever one model can compute, the others can compute as well, and what one cannot compute, none can. But for a problem class, knowing only whether it can be computed is far from enough; what matters in practice is how much time the computation takes, how much memory it needs, and so on. This gave rise to computational complexity theory and algorithm analysis.

The time, storage, and so on needed by a computation are counted as resources. Strictly speaking, the definition of each resource depends on a model of computation. Although the resources are defined differently in different models, there are three main ones:

1. Serial time (time for short): the total amount of computation, i.e. the total time needed to complete the computation when it is divided into primitive steps.
2. Space: the amount of room needed to store the intermediate results of the computation. For instance, if a blackboard is used as scratch paper during the computation, each unit of blackboard area can hold one symbol, and symbols can be erased and rewritten, then space corresponds to the blackboard area needed for the scratch work.
3. Parallel time: the time needed for a parallel computation, i.e. for several people or several machines to solve a problem cooperatively.

Complexity is always discussed relative to a particular problem class, which contains infinitely many individual problems, large and small. For matrix multiplication, say, multiplying 100 x 100 matrices is a relatively large problem, while multiplying 2 x 2 matrices is a small one, so the order n can serve as a measure of problem size. Most generally, the total input length n is taken as the measure of problem size. Given an algorithm, the time, space, and so on needed to compute a problem of size n can then be expressed as functions of n. Such a function is called the time or space complexity measure of the algorithm, or its complexity. Strictly speaking, it is the complexity of this particular problem class, in one particular model of computation, under one particular algorithm. As the problems to be solved grow larger, at what rate does the consumption of time, space, and the other resources grow, i.e. what is the order of growth of this function as n tends to infinity? This is the question complexity theory is concerned with.

Different algorithms for the same problem class can be better or worse. For example, to decide whether an undirected graph with n vertices contains a cycle, the early algorithms needed space S(n) = O(log² n), but later a more refined algorithm was designed whose space complexity is only O(log n). This shows that O(log² n) was merely the space complexity of the early algorithm, not the intrinsic complexity of the problem itself; in other words, O(log² n) is only an upper bound on the space complexity of the cycle problem, and O(log n) is a new, better upper bound. For the cycle problem, any algorithm needs work space at least proportional to log n; that is, for every algorithm the space complexity is S(n) = Ω(log n). Hence Ω(log n) is a lower bound on the space complexity of the cycle problem. The intrinsic complexity of the problem lies between the upper and lower bounds. In this example the two coincide, so one may say the space complexity of the cycle problem is proportional to log n, written S(n) = Θ(log n).

Similarly, multiplying two n-digit binary integers on a multi-tape Turing machine takes O(n²) with the ordinary algorithm, but improved algorithms achieve O(n^1.5), and the best algorithm now known achieves O(n log n log log n). With a storage modification machine as the model, O(n) can be achieved. One sees that the intrinsic computational complexity of a problem also varies with the model of computation.

The more important and distinctive models of computation include multi-tape Turing machines; multi-tape, multi-dimensional, multi-head Turing machines; hardware modification machines (HMM); random access machines (RAM); parallel random access machines (PRAM); vector machines; VLSI; and so on. There is, however, no single unified model suitable for all problems. So, is there a relatively unified standard for complexity? The similarity principle answers this question. According to this principle, the parallel time, space, and serial time needed to compute a problem class show no essential difference across all ideal models of computation. In mathematical language: the various models can not only simulate one another, but the parallel time, space, and serial time needed by the simulator are each bounded by a polynomial in the parallel time, space, and serial time needed by the simulated machine. So, up to a polynomial, complexity has an objective basis that does not depend on the model. For all the models mentioned above, and for the various types of computation, the similarity principle has been proved correct.

For serial models of computation one can define a virtual parallel time, called the tour: the number of phases in a computation, where a phase is a stage during which newly computed information written to the work space may not be read within the same stage. For a multi-tape Turing machine, a phase is a period during which the work-tape heads do not change direction, so the tour corresponds to the total number of head reversals. One can therefore say that a problem class has a fast parallel algorithm if and only if it has a serial algorithm with a small tour number (virtual parallel time). In addition, parallel time and space have certain symmetric properties; for example, it can be proved that they are polynomially related. So a problem class has a fast parallel-time algorithm if and only if it has a highly memory-efficient algorithm.

Since, up to polynomial relation, complexity does not depend on the model, one can write P for the class of all problem classes with polynomial-time algorithms; NP for all those with nondeterministic polynomial-time algorithms; NC for all those solvable simultaneously in polylogarithmic parallel time and polynomial space; and so on. It is generally held that only the problems in P are realistically computable, but many practical problems belong to NP and yet no deterministic polynomial-time algorithm has been found for them. Many people therefore conjecture that P ≠ NP. In 1971, Cook found a problem class in NP, the satisfiability problem for Boolean formulas in conjunctive normal form (SAT), and proved that the computation of any problem class in NP can be reduced to the computation of SAT. SAT is therefore called an NP-complete problem (see combinatorial optimization); NP = P if and only if there is a deterministic polynomial-time algorithm for SAT. Later, Karp and others reduced SAT to many other combinatorial problems, obtaining many further NP-complete problems. NP-complete problems now touch almost every branch of mathematics, and a dedicated theory has formed around them. Whether NP equals P remains unsolved; some conjecture that the question is independent of the existing axiom systems of mathematics. Completeness research is not limited to NP-completeness: every complexity class has its complete problems.

On the foundation of basic complexity theory, the study of the complexity of concrete classes of mathematical problems constitutes the main content of algorithm analysis, which has made great strides since the 1960s. For example: the n-dimensional fast Fourier transform and the multiplication of polynomials of degree n need only O(n log n) arithmetic operations; multiplying n-digit numbers on a multi-tape Turing machine needs only O(n log n log log n) time; deciding whether an n-digit number is prime takes time O(...); in the bounded-degree case, the graph isomorphism decision problem has a polynomial-time algorithm; deciding whether an integer-coefficient linear program has a rational solution has a polynomial-time algorithm; multiplying n x n matrices needs only 4.7 n^2.8 arithmetic operations, later improved to O(n^…), and it has been shown that this order can be improved indefinitely; and so on.
In contrast with upper bounds, the study of lower bounds has run into many difficulties, especially for problems of practical significance. For example, every NP-complete problem is conjectured to require at least exponential time, yet after many years not even a nonlinear lower bound has been proved.

Besides computational complexity there is also descriptive complexity, introduced by A. N. Kolmogorov and Chaitin. The information content I(w) of a 0-1 string w is defined as the length of the shortest program that outputs w. Since the various models can simulate one another, this definition is independent of the model up to a constant. Because any mathematical axiom system contains only a finite amount of information, no theorem of the form I(w) ≥ c can be proved within the system, where c is a constant depending only on the axiom system. Yet it is easy to see that all but finitely many w satisfy I(w) ≥ c (though not a single instance can be proved). This result simplifies and strengthens Gödel's incompleteness theorem.

Hong Jiawei (洪加威)
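The encyclopedia states the similarity principle only in words. As a reading aid, here is one way to write down its content in LaTeX; the notation (T_par for parallel time, S for space, T_ser for serial time) is my own paraphrase, not the encyclopedia's:

\[
\exists\,\text{polynomial } p\ \forall n:\quad
T_{\mathrm{par}}^{M_1}(n)\le p\!\left(T_{\mathrm{par}}^{M_2}(n)\right),\quad
S^{M_1}(n)\le p\!\left(S^{M_2}(n)\right),\quad
T_{\mathrm{ser}}^{M_1}(n)\le p\!\left(T_{\mathrm{ser}}^{M_2}(n)\right),
\]

whenever the ideal model \(M_1\) simulates the ideal model \(M_2\). The symmetry between parallel time and space asserted in the text then reads

\[
T_{\mathrm{par}}(n)\le p\!\left(S(n)\right)
\quad\text{and}\quad
S(n)\le p\!\left(T_{\mathrm{par}}(n)\right),
\]

which is exactly the statement that a problem class has a fast parallel-time algorithm if and only if it has a highly memory-efficient (small-space) algorithm.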
The rapid development of computer technology and the continual improvement of algorithms and the corresponding software have made it possible to simulate relatively large biomacromolecular systems and multimeric molecules on the ordinary PCs we have at hand, and this trend will only grow stronger. In particular, the arrival of multi-core CPUs and the increasing large-scale parallelization of molecular dynamics make it feasible for people doing biological research to run targeted simulations of proteins of interest on their personal computers and integrate them closely with wet-lab experiments, which is a very interesting and fun prospect.

Even so, many biomacromolecular systems are still very large. Take one of my systems: the adenovirus hexon trimer protein, which I want to subject to temperature-controlled molecular dynamics simulations in order to dynamically analyze its overall epitope composition, its heat-denaturation mechanism, and the immunogenic consequences of heat denaturation. The system contains about 940 × 3 = 2820 amino acid residues; with a cubic water box added, it totals about 200,000 atoms. On a quad-core QX9650 CPU under 64-bit Linux, Gromacs takes about 27 days per 10 ns of simulation, which is painfully slow; for an overall design of about 500 ns, this level of computing power is unbearable.

GPU-accelerated Gromacs brings very encouraging news: officially, Nvidia's CUDA technology can speed up MD simulation by more than ten times over a single CPU. Below is my complete walkthrough of running molecular dynamics simulations on an Nvidia GTX460 2G, shared here for everyone.

Day 1:

The Nvidia GTX460 with its large 2 GB of video memory should, according to the gmx website (www.gromacs.org/gpu), handle systems of around 200,000 atoms; my system happens to be about 190,000.

Hardware: Dell T3400 workstation, X38 motherboard, QX9650 CPU, 4 GB ECC RAM, GTX460 with 2 GB of video memory
Software: Ubuntu 9.10 64-bit, CUDA 3.1, OpenMM 2.0, FFTW 3.2.2, CMake, Gromacs 4.5.1

Following the official instructions, I first installed the CPU version of Gromacs 4.5.1 separately (built with CMake), then downloaded the precompiled mdrun-gpu beta2 for Ubuntu 9.10 64-bit, set up the environment variables, and ran it. It failed with:

Fatal error: reading tpx file (md.tpr) version 73 with version 71 program
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors

Roughly, this means the precompiled mdrun-gpu cannot run tpr files generated by the 4.5.1 grompp program.

Day 2:

I solved the problem above by building from source: the precompiled mdrun-gpu has a different internal version number from the 4.5.1 tools, hence the incompatibility. Following the instructions, the 4.5.1 mdrun-gpu compiled successfully:

——————————————————————————————
export OPENMM_ROOT_DIR=path_to_custom_openmm_installation
cmake -DGMX_OPENMM=ON
make mdrun
make install-mdrun
——————————————————————————————

But a new problem appeared; running it gave the error:

mdrun-gpu: error while loading shared libraries: libopenmm_api_wrapper.so: cannot open shared object file: No such file or directory

Very strange! The environment variables were all set correctly, yet no libopenmm_api_wrapper.so file could be found anywhere in the openmm directory.

Day 3:

I switched the operating system to RHEL 5.5 and, using the same installation procedure, the problem above went away. I do not understand why, but I expect there is a way to fix it on Ubuntu too (setting that aside for now).

Day 4:

A summary of the problems I ran into, and their solutions:

1. Version mismatch:
——————————————————————————————
Fatal error: reading tpx file (md.tpr) version 73 with version 71 program
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
——————————————————————————————
This says the versions are incompatible.

2. OpenMM does not support multiple temperature-coupling groups:
——————————————————————————————
Fatal error: OpenMM does not support multiple temperature coupling groups.
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
——————————————————————————————

3. Old mdp settings cannot be reused unchanged:
——————————————————————————————
Fatal error: OpenMM uses a single cutoff for both Coulomb and VdW interactions. Please set rcoulomb equal to rvdw.
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
——————————————————————————————

4. GPU-accelerated gmx currently supports only the AMBER and CHARMM force fields:
——————————————————————————————
Fatal error: The combination rules of the used force-field do not match the one supported by OpenMM: sigma_ij = (sigma_i + sigma_j)/2, eps_ij = sqrt(eps_i * eps_j). Switch to a force-field that uses these rules in order to simulate this system using OpenMM.
——————————————————————————————

5. GPU-accelerated gmx does not support (some) G96 interactions; again, really a force-field problem:
——————————————————————————————
Fatal error: OpenMM does not support (some) of the provided interaction type(s) (G96Angle)
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
——————————————————————————————

6. Building gromacs 4.5.1 with cmake on Ubuntu 9.10 runs into the missing libopenmm_api_wrapper.so problem; switching to RHEL 5.5 solves it:
——————————————————————————————
error while loading shared libraries: libopenmm_api_wrapper.so: cannot open shared object file: No such file or directory
——————————————————————————————

Day 5:

mdrun-gpu finally runs! The mdp file is the one from the benchmarks on the official site, but there are still some warnings:

It is also possible to optimize the transforms for the current problem by performing some calculations at the start of the run.
This is not done by default since it takes a couple of minutes, but for large runs it will save time. Turn it on by specifying

optimize_fft = yes

WARNING: OpenMM does not support leap-frog, will use velocity-verlet integrator.
WARNING: OpenMM supports only Andersen thermostat with the md/md-vv/md-vv-avek integrators.

Pre-simulation ~15s memtest in progress...done, no errors detected

starting mdrun 'Protein in water'
1000000 steps,   2000.0 ps.

               NODE (s)   Real (s)      (%)
       Time:     33.080     99.577     33.2
               (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
Performance:      0.000      0.074     47.609      0.504

gcq#330: "Go back to the rock from under which you came" (Fiona Apple)
————————————————————————————————

I can't quite make sense of this Performance line. Going by ns/day alone, the performance is five times that of the quad-core CPU, but in actual runs it is only about twice as fast. And the MFlops entry is an astonishing 0.074, while the CPU does 12 GFlops. Overall, GPU-accelerated GMX gives at least 2x the performance of a traditional quad-core CPU; for implicit-solvent models the official site claims more than 10x. More updates to come......

Day 6:

For the 190,000-atom system I set up a 10 ns run. Performance shows 5 ns/day, so in theory it should finish in two days, but in fact it will not finish until the 28th (it started at 1 pm on October 18), which clearly contradicts the reported performance:

5000000 steps,  10000.0 ps.
step 417300, will finish Thu Oct 28 10:39:59 2010    (started Oct 18; projected to finish Oct 28)

Received the TERM signal, stopping at the next step
step 417378, will finish Thu Oct 28 10:39:46 2010

Post-simulation ~15s memtest in progress...done, no errors detected

               NODE (s)   Real (s)      (%)
       Time:  13633.960  71173.931     19.2
                          3h47:13
               (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
Performance:      0.000      0.003      5.290      4.537

gcq#47: "I Am Testing Your Grey Matter" (Red Hot Chili Peppers)

Day 7:

Same system, same settings; below is the performance of a run on the quad-core QX9650 CPU. It is slower than the GTX460, but only by about three extra days:

——————————————————————————————
Back Off! I just backed up md.trr to ./#md.trr.1#
Back Off! I just backed up md.edr to ./#md.edr.1#
WARNING: This run will generate roughly 3924 Mb of data

starting mdrun 'Good gRace! Old Maple Actually Chews Slate in water'
5000000 steps,  10000.0 ps.
step 0
NOTE: Turning on dynamic load balancing
step 500, will finish Mon Nov  1 09:57:48 2010  vol 0.74  imb F  2%    (started the morning of Oct 19; projected to finish Nov 1)

Received the TERM signal, stopping at the next NS step
step 550, will finish Mon Nov  1 10:08:34 2010

Average load imbalance: 2.4 %
Part of the total run time spent waiting due to load imbalance: 1.1 %
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Y 0 %

Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:    123.856    123.856    100.0
                             2:03
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    156.395     11.835      0.769     31.220

gcq#358: "Now it's filled with hundreds and hundreds of chemicals" (Midlake)

Day 8:

Running the benchmark from the official site: the GTX460 scores 102 ns/day, not as far behind the C2050 as I had imagined!

Pre-simulation ~15s memtest in progress...done, no errors detected

starting mdrun 'Protein'
-1 steps, infinite ps.
step 285000 performance: 102.1 ns/day

Received the TERM signal, stopping at the next step
step 285028 performance: 102.1 ns/day

Post-simulation ~15s memtest in progress...done, no errors detected

               NODE (s)   Real (s)      (%)
       Time:    481.290    482.224     99.8
                             8:01
               (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
Performance:      0.000      0.002    102.335      0.235

Summary:

This new generation of GPU-accelerated Gromacs molecular dynamics shows the bright future of GPUs in the MD field, but it is not yet mature. From the tests above, in implicit-solvent MD the GPU-accelerated performance is at least 10x that of a traditional quad-core CPU, but with explicit solvent the GPU advantage is far less pronounced. Most importantly, the current Gromacs 4.5.1 places many restrictions on GPU-accelerated MD: force-field support is limited, many features are unsupported, and reproducibility of the simulations is poor compared with CPU runs; still, given sufficiently long simulation times, it will produce statistically meaningful trajectories with reasonably good reproducibility. For more information see: www.gromacs.org/GPU
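Drawing on the day-4 error list, here is a minimal sketch of mdp settings that stay compatible with the OpenMM backend of mdrun-gpu 4.5.1. The numeric values are illustrative (chosen to match the 10 ns run above), not tuned recommendations, and the file name is arbitrary:

cat > md_gpu.mdp <<'EOF'
; Sketch of OpenMM-friendly settings; assumes an AMBER or CHARMM force field,
; since those are the only ones mdrun-gpu 4.5.1 accepts (day-4 error #4).
integrator  = md          ; the GPU build warns it will use velocity-verlet anyway
dt          = 0.002
nsteps      = 5000000     ; 10 ns, as in the day-6 run

; OpenMM uses a single cutoff for Coulomb and VdW: rcoulomb must equal rvdw (error #3).
rcoulomb    = 1.0
rvdw        = 1.0

; Only one temperature-coupling group is allowed (error #2): couple the whole system.
tc-grps     = System
tcoupl      = berendsen   ; mdrun-gpu warns it supports only the Andersen thermostat
tau_t       = 0.1
ref_t       = 300
EOF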
Building Gromacs 4.0.7 in parallel with QM/MM support

Platform: SUSE Linux Enterprise Desktop 10 SP3, gcc 4.1.2, mpich2-1.2.1p1, ifort 10.1, fftw 3.2.2

Unpack mpich2-1.2.1p1.tar.gz, enter the directory, and run:

./configure
make
make install

Then run:

touch /etc/mpd.conf
chmod 700 /etc/mpd.conf

and add the following to mpd.conf:

secretword=secretword   (for example, secretword=ltwd)

Unpack fftw 3.2.2, enter the directory, and install it under /soft/fftw:

./configure --enable-float --enable-threads
make
make install

Copy libmopac.a into /soft/fftw/lib and /lib.

Set the environment variables:

setenv CPPFLAGS -I/soft/fftw/include
setenv LDFLAGS -L/soft/fftw/lib

Unpack gromacs 4.0.7 and enter the directory:

./configure --prefix=/soft/gromacs --enable-mpi --enable-fortran --with-qmmm-mopac --enable-shared
make
make install

Set the environment variables:

setenv LIBS -lmopac
setenv LD_LIBRARY_PATH /soft/gromacs/lib
source /soft/gromacs/bin/completion.csh
set path=(/soft/gromacs/bin $path)

Other configure options:

CC         C compiler command; usually gcc
CFLAGS     C compiler flags at build time, typically -O3
LDFLAGS    linker flags, e.g. -L<lib dir> if you have libraries in a nonstandard directory <lib dir>
LIBS       libraries to pass to the linker, e.g. -llibrary; no quotes needed
CPPFLAGS   C/C++/Objective C preprocessor flags, e.g. -I<include dir> if you have headers in a nonstandard directory <include dir>
F77        Fortran 77 compiler command; usually gfortran or ifort
FFLAGS     Fortran 77 compiler flags at build time, typically -O3
CCAS       assembler compiler command (defaults to CC)
CCASFLAGS  assembler compiler flags (defaults to CFLAGS)
CPP        C preprocessor
CXX        C++ compiler command; usually g++
CXXFLAGS   C++ compiler flags
CXXCPP     C++ preprocessor
XMKMF      Path to xmkmf, Makefile generator for X Window System

Optional Features:
--disable-FEATURE          do not include FEATURE (same as --enable-FEATURE=no)
--enable-FEATURE           include FEATURE
--enable-shared            build shared libraries
--disable-float            use double instead of single precision
--enable-double            same effect as --disable-float
--enable-fortran           use fortran (default on sgi, ibm, sun, axp)
--enable-mpi               compile for parallel runs using MPI
--disable-threads          don't try to use multithreading
--enable-mpi-environment=VAR   only start parallel runs when VAR is set
--disable-ia32-3dnow       don't build 3DNow! assembly loops on ia32
--disable-ia32-sse         don't build SSE/SSE2 assembly loops on ia32
--disable-x86-64-sse       don't build SSE assembly loops on X86_64
--disable-ppc-altivec      don't build Altivec loops on PowerPC
--disable-ia64-asm         don't build assembly loops on ia64
--disable-cpu-optimization no detection or tuning flags for cpu version
--disable-software-sqrt    no software 1/sqrt (disabled on sgi, ibm, ia64)
--enable-prefetch-forces   prefetch forces in innerloops
--enable-all-static        make completely static binaries
--disable-dependency-tracking  speeds up one-time build
--enable-dependency-tracking   do not reject slow dependency extractors
--enable-static            build static libraries
--enable-fast-install      optimize for fast installation
--disable-libtool-lock     avoid locking (might break parallel builds)
--disable-largefile        omit support for large files

Optional Packages:
--with-PACKAGE             use PACKAGE
--without-PACKAGE          do not use PACKAGE (same as --with-PACKAGE=no)
--with-fft=                FFT library to use; fftw3 is the default, fftpack is built in
--with-external-blas       use system BLAS library (add to LIBS); automatic on OS X
--with-external-lapack     use system LAPACK library (add to LIBS); automatic on OS X
--without-qmmm-gaussian    Interface to mod.
Gaussian0x for QM-MM (see website)
--with-qmmm-gamess         use modified Gamess-UK for QM-MM (see website)
--with-qmmm-mopac          use modified Mopac 7 for QM-MM (see website)
--with-gnu-ld              assume the C compiler uses GNU ld
--with-pic                 try to use only PIC/non-PIC objects
--with-tags                include additional configurations
--with-dmalloc             use dmalloc, as in http://www.dmalloc.com/dmalloc.tar.gz
--with-x                   use the X Window System
--with-motif-includes=DIR  Motif include files are in DIR
--with-motif-libraries=DIR Motif libraries are in DIR
--without-gsl              do not link to the GNU scientific library; prevents certain analysis tools from being built
--with-xml                 link to the xml2 library (experimental)
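With the MOPAC interface compiled in (--with-qmmm-mopac), a QM/MM run is switched on through the mdp file. Below is a hedged sketch of what that section might look like, using the option names from the Gromacs QM/MM interface documentation as I recall them (QMMM, QMMM-grps, QMmethod, QMbasis, QMcharge, QMmult); the index group QMatoms and the method/charge values are purely illustrative, so check them against the documentation before use:

cat >> md.mdp <<'EOF'
; QM/MM through the modified Mopac 7 interface built above (--with-qmmm-mopac).
; "QMatoms" is a hypothetical index group defining the QM region.
QMMM       = yes
QMMM-grps  = QMatoms
QMmethod   = AM1       ; semi-empirical method handled by MOPAC
QMbasis    = STO-3G    ; basis entry expected by the interface
QMcharge   = 0         ; total charge of the QM region
QMmult     = 1         ; spin multiplicity of the QM region
EOF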