科学网


Tag: Mining

Related forum posts

Board | Author | Replies/Views | Last post

No related content.

Related blog posts

data mining tutorials
justinzhao 2012-12-15 15:08
http://www.autonlab.org/tutorials/ by Prof. Andrew Moore
Category: Pattern Recognition | 2034 views | 0 comments
[Repost] 2011 Science MINE: data mining
genesquared 2012-11-12 08:56
Broad Scientists http://www.ebiotrade.com/newsf/2011-12/20111216172130494.htm
Category: MINE | 1 view | 0 comments
Paper reading notes: Challenging problems in process mining
Popularity 1 maoxianmenlian 2012-11-2 13:33
1. Mining hidden tasks: suppose we remove all events that refer to task A. If we assume that B and C are concurrent, there is clearly an AND-split; similarly, if we remove the events related to task D, we can detect that there must be an AND-join. In this example we could still automatically construct a model like Figure 1 (shown as Figure 2), but for many more complex processes it is very hard to add such "hidden tasks".

2. Mining duplicate tasks: a duplicate task means that two nodes in one process model refer to the same task. Consider renaming task E in Table 1 and Figure 1 to B. Clearly, the modified log can be regarded as output of the modified process model, but it is hard to construct a process model automatically from the modified Table 1, because the B in case 5 cannot be distinguished from the B in the other cases. Duplicate tasks are related to hidden tasks: many processes with hidden tasks but no duplicate tasks can be rewritten as equivalent processes with duplicate tasks but no hidden tasks.

3. Mining non-free-choice constructs: Figure 1 is a free-choice construct; the choice between A and E is settled when task D executes. Figure 4 shows a non-free-choice construct: after task C executes, a choice between task D and task E has to be made, but that choice is determined by the earlier choice between A and B. Tasks D and E are thus each involved in a choice, and the two flows can synchronize. Such constructs are hard to mine because the choice is "non-local": the mining algorithm needs to remember earlier events.

4. Mining loops: in a process, the same task can be executed many times; Figure 5 shows an example of a loop. After task B completes, task C may be executed many times, giving possible event sequences BD, BCD, BCCD, BCCCD, and so on. A loop like the one around task C is easy to discover, but in some processes a loop can jump back to an arbitrary place. For more complex processes, mining loops is more tedious, because the same task occurs at several positions within one case. Some techniques number each occurrence, e.g., BCCCD is represented as B1C1C2C3D1, and the numbered occurrences are then mapped back onto the same task. As Figure 5 shows, loops are related to duplicate tasks: task A is executed twice without being inside a loop.

5. Using time: Table 1 shows the minimal information needed to perform this kind of process mining; each row is one event. In many cases the log also contains time information, i.e., every event carries a timestamp. From a log with both start and end events, the execution time of a task can be computed. Time information serves two purposes: (1) attaching timing information to the process model, and (2) improving the quality of the discovered model. With time information it is relatively easy to improve the quality of a process model. One approach is to first mine the model while ignoring timestamps and then replay the log in the model; during replay it is easy to compute (average, variance, minimum, maximum) flow times, waiting times and processing times. We may also find that some cases do not fit the discovered model, and such information can be used to change the model (directly edit the generated model, clean the log, add knowledge, and re-run the mining algorithm). Timestamps can also improve the quality of the log itself: for example, if two events occur within a very short interval, they are quite likely to be causally related.

6. Mining different perspectives: process mining mainly takes the control-flow perspective, whose essence is mining the ordering of tasks; it can be extended to include time (timestamped events). Besides control flow there are the organizational, informational and application perspectives. The organizational perspective covers organizational structure and personnel: the structure describes the relations between roles (resource types based on function) and groups (resource types based on organization); resources, ranging from staff to equipment, form the organizational population and are allocated to roles and groups. The informational perspective deals with control data and production data: control data exist solely for process-management purposes, while production data are information objects (e.g., documents, forms and tables) that exist independently of process management. The application perspective deals with the applications used to execute tasks (e.g., the use of a text editor). Traces of the other perspectives can be found in event logs: an event can record which resource (e.g., which worker) executed the corresponding task, and this information can be used to derive knowledge about the organizational perspective (roles and groups, collaboration structures, cooperation efficiency, and so on). For instance, process mining can be used to find the cases handled by two particular workers that show markedly longer processing times. A log may also record data flow, such as updates of process variables within a given event; such logs can be used to derive knowledge about the informational perspective. Linking the informational perspective to the control-flow perspective is particularly interesting. For example, is flow time related to certain process variables (do large orders take longer)? Is the path of a case related to its process variables (do customers in a certain region require additional checks)? Most research has concentrated on the control-flow perspective, so process mining from the organizational, informational and application perspectives remains an interesting challenge.

7. Dealing with noise: most mining algorithms assume the information is correct. Although this is a valid assumption in most situations, a log may still contain "noise", i.e., incorrectly recorded information: some events are not recorded at all, or are recorded later than they actually occurred. Mining algorithms need to be robust enough to accommodate noisy data; for example, a causal relation should not be based on a single observation. Algorithms have to distinguish exceptions from the normal flow, and when noise is taken into account, a threshold is usually defined to exclude exceptional or incorrectly recorded behavior.

8. Dealing with incompleteness: related to noise is incomplete information. A log is incomplete if its information is insufficient to derive the process. Consider Table 1 and the correct underlying process in Figure 1, and suppose the path represented by case 5 is very rare. If only part of the cases is mined, say only cases 1 to 4 are recorded, the discovered model is wrong, because E and F are missing. This example looks trivial, but in real-life processes that allow parallel, conditional and iterative routing, one easily reaches a million possible paths. Consider Figure 6: the process contains no choices and every task is executed exactly once. Task B executes in parallel with the sequence of nine tasks C1, C2, ..., C9, so there are 10 possible paths; even with no choices at all, at least 10 cases are needed to infer this model. In fact, observing all interleavings of B with the sequence C1...C9 is unlikely, and perhaps thousands of cases are needed before the correct model can be discovered. If we change tasks C1...C9 in Figure 6 to run in parallel, there are 10! = 3,628,800 possible paths. In such situations the log can never be complete, and heuristic algorithms must be used. These heuristics are based on Occam's razor: if two theories both predict correctly, the simpler one is the better.

9. Gathering data from heterogeneous sources: today's enterprise information systems are extremely complex, consisting of a large number of different applications and components. Each application supports only a fragment of the process, and the information needed for process mining is scattered across the enterprise information system, so collecting event logs for process mining is very tedious work; even for a single product, events may involve different parts and layers of a system. One approach is to use a data warehouse to extract the information from these logs; some process mining tools propose XML as the input format.

10. Visualizing results: another challenge is to display the results of process mining intuitively, so that they are easy to understand; the term "management cockpit" is used here to emphasize how process mining results relate to their presentation. Existing commercial products such as ARIS and PPM mainly focus on performance indicators such as flow time and work in progress. Visualizing the complete control-flow perspective and the other perspectives is harder and requires more research.

11. Delta analysis: process mining generates process models covering the control-flow or other perspectives. However, a descriptive or prescriptive model may already exist: a business consultant may have drawn a process model with a diagramming tool, a WFM system requires an explicit process model, and an ERP system is configured on the basis of a reference model. Given such a man-made model, it is interesting to compare it with the model produced by process mining; delta analysis compares the two and identifies their differences. At present few techniques exist for detecting the differences between process models. Delta analysis is interesting both from a practical and from a scientific point of view, and deserves more attention.

Reference: W.M.P. van der Aalst and A.J.M.M. Weijters. Process Mining: A Research Agenda
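Several of these challenges (AND-splits, concurrency, causal ordering) rest on the directly-follows relation that discovery algorithms such as the alpha algorithm build on. The minimal Python sketch below shows that abstraction on a made-up toy log (it is not the Table 1 of the paper):

```python
from collections import defaultdict

# Hypothetical event log: one trace (task sequence) per case.
log = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["E", "F"],
]

def directly_follows(traces):
    """Count how often task x is immediately followed by task y."""
    df = defaultdict(int)
    for trace in traces:
        for x, y in zip(trace, trace[1:]):
            df[(x, y)] += 1
    return df

df = directly_follows(log)

# Alpha-style relations: x causes y iff x>y is observed but y>x never is;
# x and y are concurrent iff both x>y and y>x are observed.
causal = {(x, y) for (x, y) in df if (y, x) not in df}
parallel = {(x, y) for (x, y) in df if (y, x) in df}

print("causal:", sorted(causal))      # ('A','B') and ('A','C'): an AND-split after A
print("parallel:", sorted(parallel))  # ('B','C'), ('C','B'): B and C are concurrent
```

Note how noise interacts with this abstraction: a single spurious event creates a false directly-follows pair, which is why the count thresholds of challenge 7 matter.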
Category: Process Mining | 2286 views | 2 comments
[Liwei's science popularization: Automatic opinion polling]
Popularity 3 liwei999 2012-10-19 02:33
Automatic survey complements and/or replaces manual survey. That is the increasingly apparent direction and trend as social media are getting more popular every day.

Automatic polling (or machine polling: automatic survey / machine survey) means using computers to automatically extract and mine public opinion on a given topic from language data. The underlying technical field is sentiment mining, usually supported by natural language processing (NLP) and machine learning. Automatic polling is a complement to, or replacement for, traditional questionnaire-based polling. Now that social media are ubiquitous, public sentiment and opinion pour in through microblogs, blogs and forums; to detect, collect and absorb this opinion, automatic polling is imperative, because manual mining is simply overwhelmed by big data.

Opinion polls provide quantitative intelligence for decisions by governments, businesses and the public, and their applications are extremely broad. A presidential election is a prominent example: for the candidates and their campaign teams, and for voters, poll results help adjust strategy or make choices. Product launches are the business example: after the iPhone 5 launch, polling feedback could help Apple spot problems promptly; for prospective consumers, poll results help them decide whether to buy, wait, or switch brands without acting blindly.

Compared with traditional questionnaire-based polling, automatic polling has several distinctive strengths.

Timeliness. A traditional poll runs through a manual pipeline: designing the questionnaire, distributing it (phone interviews, street interviews, reward incentives), collecting it, and finally consolidating the results, so it cannot respond promptly; a serious customer survey often takes days or even weeks. Automatic polling delivers results on demand: for any topic, using an automatic polling system is as convenient as using a search engine, because behind the scenes the engine analyzes and indexes the relevant language data (usually from social media) around the clock.

Cost-effectiveness. Because traditional polling is manual, only a generous budget buys a poll of adequate scale (a small sample means a large error, defeating the purpose of the poll). An automatic poll is produced by a system, and one system can serve different clients and different topics, so it can be very cheap, a fraction of the cost of a traditional poll. Sample sizes can exceed manual response counts by n orders of magnitude, beyond anything traditional polling can reach. As for pricing, there are usually two business models: a client can license the system and then run any number of polls on any number of topics anytime, anywhere; occasional clients can pay per poll per topic.

Objectivity. Traditional polls require questionnaire design, which can introduce subjective factors intentionally or not, so ambiguity and even leading questions can never be fully excluded. Automatic polling is bottom-up automatic data analysis using inductive aggregation, and is therefore more objective. To get questionnaires completed, pollsters sometimes resort to material incentives, which produces respondents who answer purely for the reward and return low-quality questionnaires. Automatic polling works on spontaneous expressions of public opinion (paid posters and malicious manipulation aside), and the large base also helps suppress noise, which safeguards the objectivity of the intelligence.

Comparability. This point is particularly important, because a poll on almost any topic needs the background of competitors or of the whole industry: positive versus negative opinion, and the severity of problems, only become meaningful through comparison. Polling the effectiveness of Obama's presidential campaign requires comparison with his opponent Romney; surveying customers of AT&T's mobile network requires comparison with its competitor Verizon; and so on. Many brands can only be positioned in the market against a whole set of comparable brands (as the chart above shows). Such comparative polling is possible manually in theory, but because manual polls cost time, labor and money, pollsters often have to cut back or drop the surveys of competitors and spend their limited resources on their own brand alone. Automatic polling is different: multi-topic polling and comparison are built into the design of such products and are easy to deliver.

Automatic polling also faces challenges, chiefly man-made noise: given the messy reality of social media, with paid posters, "water armies" and malicious opinion running rampant, an effective sentiment system must constantly battle spam. Fortunately, the search engine field has accumulated rich experience to borrow here. Another challenge is classifying online media into two types (the so-called push/pull divide). Opinion polling must not be contaminated by the "will of the leadership", and customer intelligence must be kept apart from merchant promotion: the same praise is self-advertisement from a merchant but a verdict from a customer. This media classification can be decided by combining sources and tone (promotional material tends toward official press language, while customer reviews use colloquial and internet language), so it is tractable.

In short, in the Internet era, as social media reach deep into everyday life, public sentiment and opinion are increasingly expressed through social media, so automation is bound to become the direction and mainstream of future polling. The supporting technology is largely mature, and large-scale multilingual applications are within reach.

[Related posts]
Did Obama win last night's debate? Automatic sentiment detection tells you
Automatic social-media sentiment analysis: Ma Ying-jeou vs. Chen Shui-bian
Automatic sentiment analysis shows Google's public rating is twice Baidu's
Automatic sentiment analysis of the Fang-Han controversy
[Pinned: an index of Liwei's NLP posts on this ScienceNet blog (periodically updated)]

Liwei's motto: technology changes the world, even presidents... and even you and me.
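To make the sentiment-mining machinery above concrete, here is a minimal lexicon-based sketch; the lexicon, the posts and the brand names are all hypothetical, and a production system would add real NLP (negation handling, parsing), machine-learned classifiers, and the spam filtering discussed in the post:

```python
# Minimal lexicon-based sentiment scoring for comparative "automatic polling".
POSITIVE = {"great", "love", "fast", "reliable"}
NEGATIVE = {"slow", "hate", "dropped", "broken"}

def polarity(text: str) -> int:
    """Positive minus negative word count; a crude stand-in for real NLP."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Hypothetical social media posts, grouped by (hypothetical) brand.
posts = {
    "BrandA": ["love the new phone, so fast", "screen broken after a week"],
    "BrandB": ["calls dropped again, slow network", "reliable service, great support"],
}

# Net sentiment per brand: the comparative view the post argues is essential.
for brand, texts in posts.items():
    scores = [polarity(t) for t in texts]
    print(brand, "net sentiment:", sum(scores), "from", len(scores), "mentions")
```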
Category: Liwei Science Popularization | 8191 views | 5 comments
Mining Frequent & Maximal Reference Sequences with GST
xiaohai2008 2011-10-18 13:59
Web usage mining (WUM) is the type of Web mining activity that involves the automatic discovery of user access patterns from huge Web access logs. In this study, we deeply analyze the generalized suffix tree data structure in WUM situations and explain in detail the reasons why a linear-time traversal on the generalized suffix tree can obtain frequent reference sequences. The key point is that, due to the special nature of transactions, for each internal node v, the total number of leaves in the sub-tree of v is exactly the number of distinct (navigation-content) transaction identifiers that appear at the leaves in the sub-tree of v. After that, with the help of the generalized suffix tree, an algorithm for mining maximal reference sequences is proposed. Experimental results indicate that our approach is feasible and has good scalability. Full text: 2010_6_7_2187_2197.pdf
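The core quantity can be restated compactly: the support of a reference sequence is the number of distinct sessions that contain it as a contiguous subsequence. The brute-force Python sketch below (over a hypothetical session log) computes exactly what the paper obtains in linear time from the generalized suffix tree, and then keeps only the maximal frequent sequences:

```python
from collections import defaultdict

# Hypothetical user sessions: sequences of page identifiers.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "blog"],
    ["products", "cart", "checkout"],
]

def contains(t, s):
    """True if tuple s occurs contiguously inside tuple t."""
    return any(t[k:k + len(s)] == s for k in range(len(t) - len(s) + 1))

def frequent_maximal(sessions, min_support):
    # Support = number of DISTINCT session ids containing the sequence,
    # mirroring the leaf-counting argument in the abstract above.
    support = defaultdict(set)
    for sid, s in enumerate(sessions):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                support[tuple(s[i:j])].add(sid)
    frequent = {seq: len(ids) for seq, ids in support.items()
                if len(ids) >= min_support}
    # Maximal = not contained in any longer frequent sequence.
    return {seq: c for seq, c in frequent.items()
            if not any(len(t) > len(seq) and contains(t, seq) for t in frequent)}

print(frequent_maximal(sessions, min_support=2))
# {('home', 'products'): 2, ('products', 'cart', 'checkout'): 2}
```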
Category: Log Mining | 2872 views | 0 comments
Summary notes on Data Mining: Concepts and Techniques
jiangdm 2011-10-11 14:56
Data Mining: Concepts and Techniques, Jiawei Han and M. Kamber.
Cube computation: 1) ROLAP-based cube computation; 2) array-based (MOLAP); 3) bottom-up; 4) H-Cubing.
Key points: Chapter 4, multi-way array aggregation, attribute-oriented induction.
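For intuition about what cube computation produces, here is a naive full-materialization sketch over a hypothetical three-dimensional fact table: one aggregate per subset of dimensions (2^3 = 8 cuboids, from the base cuboid up to the apex). Multi-way array aggregation, the Chapter 4 highlight, computes the same cuboids simultaneously over a chunked array rather than by repeated scans as this sketch does:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical fact table: (city, product, month, sales).
facts = [
    ("SH", "phone", "Jan", 10),
    ("SH", "laptop", "Jan", 5),
    ("BJ", "phone", "Feb", 7),
]
DIMS = ("city", "product", "month")

cube = defaultdict(int)  # maps a group-by key to an aggregated measure
for city, product, month, sales in facts:
    row = {"city": city, "product": product, "month": month}
    for k in range(len(DIMS) + 1):
        for dims in combinations(DIMS, k):          # every cuboid
            key = tuple((d, row[d]) for d in dims)  # () is the apex cuboid
            cube[key] += sales

print(cube[()])                                   # grand total: 22
print(cube[(("city", "SH"),)])                    # all SH sales: 15
print(cube[(("city", "BJ"), ("month", "Feb"))])   # 7
```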
Category: ML | 1 view | 0 comments
review: Skyline query processing
jiangdm 2011-9-13 19:45
Skyline query processing. Wei Xiaojuan, Yang Jing, Li Cuiping, Chen Hong. Journal of Software (软件学报), 2008. Survey.

Revision history: 1) 2012-2-22

Abstract: classifies and surveys current skyline query processing methods. The paper first introduces the background of the skyline query processing problem, then presents in-memory skyline algorithms, and classifies existing external-memory skyline methods into indexed and non-indexed ones, evaluating the performance of each group of algorithms. It then introduces SKYCUBE, a model for multiple skyline queries over different subspaces, and the related research. It also covers strategies for skyline processing in different application environments as well as extensions of the skyline problem, and concludes with several directions for follow-up research.

Keywords: skyline query; SP; dominance relation; multi-objective optimization; SKYCUBE

Skyline queries (related queries: convex hulls, top-k queries, nearest neighbor queries).

Skyline query processing: from a given set S of objects in a D-dimensional space, select the subset in which no point is dominated by any other point of S. The skyline query is a multi-objective optimization problem.

Research on skyline queries falls into four classes: 1) single skyline query processing algorithms; 2) multiple skyline query processing algorithms; 3) skyline query processing in different application environments; 4) extensions of the skyline query processing problem.

Organization of the paper:
1) Section 1 introduces the background, motivation and application scenarios of skyline query processing and formally defines the problem.
2) Section 2 details single-skyline algorithms, including in-memory and external-memory algorithms, with and without indexes.
3) Section 3 presents multi-skyline algorithms, including the computation, maintenance and compression of SKYCUBE.
4) Section 4 covers skyline algorithms in different application environments.
5) Section 5 covers extensions of the skyline query processing problem.
6) The final section concludes and outlines future work.

1 The skyline query processing problem
1.1 Problem statement: the skyline problem is the Pareto-optimal or maximal-vector problem.
1.2 Skyline queries and top-k queries. Approaches to multi-objective optimization:
1-) Reduce to single-objective optimization (top-k query). Main idea: aggregate the multiple attributes of each object in S through a monotone weighting function into a single value, usually called the score; sort all objects by score and return the k largest or smallest objects as the query result.
2-) The Pareto approach (skyline query): instead of reducing to a single objective, solve the original multi-objective problem directly with a multi-objective algorithm; the result set is a set of incomparable, mutually non-dominated solutions (the SPs lying on the skyline).
3-) The lexicographic approach: assign different priorities to the attributes and optimize them in priority order; compare objects attribute by attribute from the highest priority downward, moving to the next priority whenever the values are equal, until a clear winner emerges.

2 Single skyline query algorithms: 2.1 in-memory algorithms; 2.2 external-memory algorithms.
3 Multiple skyline query algorithms: 3.1 SKYCUBE; 3.2 compressed SKYCUBE.
4 Skyline processing in different application environments: 4.1 Web information systems (a Web-based distributed skyline algorithm); 4.2 P2P networks (a P2P-based skyline algorithm); 4.3 data streams; 4.4 road networks.
5 Extensions of the skyline query processing problem: 5.1 skyline queries in high-dimensional spaces (top-k frequent skyline computation); 5.3 combining skyline and top-k queries.
6 Conclusions and outlook.

Personal comment: an introductory survey; ordinary as a piece of writing, but its breadth of content shows the authors put in real work.

beamer_Skyline查询处理.pdf beamer_Skyline查询处理.tex Skyline查询处理.pdf
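The dominance definition translates directly into code. Here is a minimal Python sketch of the naive (block-nested-loops flavored) skyline computation, with hypothetical hotel data and "smaller is better" in every dimension:

```python
def dominates(p, q):
    """p dominates q: p is no worse everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Points not dominated by any other point (the Pareto-optimal set)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical hotels: (price, distance_to_beach).
hotels = [(50, 8), (80, 2), (60, 5), (90, 9), (50, 5)]
print(skyline(hotels))  # [(80, 2), (50, 5)]: no hotel is both cheaper and closer
```

This O(n^2) loop is the baseline that the indexed and non-indexed external-memory algorithms surveyed in the paper improve upon.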
Category: ML | 1 view | 0 comments
review: the quest data mining system
jiangdm 2011-8-16 22:29
The Quest Data Mining System, Rakesh Agrawal, Manish Mehta, John Shafer, Ramakrishnan Srikant. KDD-96 Proceedings.

Document type: engineering summary
Research goal: the Quest project: enable a new breed of data-intensive decision-support applications.
Research method:
Reading notes (difficulties / key points / doubts):
Personal comment: an important early engineering-practice paper; worth returning to when reading later data mining work.
Shortcomings of the paper:
The author's other publications:
Related key references:

the quest data mining system.pdf beamer_quest_data_mining_system.pdf beamer_quest_data_mining_system.tex
Category: AI & ML | 0 comments
CCF Outstanding Doctoral Dissertations
jiangdm 2011-8-16 09:45
Contents: 2012
集合型离散粒子群优化及其在项目资源调度的应用研究.pdf
数字集成电路时序偏差的在线检测和容忍.pdf
非确定图数据的挖掘算法研究.pdf
多视图在利用未标记数据中的效用分析.pdf
不确定图数据查询处理技术的研究.pdf
无线传感器网络拓扑识别与构建技术研究.pdf
基于图模型表达和稀疏特征选择的图像语义理解.pdf
近似查询的有效性及性能优化问题研究.pdf
Category: ML | 149 views | 0 comments
review: Research status and development trends of Web Intelligence
jiangdm 2011-8-10 22:46
Research Status and Development Trends of Web Intelligence (Web智能研究现状与发展趋势), Wang Bennian, Gao Yang, Chen Shifu, Xie Junyuan. Journal of Computer Research and Development (计算机研究与发展), 2005.

Abstract: Web intelligence is a new research direction that has emerged in recent years, born from the fusion of artificial intelligence and advanced information technology in the new Web and Internet environment. The paper first discusses the concept, research topics, and functional and technical framework of Web intelligence, then surveys the state of the art of its core aspects, mainly the Semantic Web and ontology, Web agents, and Web mining, and further identifies their research focuses and development directions. It closes with the research outlook and the challenges, pointing out that the Wisdom Web is the goal and the medium-to-long-term direction of Web intelligence research.

Keywords: Web intelligence; Semantic Web; Web mining; Web agent; Wisdom Web

Personal comment: Web Intelligence: WI = AI + IT. The paper identifies these hot topics (as of 2005): Semantic Web and ontology; Web agents; Web mining. I am bullish on Web mining for its practicality; the first two have already cooled off.

Addendum, Aug 11: I had previously read A Survey of Computational Web Intelligence (计算Web智能研究综述), Duan Qiguo, Miao Duoqian, Chen Min, Wang Ruizhi. Computer Science (计算机科学), 2007. Abstract: Computational Web intelligence is a new research direction proposed in recent years. It combines computational intelligence with Web technology and aims to raise the level of intelligence of e-commerce and other Web applications on the Internet and wireless networks. The paper first analyzes the research background of computational Web intelligence, then explains its concepts and related technologies, summarizes its main current research topics and applications, and finally discusses future research directions and challenges. Keywords: computational Web intelligence; computational intelligence; Web agent; rough sets; granular computing.

计算Web智能研究综述.pdf beamer_web_intelligence.pdf
Category: AI & ML | 3259 views | 0 comments
review: 10 CHALLENGING PROBLEMS IN DATA MINING RESEARCH
jiangdm 2011-8-8 12:03
10 Challenging Problems in Data Mining Research, Qiang Yang, Xindong Wu. International Journal of Information Technology & Decision Making, 2006.

Abstract: In October 2005, we took an initiative to identify 10 challenging problems in data mining research, by consulting some of the most active researchers in data mining and machine learning for their opinions on what are considered important and worthy topics for future research in data mining. We hope their insights will inspire new research efforts, and give young researchers (including PhD students) a high-level guideline as to where the hot problems are located in data mining. Due to the limited amount of time, we were only able to send out our survey requests to the organizers of the IEEE ICDM and ACM KDD conferences, and we received an overwhelming response. We are very grateful for the contributions provided by these researchers despite their busy schedules. This short article serves to summarize the 10 most challenging problems of the 14 responses we have received from this survey. The order of the listing does not reflect their level of importance.

Keywords: Data mining; machine learning; knowledge discovery.

Personal comment: a forward-looking data mining paper, and a valuable one. A pity that no references are attached.

beamer_10_CHALLENGING_PROBLEMS_DATA_MINING_RESEARCH.pdf beamer_10_CHALLENGING_PROBLEMS_DATA_MINING_RESEARCH.tex 10 CHALLENGING PROBLEMS IN DATA MINING RESEARCH.pdf

Xindong Wu's slides on the 10 challenges are well done: Data Mining Opportunities and Challenges.ppt
Category: AI & ML | 0 comments
review: Knowledge Discovery in Databases: An Overview
jiangdm 2011-8-4 15:52
Knowledge Discovery in Databases: An Overview, William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus. AAAI, 1992.

Abstract: After a decade of fundamental interdisciplinary research in machine learning, the spadework in this field has been done; the 1990s should see the widespread exploitation of knowledge discovery as an aid to assembling knowledge bases. The contributors to the AAAI Press book "Knowledge Discovery in Databases" were excited at the potential benefits of this research. The editors hope that some of this excitement will communicate itself to AI Magazine readers of this article.

The goal of this article: This article presents an overview of the state of the art in research on knowledge discovery in databases. We analyze knowledge discovery and define it as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. We then compare and contrast database, machine learning, and other approaches to discovery in data. We present a framework for knowledge discovery and examine problems in dealing with large, noisy databases, the use of domain knowledge, the role of the user in the discovery process, discovery methods, and the form and uses of discovered knowledge. We also discuss application issues, including the variety of existing applications and propriety of discovery in social databases. We present criteria for selecting an application in a corporate environment. In conclusion, we argue that discovery in databases is both feasible and practical and outline directions for future research, which include better use of domain knowledge, efficient and incremental algorithms, interactive systems, and integration on multiple levels.

Personal comment: an older, classic data mining survey. In my view the paper offers two entry points: machine learning (Tables 1 and 2) and its Figure 1.

Knowledge Discovery in Databases Overview.pdf beamer_Knowledge_Discovery_Database_Overview.pdf beamer_Knowledge_Discovery_Database_Overview.tex
Category: AI & ML | 1 view | 0 comments
[Repost] Social Network Analysis and Mining
Popularity 1 rbwxy197301 2011-5-2 17:52
Social Network Analysis and Mining
ISSN: 1869-5450 (print version); ISSN: 1869-5469 (electronic version); Journal no. 13278

The rapid increase in the interest in social networks has motivated the need for more specialized venues with a wider spectrum, capable of meeting the needs and expectations of a variety of researchers and readers. Social Network Analysis and Mining (SNAM) is intended to be a multidisciplinary journal serving both academia and industry as a main venue for a wide range of researchers and readers from the social sciences, mathematical sciences, medical and biological sciences, and computer science. We solicit experimental and theoretical work on social network analysis and mining using different techniques from sociology, social sciences, mathematics, statistics, and computer science.

The main areas covered by SNAM include: (1) data mining advances on the discovery and analysis of communities, personalization for solitary activities (like search) and social activities (like discovery of potential friends), the analysis of user behavior in open forums (like conventional sites, blogs and forums) and in commercial platforms (like e-auctions), and the associated security and privacy-preservation challenges; (2) social network modeling, construction of scalable, customizable social network infrastructure, identification and discovery of dynamics, growth, and evolution patterns using machine learning approaches or multi-agent based simulation. Papers should elaborate on data mining or related methods, issues associated with data preparation and pattern interpretation, both for conventional data (usage logs, query logs, document collections) and for multimedia data (pictures and their annotations, multi-channel usage data).

Topics include but are not limited to:
- Web community
- Personalization for search and for social interaction
- Recommendations for product purchase, information acquisition and establishment of social relations
- Recommendation networks
- Data protection inside communities
- Misbehaviour detection in communities
- Preparing data for web mining
- Pattern presentation for end-users and experts
- Evolution of communities in the Web
- Community discovery in large-scale social networks
- Dynamics and evolution patterns of social networks, trend prediction
- Contextual social network analysis
- Temporal analysis on social networks topologies
- Search algorithms on social networks
- Multi-agent based social network modeling and analysis
- Large-scale graph algorithms
- Applications of social network analysis
- Anomaly detection in social network evolution

Related subjects: Applications; Complexity; Database Management & Information Retrieval; Ecology; Game Theory / Mathematical Methods; Social Sciences

Source: http://www.springer.com/computer/database+management+%26+information+retrieval/journal/13278
Category: Academic Journals | 6062 views | 1 comment
[Repost] Cross-Lingual Text Mining
timy 2011-2-14 23:34
Encyclopedia of Machine Learning, Springer Science+Business Media, LLC 2011. 10.1007/978-0-387-30164-8_189. Claude Sammut and Geoffrey I. Webb (Eds.)

Cross-Lingual Text Mining
Nicola Cancedda and Jean-Michel Renders
(1) Xerox Research Centre Europe, Meylan, France

Definition

Cross-lingual text mining is a general category denoting tasks and methods for accessing the information in sets of documents written in several languages, or whenever the language used to express an information need is different from the language of the documents. A distinguishing feature of cross-lingual text mining is the necessity to overcome some language translation barrier.

Motivation and Background

Advances in mass storage and network connectivity make enormous amounts of information easily accessible to an increasingly large fraction of the world population. Such information is mostly encoded in the form of running text which, in most cases, is written in a language different from the native language of the user. This state of affairs creates many situations in which the main barrier to the fulfillment of an information need is not technological but linguistic. For example, in some cases the user has some knowledge of the language in which the text containing a relevant piece of information is written, but does not have a sufficient control of this language to express his/her information needs. In other cases, documents in many different languages must be categorized in a same categorization schema, but manually categorized examples are available for only one language. While the automatic translation of text from a natural language into another (machine translation) is one of the oldest problems on which computers have been used, a palette of other tasks has become relevant only more recently, due to the technological advances mentioned above. Most of them were originally motivated by needs of government intelligence communities, but received a strong impulse from the diffusion of the World-Wide Web and of the Internet in general.

Tasks and Methods

A number of specific tasks fall under the term of cross-lingual text mining (CLTM), including:

• Cross-language information retrieval
• Cross-language document categorization
• Cross-language document clustering
• Cross-language question answering

These tasks can in principle be performed using methods which do not involve any text mining, but as a matter of fact all of them have been successfully approached relying on the statistical analysis of multilingual document collections, especially parallel corpora. While CLTM tasks differ in many respects, they are all characterized by the fact that they require to reliably measure the similarity of two text spans written in different languages. There are essentially two families of approaches for doing this:

1. In translation-based approaches one of the two text spans is first translated into the language of the other. Similarity is then computed based on any measure used in mono-lingual cases. As a variant, both text spans can be translated in a third pivot language.
2. In latent semantics approaches, an abstract vector space is defined based on the statistical properties of a parallel corpus (or, more rarely, of a comparable corpus). Both text spans are then represented as vectors in such latent semantic space, where any similarity measure for vector spaces can be used.

The rest of this entry is organized as follows: first translation-based approaches will be introduced, followed by latent-semantic approaches. Finally, each of the specific CLTM tasks will be discussed in turn.

Translation-Based Approaches

The simplest approach consists in using a manually-written machine-readable bilingual dictionary: words from the first span are looked up and replaced with words in the second language (see e.g., Zhang & Vines, 2005). Since typically dictionaries contain entries for "citation forms" only (e.g., the singular for nouns, the infinitive for verbs, etc.), words in both spans are preliminarily lemmatized, i.e., replaced with the corresponding citation form. In all cases when the lexica and morphological analyzers required to perform lemmatization are not available, a frequently adopted crude alternative consists in stemming (i.e., truncating by taking away a suffix) both the words in the span to be translated and in the corresponding side in the lexicon. Some languages (e.g., Germanic languages) are characterized by a very productive compounding: simpler words are connected together to form complex words. Compound words are rarely in dictionaries as such: in order to find them it is first necessary to break compounds into their elements. This can be done based on additional linguistic resources or by means of heuristics, but in all cases it is a challenging operation in itself. If the method used afterward to compare the two spans in the target language can take weights into account, translations are "normalized" in such a way that the cumulative weight of all translations of a word is the same regardless of the number of alternative translations. Most often, the weight is simply distributed uniformly among all alternative translations. Sometimes, only the first translation for each word is kept, or the first two or three.

A second approach consists in extracting a bilingual lexicon from a parallel corpus instead of using a manually-written one. Methods for extracting probabilistic lexica look at the frequencies with which a word s in one language was translated with a word t to estimate the translation probability p(t|s). In order to determine which word is the translation of which other word in the available examples, these examples are preliminarily aligned, first at the sentence level (to know what sentence is the translation of what other sentence) and then at the word level. Several methods for aligning sentences at the word level have been proposed, and this problem is a lively research topic in itself (see Brown, Della Pietra, Della Pietra, & Mercer, 1993 for a seminal paper). Once a probabilistic bilingual dictionary is available, it can be used much in the same way as human-written dictionaries, with the notable difference that the estimated conditional probabilities provide a natural way to distribute weight across translations. When the example documents used for extracting the bilingual dictionaries are of the same style and domain as the text spans to be translated, this can result in a significant increase in accuracy for the final task, whatever this is.

It is often the case that a parallel corpus sufficiently similar in topic and style to the spans to be translated is unavailable, or it is too small to be used for reliably estimating translation probabilities. In such cases, it can be possible to replace or complement the parallel corpus with a "comparable" corpus. A comparable corpus is a pair of collections of documents, one in each of the languages of interest, which are known to be similar in content, although not the translation of one another. A typical case might be two sets of articles from corresponding sections of different newspapers collected during a same period of time. If some additional bilingual seed dictionary (human-written or extracted from a parallel corpus) is also available, then the comparable corpus can be leveraged as well: a word t is likely to be the translation of a word s if it turns out that the words often appearing near s are translations of the words often appearing near t. Using this observation it is thus possible to estimate the probability that t is a valid translation of s even though they are not contained in the original dictionary. Most approaches proceed by associating with s a context vector. This vector, with one component for each word in the source language, can simply be formed by summing together the count histograms of the words occurring within a fixed window centered in all occurrences of s in the corpus, but is often constructed using statistically more robust association measures, such as mutual information. After a possible normalization step, the context vector CV(s) is translated using the seed dictionary into the target language. A context vector is also extracted from the corpus for all target words t. Eventually, a translation score between s and t is computed as ⟨Tr(CV(s)), CV(t)⟩, where the components of the context vectors are given by the association score used to construct them. While effective in many cases, this approach can provide inaccurate similarity values when polysemous words and synonyms appear in the corpus. To deal with this problem, Gaussier, Renders, Matveeva, Goutte, and Déjean (2004) propose an extension which is more robust in cases when the entries in the seed bilingual dictionary do not cover all senses actually present in the two sides of the comparable corpus.

Although these methods for building bilingual dictionaries can be (and often are) used in isolation, it can be more effective to combine them. Using a bilingual dictionary directly is not the only way for translating a span from one language into another. A second alternative consists in using a machine translation (MT) system. While the MT system, in turn, relies on a bilingual dictionary of some sort, it is in general in the position of leveraging contextual clues to select the correct words and put them in the right order in the translation. This can be more or less useful depending on the specific task. MT systems fall, broadly speaking, into two classes: rule-based and statistical. Systems in the first class rely on sets of hand-written rules describing how words and syntactic structures should be translated. Statistical machine translation (SMT) systems learn this mapping by performing a statistical analysis of a parallel corpus. Some authors (e.g., Savoy & Berger, 2005) also experimented with combining translation from multiple machine translation systems.

Latent Semantic Approaches

In CLTM, latent semantic approaches rely on some interlingua (language-independent) representation. Most of the time, this interlingua representation is obtained by linear or non-linear statistical analysis techniques, and more specifically dimensionality reduction methods with ad-hoc optimization criteria and constraints. Others adopt a more manual approach, exploiting multilingual thesauri or even multilingual ontologies in order to map textual objects towards a (possibly weighted) list of interlingua concepts.

For any textual object (typically a document or a section of a document), the interlingua concept representation is derived from a sequence of operations that encompass:

1. Linguistic preprocessing (as explained in previous sections, this step amounts to extracting the relevant, normalized "terms" of the textual objects, by tokenization, word segmentation/decompounding, lemmatization/stemming, part-of-speech tagging, stopword removal, corpus-based term filtering, noun-phrase extraction, etc.).
2. Semantic enrichment and/or monolingual dimensionality reduction.
3. Interlingua semantic projection.

A typical semantic enrichment method is the generalized vector space model, which adds related terms (or neighbour terms) to each term of the textual object, neighbour terms being defined by some co-occurrence measure (for instance, mutual information). Semantic enrichment can alternatively be achieved by using a (monolingual) thesaurus, exploiting relationships such as synonymy, hyperonymy, and hyponymy. Monolingual dimensionality reduction typically consists in performing some latent semantic analysis (LSA), a form of principal component analysis on the textual object/term matrix. Dimensionality reduction techniques such as LSA or their discrete/probabilistic variants such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) offer to some extent a semantic robustness to deal with the effects of polysemy/synonymy, adopting a language-dependent concept representation in a space of dimension much smaller than the size of the vocabulary in a language.

Of course, steps (1) and (2) are highly language-dependent. Textual objects written in different languages will not follow the same linguistic processing or semantic enrichment/dimensionality reduction. The last step (3), however, aims at projecting textual objects in the same language-independent concept space, for any source language. This is done by first extracting these common concepts, typically from a parallel corpus that offers a natural multiple-view representation of the same objects. Starting from these multiple-view observations, common factors are extracted through the use of canonical correlation analysis (CCA), cross-language latent semantic analysis, their kernelized variants (e.g., kernel-CCA) or their discrete, probabilistic extensions (cross-language latent Dirichlet allocation, multinomial CCA, ...). All these methods try to discover latent factors that simultaneously explain as much as possible the "intra-language" variance and the "inter-language" correlation. They differ in the choice of the underlying distributions and in how they precisely define and combine these two criteria. The following subsections describe them in more detail.

As already emphasized, CLTM mainly relies on defining appropriate similarities between textual objects expressed in different languages. Numerous categorization, clustering and retrieval algorithms focus on defining efficient and powerful measures of similarity between objects, as strengthened recently by the development of kernel methods for textual information access. We will see that the (linear) statistical algorithms used for performing steps (2) and (3) can most of the time be embedded into one valid (Mercer) kernel, so that we can very easily obtain non-linear variants of these algorithms, just by adopting some standard non-linear kernels.

Cross-Language Semantic Analysis

This amounts to concatenating the vectorial representations of each view of the objects of the parallel collection (typically, objects are aligned sentences), and then performing a standard singular value decomposition of the global object/term matrix. Equivalently, defining the kernel similarity matrix between all pairs of multi-view objects as the sum of the mono-lingual textual similarity matrices, this amounts to performing the eigenvalue decomposition of the corresponding kernel Gram matrix, if a dual formulation is adopted. The number of eigenvalues/eigenvectors retained to define the latent factors and the corresponding projections ranges typically from several hundred components to several thousand, still much fewer than the original sizes of the vocabularies. Note that this process does not really control the formation of interlingua concepts: nothing prevents the method from extracting factors that are linear combinations of terms in one language only.

Cross-Language Latent Dirichlet Allocation

The extraction of interlingua components is realised by using LDA to model the set of parallel objects, imposing the same proportion of components (topics) for all views of the same object (Figure 1: latent Dirichlet allocation of a parallel corpus). LDA performs some form of clustering, with a predefined number of components (K) and with the constraint that the two views of the same object belong to the clusters with the same membership values. This results in 2K component profiles that are then used for "folding in" (projecting) new documents by launching some form of EM to derive their posterior probabilities of belonging to each of the language-independent components. The similarity between two documents written in different languages is obtained by comparing their posterior distributions over these latent classes. Note that this approach could easily integrate supervised topic information, and it provides a nice framework for semi-supervised interlingua concept extraction.

Cross-Language Canonical Correlation Analysis

The primal formulation: CCA is a standard statistical method for multi-block multivariate analysis, the goal being to find linear combinations of variables for each block (i.e., each language) that are maximally correlated. In other words, CCA is able to enforce the commonality of latent concept formation by extracting maximally correlated projections. Starting from a set of paired views of the same objects (typically, aligned sentences of a parallel corpus) in languages L1 and L2, the algebraic formulation of this optimization problem leads to a generalized eigenvalue problem of size (n1 + n2), where n1 and n2 are the sizes of the vocabularies in L1 and L2 respectively. For obvious scalability reasons, the dual (or kernel) formulation, of size N, the number of paired objects in the training set, is often preferred.

Kernel Canonical Correlation Analysis

Basically, kernel canonical correlation analysis amounts to doing CCA in some implicit, but more complex, feature space and expressing the projection coefficients as linear combinations of the training paired objects. This results in the dual formulation, a generalized eigenvalue/vector problem of size 2N that involves only the monolingual kernel Gram matrices K1 and K2 (matrices of monolingual textual similarities between all pairs of objects in the training set in languages L1 and L2 respectively). Note that it is easy to show that the eigenvalues go by pairs: we always have two symmetrical eigenvalues +λ and −λ. This kernel formulation has the advantage of including any text-specific prior properties in the kernel (e.g., use of N-gram kernels, word-sequence kernels, and any semantically-smoothed kernel). After extraction of the first k generalized eigenvalues/eigenvectors, the similarity between any pair of test objects in languages L1 and L2 can be computed by using projection matrices composed of the extracted eigenvectors, as well as the (monolingual) kernels of the test objects with the training objects.

Regularization and Partial Least Squares Solution

When the number of training examples (N) is less than n1 and n2 (the dimensions of the monolingual feature spaces), the eigenvalue spectrum of the KCCA problem generally has two null eigenvalues (due to data centering), (N − 1) eigenvalues at +1 and (N − 1) eigenvalues at −1, so that, as such, the KCCA problem only results in trivial solutions and is useless. When using kernel methods, the case N < n1, n2 is frequent, so that some regularization scheme is needed. One way of realizing this regularization is to resort to finding the directions of maximum covariance (instead of correlation): this can be considered as a partial least squares (PLS) problem, whose formulation is very similar to the CCA problem. Adopting a mixed CCA/PLS criterion (trying to maximize a combination of covariance and correlation between projections) turns out both to avoid over-fitting (or spurious solutions) and to enhance numerical stability.

Approximate Solutions

Both CCA and KCCA suffer from a lack of scalability, due to the fact that the complexity of generalized eigenvalue/vector decomposition is O(N^3) for KCCA or O(min(n1, n2)^3) for CCA. As it can be shown that performing a complete KCCA (or KPLS) analysis amounts to first doing complete PCAs, and then a linear CCA (or PLS) on the resulting new projections, it is obvious that we could reduce the complexity by working on a reduced-rank approximation (incomplete KPCA) of the kernel matrices. However, the implicit projections derived from incomplete KPCA may be not optimal with respect to cross-correlation or covariance criteria. Another idea to decrease the complexity is to perform an incomplete Cholesky decomposition of the (monolingual) kernel matrices K1 and K2 (equivalent to partial Gram-Schmidt orthogonalisation in the feature space): K1 = G1·G1^T and K2 = G2·G2^T, with Gi of rank k ≪ N. Considering Gi as the new representation of the training data, KCCA now reduces to solving a generalized eigenvalue problem of size 2k.

Specific Applications

The previous sections illustrated a number of different ways of solving the core problem of cross-language text mining: quantifying the similarity between two spans of text in different languages. In this section we turn to describing some actual applications relying on these methods.

Cross-Language Information Retrieval (CLIR)

Given a collection of documents in several languages and a single query, the CLIR problem consists in producing a single ranking of all documents according to their relevance to the query. CLIR is in particular useful whenever a user has some knowledge of the languages in which documents are written, but not enough to express his/her information needs in those languages by means of a precise query. Sometimes CLIR engines are coupled with translation tools to help the user access the content of relevant documents written in languages unknown to him/her. In this case document collections in an even larger number of languages can be effectively queried. It is probably fair to say that the vast majority of CLIR systems use a translation-based approach. In most cases it is the query which is translated in all languages before being sent to monolingual search engines. While this limits the amount of translation work that needs to be done, it requires doing it on-line at query time. Moreover, when queries are short it can be difficult to translate them correctly, since there is little context to help identify the correct sense in which words are used. For these reasons several groups also proposed translating all documents at indexing time instead. Regardless of whether queries or documents are translated, whenever similarity scores between (possibly translated) queries and (possibly translated) documents are not directly comparable, all methods then face the problem of merging multiple monolingual rankings into a single multilingual ranking. Research in CLIR and cross-language question answering (see below) has been significantly stimulated by at least three government-sponsored evaluation campaigns:

• The NII Test Collection for IR Systems (NTCIR) (http://research.nii.ac.jp/ntcir/), running yearly since 1999, focusing on Asian languages (Japanese, Chinese, Korean) and English.
• The Cross-Language Evaluation Forum (CLEF) (http://www.clef-campaign.org), running yearly since 2000, focusing on European languages.
• A cross-language track at the Text Retrieval Conference (TREC) (http://trec.nist.gov/), which was run until 2002, focused on querying documents in Arabic using queries in English.

The respective websites are ideal starting points for any further exploration of the subject.

Cross-Language Question Answering (CLQA)

Question answering is the task of automatically finding the answer to a specific question in a document collection. While in practice this vague description can be instantiated in many different ways, the sense in which the term is mostly understood is strongly influenced by the task specification formulated by the National Institute of Standards and Technology (NIST) of the United States for its TREC evaluation conferences (see above). In this sense, the task consists in identifying a text snippet, i.e., a substring of a predefined maximal length (e.g., 50 characters, or 200 characters), within a document in the collection containing the answer. Different classes of questions are considered:

• Questions around facts and events.
• Questions requiring the definition of people, things and organizations.
• Questions requiring as answer lists of people, objects or data.

Most proposals for solving the QA problem proceed by first identifying promising documents (or document segments) by using information retrieval techniques treating the question as a query, and then performing some finer-grained analysis to converge to a sufficiently short snippet. Questions are classified in a hierarchy of possible "question types." Also, documents are preliminarily indexed to identify elements (e.g., person names) that are potential answers to questions of relevant types (e.g., "Who" questions). Cross-language question answering (CLQA) is the extension of this task to the case where the collection contains documents in a language different from the language of the question. In this task a CLIR step replaces the monolingual IR step to shortlist promising documents. The classification of the question is generally done in the source language. Both CLEF and NTCIR (see above) organize cross-language question answering comparative evaluations on an annual basis.

Cross-Language Categorization (CLCat) and Clustering (CLClu)

Cross-language categorization tackles the problem of categorizing documents in different languages in a same categorization scheme. The vast majority of document categorization systems rely on machine learning techniques to automatically acquire the necessary knowledge (often referred to as a model) from a possibly large collection of manually categorized documents. Most often the model is based on frequency counts of words, and is thus intrinsically language-dependent. The most direct way to perform categorization in different languages would consist in manually categorizing a sufficient amount of documents in all languages of interest and then training a set of independent categorizers. In some cases, however, it is impractical to manually categorize a sufficient number of documents to ensure accurate categorization in all languages, while it can be easier to identify bilingual dictionaries or parallel (or comparable) corpora for the language pairs and in the application domain of interest. In such cases it is then preferable to obtain manually categorized documents only for a single language A and use them to train a monolingual categorizer. Any of the translation-based approaches described above can then be used to translate a document originally in language B (or, most often, its representation as a bag of words) into language A. Once the document is translated, it can be categorized using the monolingual A system. As an alternative, latent-semantics approaches can be used as well. An existing parallel corpus can be used to identify an abstract vector space common to A and B. The manually categorized documents in A can then be represented in this space, and a model can be learned which operates directly on this latent-semantic representation. Whenever a document in B needs to be categorized, it is first projected in the common semantic space and then categorized using the same model. All these considerations carry over unchanged to the cross-language clustering task, which consists in identifying subsets of documents in a multilingual document collection which are mutually similar to one another according to some criterion. Again, this task can be effectively solved either by translating all documents into a single language or by learning a common semantic space and performing the clustering task there. While CLCat and clustering are relevant tasks in many real-world situations, it is probably fair to say that less effort has been devoted to them by the research community than to CLIR and CLQA.

Recommended Reading

Brown, P. F., Della Pietra, V. J., Della Pietra, S. A., & Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.
Gaussier, E., Renders, J.-M., Matveeva, I., Goutte, C., & Déjean, H. (2004). A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd annual meeting of the Association for Computational Linguistics, Barcelona, Spain. Morristown, NJ: Association for Computational Linguistics.
Savoy, J., & Berger, P. Y. (2005). Report on CLEF-2005 evaluation campaign: Monolingual, bilingual and GIRT information retrieval. In Proceedings of the cross-language evaluation forum (CLEF) (pp. 131–140). Heidelberg: Springer.
Zhang, Y., & Vines, P. (2005). Using the web for translation disambiguation. In Proceedings of the NTCIR-5 workshop meeting, Tokyo, Japan.
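To make the cross-language semantic analysis section above concrete, here is a toy sketch of the SVD variant: concatenate the two views of a parallel corpus into one object/term matrix, factorize it, and use the right singular vectors as a shared projection for monolingual bags of words. The tiny corpus and both vocabularies are fabricated for illustration:

```python
import numpy as np

# Hypothetical parallel corpus: 3 aligned documents, bag-of-words counts.
en_terms = ["cat", "dog", "food"]
fr_terms = ["chat", "chien", "nourriture"]
X_en = np.array([[2, 0, 1], [0, 2, 1], [1, 1, 2]])  # English view
X_fr = np.array([[2, 0, 1], [0, 2, 1], [1, 1, 2]])  # aligned French view

# Concatenate the views and take a truncated SVD: the right singular vectors
# place terms of BOTH languages in one latent "interlingua" space.
X = np.hstack([X_en, X_fr])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
project = Vt[:2].T  # (n1 + n2) x k projection matrix, here k = 2

def embed(counts, lang):
    """Map a monolingual bag of words into the shared latent space."""
    v = np.zeros(len(en_terms) + len(fr_terms))
    offset = 0 if lang == "en" else len(en_terms)
    v[offset:offset + len(counts)] = counts
    return v @ project

q = embed([1, 0, 0], "en")  # query "cat"
d = embed([1, 0, 0], "fr")  # document "chat"
print(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))  # ~1.0, no shared words needed
```

As the entry notes, nothing in this construction prevents a latent factor from mixing terms of a single language only; the CCA-based variants add the explicit cross-language correlation constraint.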
Category: Text Mining | 0 comments
[Repost] A comprehensive collection of Data Mining resources
wlp8631 2010-10-19 20:12
A comprehensive collection of Data Mining resources (default category, 2009-07-18 21:16:43)

Data Mining: What Is Data Mining? http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
Data Mining - An Introduction http://databases.about.com/library/weekly/aa100700a.htm?iam=excite_1terms=data+mining
Data Mining - An Introduction Student Notes http://www.pcc.qub.ac.uk/tec/courses/datamining/stu_notes/dm_book_1.html
Data Mining Overview http://www.megaputer.com/dm/index.php3
Data Mining - Award Winning Software http://www.salford-systems.com/?source=goto
Data Mining With MicroStrategy Best In Business Intelligence http://www.microstrategy.com/Software/Mining.asp?CID=1818dm
Data Mining, Web Mining and Knowledge Discovery Directory http://www.kdnuggets.com/
Data Miners Home Page http://www.data-miners.com/
Data Mining and Knowledge Discovery Journal http://www.digimine.com/usama/datamine/
Data Mining and Knowledge Discovery Journal http://www.kluweronline.com/issn/1384-5810
Effective Data Mining Technology http://www.enablesoft.com/
Find Data Mining Solutions http://www.knowledgestorm.com/SearchServlet?ksAction=keyMapx=data+miningsite=Overture
Data Mining Solutions - Business Intelligence http://www.netsoft-usa.com/01_bi.aspx
Data Mining Resources http://databases.about.com/cs/datamining/index.htm?PM=ss15_databases
The Data Mine Information Index About Data Mining http://www.the-data-mine.com/
ITtoolbox Business Intelligence http://businessintelligence.ittoolbox.com/
Mining Data For Actionable Business Decisions http://internet.about.com/library/aa_data_mining_041202.htm?iam=excite_1terms=data+mining
The Data Mining Group http://www.dmg.org/
Data Mining Software http://www.knowledgestorm.com/SearchServlet?ksAction=keyMapx=Data+Mining+Softwaresite=LOOKSMART
IBM Data Mining Project/Group Quest http://www.almaden.ibm.com/cs/quest/
Data Mining Resources http://psychology.about.com/cs/datamining/index.htm?iam=excite_1terms=data+mining
Data Mining, Text Mining and Web Mining Software http://www.megaputer.com/
Data Mining and Data Warehousing Links http://databases.about.com/cs/datamining/index.htm?iam=excite_1terms=data+mining
Data Mining Software: EDM DMSK http://www.data-miner.com/
Data Mining and Knowledge Discovery In Databases http://db.cs.sfu.ca/sections/publication/kdd/kdd.html
DM Review: Strategic Solutions For Business Intelligence http://www.dmreview.com/
Data, Text and Web Mining http://internet.about.com/cs/datamining/index.htm?iam=excite_1terms=data+mining
First SIAM International Conference On Data Mining http://www.siam.org/meetings/sdm01/
Data Mining 2002 International Conference On Data Mining Methods and Databases For Engineering http://www.wessex.ac.uk/conferences/2002/datamining02/
SIGKDD - ACM Special Interest Group On Knowledge Discovery and Data Mining http://www.acm.org/sigkdd/
Data Mining News http://www.idagroup.com/
NCDM National Center For Data Mining http://www.ncdm.uic.edu/
Data Mining Benchmarking Association (DMBA) http://www.dmbenchmarking.com/
Data Mining In Molecular Biology http://industry.ebi.ac.uk/~brazma/dm.html
Data Mining and Machine Learning http://www.cs.helsinki.fi/research/fdk/datamining/
NCBI Tools For Data Mining http://www.ncbi.nlm.nih.gov/Tools/
Guide Your Organization's Future With Data Mining http://www.spss.com/spssbi/applications/datamining/
URLs For Data Mining http://www.galaxy.gmu.edu/stats/syllabi/DMLIST.html
Generate maximum return on data in minimum time with Clementine http://www.spss.com/spssbi/clementine/
ICDM'02 The 2002 IEEE International Conference On Data Mining http://kis.maebashi-it.ac.jp/icdm02/
DMI: Data Mining Institute http://www.cs.wisc.edu/dmi/
Data Mining On The Web http://www.webtechniques.com/archives/2000/01/greening/
Data Mining Lecture Notes http://www-db.stanford.edu/~ullman/mining/mining.html
ITSC Data Mining Center http://datamining.itsc.uah.edu/
Imperial College Data Mining Research Group http://ruby.doc.ic.ac.uk/
Knowledge Discovery Data Mining Foundation http://www.kdd.org/
Untangling Text Data Mining http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
Directory Of Data Warehouse, Data Mining and Decision Support Resources http://www.infogoal.com/dmc/dmcdwh.htm
Data Mining Techniques http://www.statsoftinc.com/textbook/stdatmin.html
Knowledge Discovery In Biology and Medicine http://bioinfo.weizmann.ac.il/cards/knowledge.html
SAS Analytic Intelligence Data & Text Mining http://www.sas.com/technologies/data_mining/
Analysis of Data Mining Algorithms http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm
BIOKDD, 2001 Workshop On Data Mining In Bioinformatics http://www.cs.rpi.edu/~zaki/BIOKDD01/
Advances In Knowledge Discovery and Data Mining http://www.aaai.org/Press/Books/Fayyad/fayyad.html
Online Program In Data Mining http://www.ccsu.edu/datamining/
Data Mining: Concepts & Techniques (Book) 2000 http://www.cs.sfu.ca/~han/DM_Book.html
Tutorial On High Performance Data Mining http://www-users.cs.umn.edu/~mjoshi/hpdmtut/
GMDH Group Method Of Data Handling http://www.gmdh.net/
The Serendip Data Mining Project http://www.bell-labs.com/project/serendip/
Data Mining Forum http://www.data-mining-forum.de/
Open Directory: Data Mining http://dmoz.org/Computers/Software/Databases/Data_Mining/
Data Warehouse Information Center - Data Mining http://www.dwinfocenter.org/datamine.html
Data Mining Magazine http://www.mining.dk/
Data Mining Server http://dms.irb.hr/
NAG Data Mining Components to Create Critical Competitive Advantage http://www.nag.co.uk/numeric/DR/drdescription.asp
Data Mining and Multidimensional Analysis http://www.ics.uci.edu/~eppstein/gina/datamine.html
ADC's Data Mining Resources For Space Science http://adc.gsfc.nasa.gov/adc/adc_datamining.html
Laboratory For Knowledge Discovery In Databases (KDD) http://www.kddresearch.org/Groups/Data-Mining/
NCSA Data, Mining and Visualization http://archive.ncsa.uiuc.edu/DMV/
CRoss Industry Standard Process For Data Mining http://www.crisp-dm.org/
International Workshop On Visual Data Mining http://www-staff.it.uts.edu.au/~simeon/vdm_pkdd2001/
Mathematical Challenges In Scientific Data Mining http://www.ipam.ucla.edu/programs/sdm2002/
Mining Customer Data http://www.db2mag.com/db_area/archives/1998/q3/98fsaar.shtml
Constraint-Based Multidimensional Data Mining http://www-sal.cs.uiuc.edu/~hanj/pdf/computer99.pdf
What is data mining? http://www.seamlessit.com/documents/DataMiner/DM2002-05-24A.htm
Data mining: techniques and applications http://www.seamlessit.com/documents/DataMiner/DM2002-05-24B.htm
Data mining boosts competitiveness http://www.cai.com.cn/suc_story/0426.htm
Data mining discussion group http://www.dmgroup.org.cn/
Applications of data mining in CRM http://www.chinabyte.com/20020726/1622396.shtml
Open Miner data mining tool http://www.neusoft.com/UploadFile/0.4.3/217/217.htm
Data Mining: Concepts and Techniques (reprint edition) http://www.hep.edu.cn/books/computer/photocopy/20.html
Exploring applications of data mining in scientific databases http://www.sdb.ac.cn/thesis/thesis5/paper/p6.doc
Data mining overview (part 1) http://www.ccf-dbs.org.cn/pages_c/datamining1.htm
Data mining overview (part 2) http://www.ccf-dbs.org.cn/pages_c/datamining2.htm
The core role of data mining in CRM http://www.cndata.com/sjyw/dcd_knowlege/texts/article491.asp
Web data mining http://www.pcworld.com.cn/2000/back_issues/2014/1436a.asp
Building CRM-oriented data mining applications (2001, Posts & Telecom Press) http://www.e-works.net.cn/business/category18/126700621324531250.html
Applications of data mining in CRM http://www.e-works.net.cn/ewkArticles/Category38/Article9809.htm
Data mining and the use of its tools http://eii.dlrin.edu.cn/zjlw/zhlw17.htm
Data mining: a new field with great prospects http://www.creawor.com/biforum/bi_02.htm
The research status of data mining http://www.creawor.com/biforum/bi_03.htm
Data mining: a new era for database technology http://www.china-pub.com/computers/emook/1188/info.htm
XML and Web-oriented data mining techniques http://www.aspcool.com/lanmu/browse1.asp?ID=719bbsuser=xml , http://www.swm.com.cn/rj/2000-10/25.htm , http://www.ccidnet.com/tech/web/2001/09/04/58_3176.html
Data mining discussion site of the Shanghai Computer Society http://scs.stc.sh.cn/main/sjwj.htm
Data mining and statistical work http://www.bjstats.gov.cn/zwxx/wzxw/zzwz/200207020115.htm
Data warehouses, data marts and data mining http://eii.dlrin.edu.cn/zjlw/zhlw16.htm
Data mining: a basic tool librarians should master http://www.zslib.com.cn/xhlw/wk.doc
An overview of data mining techniques http://www.china-pub.com/computers/emook/0903/info.htm
Data mining and its application in engineering diagnosis (PhD thesis) http://www.monitoring.com.cn/papers/GaoYilong_C_D.htm

From a CSDN blog; please credit the source when reposting: http://blog.csdn.net/evane1890/archive/2007/12/19/1954152.aspx
Original post: http://www.sciencetimes.com.cn/m/user_content.aspx?id=242094
Category: Data Mining | 34 views | 0 comments
[Repost] Datasets for Data Mining
openmind 2010-8-7 10:01
Datasets for Data Mining: http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html