http://www.bioon.com/biology/cancer/409960.shtml Source: Ebiotrade (生物通), 2009-9-24 9:38:06

JCO: A new method improves the accuracy of cancer prognosis prediction

Professor Tie-Hua Rong and colleagues at the Cancer Center of Sun Yat-sen University have shown that tumor molecular information combined with data mining methods can predict whether a patient with early-stage non-small cell lung cancer (NSCLC) will die within five years, with an overall accuracy of 87.2%. The results were published in the Journal of Clinical Oncology, whose impact factor is 15.484.

Lung cancer is divided mainly into small cell and non-small cell types; of the two, NSCLC accounts for about 75% of cases and is the leading cause of lung cancer-related death. Even for early-stage NSCLC patients with good surgical outcomes, the five-year survival rate is only 40%-70%, meaning that 30%-60% of patients suffer local recurrence or distant metastasis within five years. The pTNM staging system now in wide clinical use cannot accurately predict the prognosis of NSCLC patients, let alone provide individualized predictions.

Professor Rong's team began exploring new methods in 1996. Using tissue microarrays and immunohistochemistry, they measured more than 30 molecular markers potentially related to prognosis in a large sample of early-stage lung cancers, combined these with the patients' clinicopathological features and follow-up data, and, in collaboration with data mining experts at the University of Science and Technology of China, used support vector machines to build and validate three individualized prognostic models for early-stage lung cancer. Immunohistochemistry is stable and reproducible, places modest demands on specimen handling, and is relatively inexpensive. The work has already received preliminary recognition from international peers.

The main reason the work has been recognized at home and abroad is that comparable gene-based cancer prediction tests are very expensive: the 70-gene breast cancer test used clinically in the United States costs 4,200 US dollars, and its gene signature differs from that of the Chinese population. The assay used by Professor Rong's group to predict the prognosis of early-stage NSCLC costs only a few hundred RMB, which favors its dissemination and application.

Once the technique matures, every lung cancer patient could have his or her five-year survival predicted after surgery. Patients with a good predicted prognosis could be spared radiotherapy and chemotherapy, reducing suffering and cost, while patients with a poor predicted prognosis could be considered for timely adjuvant chemotherapy, radiotherapy, or biological therapy, at a cost far below that of the gene tests available abroad. (Bioon.com)

Recommended original source: Journal of Clinical Oncology, 10.1200/JCO.2009.24.0929. Reply to F.C. Detterbeck. Tie-Hua Rong and Zhi-Hua Zhu, Cancer Center of Sun Yat-Sen University, Guangzhou, People's Republic of China.
Clustering is an important data mining technique for discovering the distribution of data and its hidden patterns. As a common data analysis tool and unsupervised machine learning method, clustering aims to partition a data set into classes (or clusters) so that data within a class are as similar as possible while data in different classes are as different as possible. By the basic idea underlying the algorithm, clustering methods fall roughly into five categories: partitioning, hierarchical, density-based, grid-based, and model-based. Research on clustering algorithms continues to deepen; kernel clustering and spectral clustering in particular have attracted wide attention in recent years.

The main idea of kernel clustering is to map the data points in the input space into a high-dimensional feature space through a nonlinear mapping, choose a suitable Mercer kernel function to replace the inner product of the nonlinear mapping, and perform the clustering in the feature space. The method is general and improves considerably on classical clustering methods. The nonlinear mapping increases the probability that the data points become linearly separable, i.e., useful features can be better distinguished, extracted, and amplified, yielding more accurate clustering, and the algorithms also converge quickly. Where classical clustering algorithms fail, kernel clustering often still obtains good results.

Support vector clustering (SVC) is a kernel clustering method that uses the support vector machine (SVM) as its tool. It is an unsupervised, nonparametric clustering algorithm developed by Ben-Hur et al. from the Gaussian-kernel-based SVDD (Support Vector Domain Description) algorithm. Its basic idea is this: using a Gaussian kernel, map the data points from data space into a high-dimensional feature space; in the feature space, find the sphere of smallest radius that encloses the images of all data points; mapping this sphere back into data space yields a set of contours enclosing the data points. These contours are the cluster boundaries, and the points enclosed by each closed contour belong to the same cluster.

The SVC algorithm consists of two stages: training and cluster assignment. The training stage includes choosing the width of the Gaussian kernel, computing the kernel matrix, computing the Lagrange multipliers, selecting the support vectors, and computing the radius of the sphere in the high-dimensional feature space. The cluster assignment stage first builds an adjacency matrix and then assigns clusters according to it.

SVC has two notable advantages: it can produce cluster boundaries of arbitrary shape, and it can handle noisy data and separate overlapping clusters, which many clustering algorithms cannot do. But SVC still has two bottlenecks: computing the Lagrange multipliers and computing the adjacency matrix. Of the two, the latter consumes far more computation time, so many newer SVC algorithms aim precisely at improving the efficiency of the adjacency matrix computation.

References
Xu R, Wunsch D. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, 16(3): 645-678.
Han J, Kamber M. Data Mining: Concepts and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2006.
Filippone M, Camastra F, Masulli F, Rovetta S. A survey of kernel and spectral methods for clustering. Pattern Recognition, 2008, 41(1): 176-190.
Zhang L, Zhou W D, Jiao L C. Kernel clustering algorithm. Chinese Journal of Computers, 2002, 25(6): 587-590. (in Chinese)
Burges C J C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998, 2(2): 121-167.
Tax D M J, Duin R P W. Support vector domain description. Pattern Recognition Letters, 1999, 20(11-13): 1191-1199.
Ben-Hur A, Horn D, Siegelmann H T, Vapnik V. Support vector clustering. Journal of Machine Learning Research, 2001, 2: 125-137.
Schölkopf B, Williamson R, Smola A, Shawe-Taylor J, Platt J. Support vector method for novelty detection. Advances in Neural Information Processing Systems 12,
2000: 582-588.
Lü C K, Jiang C Y, Wang N S. A fast algorithm for support vector clustering. Journal of South China University of Technology, 2005, 33(1): 6-9. (in Chinese)
Lee J, Lee D. An improved cluster labeling method for support vector clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(3): 461-464.
Camastra F, Verri A. A novel kernel method for clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(5): 801-805.
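The two SVC stages described above can be sketched end to end. The following is a minimal illustration, not the implementation from any of the cited papers: it solves the SVDD dual with a generic SciPy solver rather than a dedicated QP routine, estimates the sphere radius from the free support vectors, and builds the adjacency matrix by sampling points along the segment between each pair of data points. All function names and parameter values (the kernel width q, the bound C, the number of segment samples) are choices made for this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, Y, q):
    """K(x, y) = exp(-q * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-q * d2)

def svc_train(X, q=0.5, C=1.0):
    """SVC training stage: solve the SVDD dual for the Lagrange
    multipliers beta, then compute the feature-space sphere radius."""
    n = len(X)
    K = gaussian_kernel(X, X, q)  # kernel matrix
    # Dual: minimize beta^T K beta (K(x,x)=1 is constant here)
    # subject to 0 <= beta_i <= C and sum(beta) = 1.
    res = minimize(lambda b: b @ K @ b,
                   np.full(n, 1.0 / n),
                   jac=lambda b: 2.0 * K @ b,
                   bounds=[(0.0, C)] * n,
                   constraints={'type': 'eq', 'fun': lambda b: b.sum() - 1.0},
                   method='SLSQP')
    beta = res.x
    def dist2(Y):
        """Squared feature-space distance from phi(y) to the sphere center."""
        return 1.0 - 2.0 * gaussian_kernel(Y, X, q) @ beta + beta @ K @ beta
    # Free support vectors (0 < beta < C) lie exactly on the sphere.
    sv = (beta > 1e-6) & (beta < C - 1e-6)
    R2 = dist2(X[sv]).mean()
    return beta, dist2, R2

def svc_assign(X, dist2, R2, n_samples=12):
    """Cluster assignment stage: two points are adjacent if the whole
    segment between them stays inside the sphere; clusters are the
    connected components of the adjacency graph."""
    n, eps = len(X), 1e-3            # small slack for solver noise
    adj = np.eye(n, dtype=bool)
    ts = np.linspace(0.0, 1.0, n_samples + 2)[1:-1]
    for i in range(n):
        for j in range(i + 1, n):
            seg = X[i] + ts[:, None] * (X[j] - X[i])
            adj[i, j] = adj[j, i] = bool((dist2(seg) <= R2 + eps).all())
    labels, current = -np.ones(n, dtype=int), 0
    for i in range(n):               # label components by DFS
        if labels[i] < 0:
            stack = [i]
            while stack:
                k = stack.pop()
                if labels[k] < 0:
                    labels[k] = current
                    stack.extend(np.flatnonzero(adj[k] & (labels < 0)))
            current += 1
    return labels
```

On two well-separated point clouds this recovers two clusters; the O(n^2) segment test over all pairs is exactly the adjacency-matrix bottleneck mentioned in the text.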
Data Mining with Ontologies: Implementations, Findings, and Frameworks Source: https://igi-pub.com/reference/details.asp?ID=6844v=preface Edited By: Hector Oscar Nigro, Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina; Sandra Elizabeth Gonzalez Cisaro, Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina; Daniel Hugo Xodo, Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina Preface: Data mining, also referred to as knowledge discovery in databases (KDD), is the process of finding new, interesting, previously unknown, potentially useful, and ultimately understandable patterns in very large volumes of data. Data mining is a discipline that brings together database systems, statistics, artificial intelligence, machine learning, parallel and distributed processing, and visualization, among other disciplines (Fayyad et al., 1996; Han & Kamber, 2001; Hernández Orallo et al., 2004). One of the most important and challenging problems in data mining today is the definition of prior knowledge, which can originate from the process or from the domain. This contextual information can help select the appropriate information, features, or techniques; shrink the hypothesis space; represent the output in a more comprehensible way; and improve the whole process. We therefore need a conceptual model to represent this knowledge. Under Gruber's definition of an ontology (an explicit formal specification of the terms in a domain and the relations among them; Gruber, 1993, 2002), we can represent both knowledge about the knowledge discovery process and knowledge about the domain. Ontologies are principally used for communication (between machines and/or humans), automated reasoning, and the representation and reuse of knowledge (Cimiano et al., 2004). An ontological foundation is thus a precondition for the efficient automated use of knowledge discovery information.
Thus, we can view the relation between ontologies and data mining in two directions:
From ontologies to data mining: we incorporate knowledge into the process through ontologies, i.e., how the experts understand and carry out the analysis tasks. Representative applications are intelligent assistants for the discovery process (Bernstein et al., 2001, 2005), interpretation and validation of mined knowledge, and ontologies for resource and service description and knowledge Grids (Cannataro et al., 2003; Brezany et al., 2004).
From data mining to ontologies: we include domain knowledge in the input information or use ontologies to represent the results, so the analysis is performed over these ontologies. The most characteristic applications are in medicine, biology, and spatial data, such as gene representation, taxonomies, applications in the geosciences, medical applications, and especially evolving domains (Langley, 2006; Gottgtroy et al., 2003, 2005; Bogorny et al., 2005).
When we can represent and include knowledge in the process through ontologies, we can transform data mining into knowledge mining.
Data Mining with Ontologies Cycle
Figure 1 shows our vision of the data mining with ontologies cycle.
Metadata ontologies: these ontologies establish how a variable is constructed, i.e., which process allowed us to obtain its value and whether that value could vary under another method. This ontology must also express general information about how the variable is treated.
Domain ontologies: these ontologies express the knowledge about the application domain.
Ontologies for the data mining process: these ontologies codify all knowledge about the process, i.e., selecting features, selecting the best algorithms according to the variables and the problem, and establishing valid process sequences (Bernstein, 2001, 2005; Cannataro, 2003, 2004).
According to Gómez-Pérez and Manzano-Macho (2003), the methods and approaches that extract ontologies or semantics from database schemas can be classified along three dimensions: main goal, techniques used, and sources used for learning. In summary, ontology learning methods over relational schemas break down as follows:
Main goal: map a relational schema to a conceptual schema; create (and refine) an ontology; create ontological instances (from a database); enhance ad hoc queries.
Techniques used: mappings; reverse engineering; inductive inference; rule generation; graphic modeling.
Sources used for learning: relational schemas (of a database); schemas of domain-specific databases; flat files; relational databases.
In the next paragraphs we explain these three classes of ontologies in more detail, based on earlier work from different knowledge fields.
Domain Ontology
The models with which many scientists represent their working hypotheses are generally cause-effect diagrams. Models use general laws or theories to predict or explain behavior in specific situations. Such cause-effect diagrams can now be translated into ontologies without difficulty by means of concept maps, organized as a taxonomy of central concepts, main concepts, secondary concepts, and specific concepts. Discovery systems produce models that are valuable for prediction, but they should also produce models stated in some declarative format that can be communicated clearly and precisely, helping people understand observations in terms they find familiar (Bridewell, 2006; Langley, 2002, 2006). Models can take different forms and different levels of abstraction, but the more complex the facts for which they account, the more important it is that they be cast in a formal notation with an unambiguous interpretation.
Of course, this new knowledge can then be easily communicated and updated between systems and knowledge bases. Within data mining in particular, knowledge can be represented in different formalisms, e.g., rules, decision trees, or clusters, collectively known as models. Discovery systems should generate knowledge in a format familiar to domain users. There is an important relation between knowledge structures and the discovery process in machine learning: knowledge structures are important outputs of the discovery process, and they are also important inputs to discovery (Langley, 2000). Thus knowledge plays as crucial a role as data in the automation of discovery, and ontologies provide a structure capable of supporting the representation of domain knowledge.
Metadata Ontologies
As Spyns et al. (2002) affirm, ontologies in current computer science parlance are computer-based resources that represent agreed domain semantics. Unlike data models, the fundamental asset of ontologies is their relative independence of particular applications: an ontology consists of relatively generic knowledge that can be reused by different kinds of applications and tasks. By contrast, a data model represents the structure and integrity of the data elements of the (in principle single) specific enterprise application(s) by which it will be used, so the conceptualization and vocabulary of a data model are not intended a priori to be shared by other applications (Gottgtroy et al., 2005). Similarly, in data modeling practice, the semantics of a data model often constitute an informal accord between the developers and the users of the model (including when a data warehouse is designed), and in many cases the data model is updated as it evolves, when particular new functional requirements pop up, without any significant update of the metadata repository. Ontology models and data models nonetheless have similarities in terms of scope and task.
Both are context-dependent knowledge representations; that is, there is no strict line between generic and specific knowledge when you are building an ontology. Moreover, both modeling techniques are knowledge-acquisition-intensive tasks, and the resulting models represent partial accounts of conceptualizations (Gottgtroy et al., 2003). Despite the differences, we should exploit the similarities, and the fact that data models carry a great deal of useful hidden knowledge about the domain in their data schemas, in order to build ontologies from data and improve the process of knowledge discovery in databases. The fact that data schemas lack the semantic knowledge required to intelligently guide ontology construction has been presented as a challenge for database and ontology engineers (Gottgtroy et al., 2003).
Ontologies for the Data Mining Process
The vision of the KDD process has changed over time. In the beginning, the main objective was to extract a valuable pattern from a flat file by trial and error. As time went by, researchers and, above all, practitioners came to discuss the importance of a priori knowledge, of knowledge and understanding of the problem, of the choice of discovery methodology, and of expertise from similar situations, and an important question arose: to what extent is such an investment in data mining projects worthwhile? As practitioners and researchers in this field, we can attest that expertise is very important and that domain knowledge is helpful and simplifies the process. To make the process more attractive to managers, practitioners must carry it out more efficiently and reuse experience; hence we can codify statistical and machine learning knowledge with ontologies and put it to use. Bernstein et al. (2001) developed the concept of the Intelligent Discovery Assistant (IDA), which helps data miners explore the space of valid data mining processes.
It takes advantage of an explicit ontology of data mining techniques, which defines the various techniques and their properties. Its main characteristics are (Bernstein et al., 2005):
A systematic enumeration of valid DM processes, so users do not miss important, potentially fruitful options.
Effective rankings of these valid processes by different criteria, to help users choose among the options.
An infrastructure for sharing data mining knowledge, which leads to what economists call network externalities.
Cannataro and colleagues made another interesting contribution to this kind of ontology. They developed an ontology that simplifies the development of distributed knowledge discovery applications on the Grid, offering a domain expert a reference model for the different kinds of data mining tasks, methodologies, and software available to solve a given problem, and helping the user find the most appropriate solution (Cannataro et al., 2003, 2004). The authors adopted the Enterprise Methodology (Corcho et al., 2003).
Research Works on the Topic
The next paragraphs describe the most recent research in the field of data mining with ontologies. Singh, Vajirkar, and Lee (2003) developed a context-aware data mining framework that adds accuracy and efficacy to data mining outcomes. Context factors are modeled using an ontological representation; although the framework is generic in nature and can be applied to most fields, the medical scenario provided serves as a proof of concept of the proposed model. Hotho, Staab, and Stumme (2003) showed that using ontologies as filters in term selection, prior to applying a K-means clustering algorithm, increases the tightness and relative isolation of document clusters as a measure of improvement. Pan and Shen (2005) proposed an architecture for knowledge discovery in evolving environments.
The architecture creates a communication mechanism that incorporates known knowledge into the discovery process through an ontology service facility. The continuous mining is transparent to the end user; moreover, the architecture supports logical and physical data independence. Rennolls (2005, p. 719) developed an intelligent framework for data mining, knowledge discovery, and business intelligence: the ontological framework guides the user in choosing models from an expanded data mining toolkit, and the epistemological framework assists the user in interpreting and appraising the discovered relationships and patterns. On domain ontologies, Pan and Pan (2006) proposed the Ontobase ontology repository, an implementation that allows users and agents to retrieve ontologies and metadata through open Web standards and an ontology service. Key features of the system include the use of XML Metadata Interchange to represent and import ontologies and metadata, support for smooth transformation and transparent integration through ontology mapping, and the use of ontology services to share and reuse domain knowledge in a generic way. Recently, Bounif et al. (2006) articulated a new approach to database schema evolution that relies on a domain ontology. The approach belongs to a new tendency, that of a priori approaches: it investigates potential future requirements, in addition to the current requirements, during the standard requirements analysis phase of schema design or redesign, and includes them in the conceptual schema. Those requirements are determined with the help of a domain ontology called a "requirements ontology", using data mining techniques and a schema repository.
Book Organization
This book is organized into three major sections dealing, respectively, with implementations, findings, and frameworks.
Section I: Implementations includes applications and case studies of data mining with ontologies.
Chapter I, TODE: An Ontology-Based Model for the Dynamic Population of Web Directories, by Sofia Stamou, Alexandros Ntoulas, and Dimitris Christodoulakis, studies how we can organize the continuously proliferating Web content into topical categories, also known as Web directories. The authors have implemented a system named TODE that uses a Topical Ontology for Directories' Editing. TODE's performance is evaluated; experimental results imply that the use of a rich topical ontology significantly increases classification accuracy for dynamic content.
Chapter II, Raising, to Enhance Rule Mining in Web Marketing with the Use of an Ontology, by Xuan Zhou and James Geller, introduces Raising, an operation used as a preprocessing step for data mining. Rules are derived using demographic and interest information as input. The Raising step takes advantage of an interest ontology to advance data mining and improve rule quality. The effects of Raising are analyzed in detail, showing an improvement in the support and confidence values of useful association rules for marketing purposes.
Chapter III, Web Usage Mining for Ontology Management, by Brigitte Trousse, Marie-Aude Aufaure, Bénédicte Le Grand, Yves Lechevallier, and Florent Masseglia, proposes an original approach to ontology management in the context of Web-based information systems. Their approach relies on usage analysis of the chosen Web site, complementing the existing approaches based on content analysis of Web pages. One major contribution of this chapter is the application of usage analysis to support ontology evolution and/or Web site reorganization.
Chapter IV, SOM-Based Clustering of Multilingual Documents Using an Ontology, by Minh Hai Pham, Delphine Bernhard, Gayo Diallo, Radja Messai, and Michel Simonet, presents a method that uses Self-Organizing Maps (SOM) to cluster medical documents. The originality of the method is that it relies not on the words shared by documents but on concepts taken from an ontology. The goal is to cluster various medical documents into thematically consistent groups. The authors compare the results of two indexing schemes: stem-based indexing and conceptual indexing.
Section II: Findings comprises more theoretical aspects of data mining with ontologies, such as ontologies for interpretation and validation, and domain ontologies.
Chapter V, Ontology-Based Interpretation and Validation of Mined Knowledge: Normative and Cognitive Factors in Data Mining, by Ana Isabel Canhoto, addresses the role of cognition and context in the interpretation and validation of mined knowledge. She proposes the use of ontology charts and norm specifications to map how varying levels of access to information and exposure to specific social norms lead to divergent views of mined knowledge. Domain knowledge and bias information influence which patterns in the data are deemed useful and, ultimately, valid.
Chapter VI, Data Integration Through Protein Ontology, by Amandeep S. Sidhu, Tharam S. Dillon, and Elizabeth Chang, discusses the conceptual framework of a Protein Ontology that has: a hierarchical classification of concepts represented as classes, from general to specific; a list of attributes related to each concept, for each class; a set of relations between classes, to link concepts in the ontology in more complicated ways than implied by the hierarchy and to promote reuse of concepts; and a set of algebraic operators for querying protein ontology instances.
Chapter VII, TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology, by Josiane Mothe and Nathalie Hernandez, introduces a method that reuses a thesaurus built for a given domain to create new resources of a higher semantic level in the form of an ontology. The originality of the method is that it is based both on knowledge extracted from a thesaurus and on knowledge semi-automatically extracted from a textual corpus. In parallel, the authors have developed mechanisms based on the obtained ontology to accomplish a science-monitoring task; an example is provided in the chapter.
Chapter VIII, Evaluating the Construction of Domain Ontologies for Recommender Systems Based on Texts, by Stanley Loh, Daniel Lichtnow, Thyago Borges, and Gustavo Piltcher, investigates different aspects of constructing a domain ontology for a content-based recommender system. The chapter discusses different approaches to constructing the domain ontology, including the use of text mining software tools for supervised learning, the involvement of domain experts in the engineering process, and the use of a normalization step.
Section III: Frameworks includes different architectures for different domains in the context of data warehousing or mining with ontologies.
Chapter IX, by Vania Bogorny, Paulo Martins Engel, and Luis Otavio Alvares, introduces the problem of mining frequent geographic patterns and spatial association rules from geographic databases. A large number of natural geographic associations are explicitly represented in geographic database schemas and geo-ontologies but had not so far been used in frequent geographic pattern mining. The main goal of this chapter is to show how the large amount of knowledge represented in geo-ontologies can be used as prior knowledge to avoid extracting patterns known in advance to be uninteresting.
Chapter X, Ontology-Based Construction of Grid Data Mining Workflows, by Peter Brezany, Ivan Janciak, and A Min Tjoa, introduces an ontology-based framework for the automated construction of complex interactive data mining workflows. The authors present their solution, the GridMiner Assistant (GMA), which addresses the whole life cycle of the knowledge discovery process. Conceptual and implementation architectures of the framework are presented, and its application is illustrated with an example from the medical domain.
Chapter XI, Ontology-Based Data Warehousing and Mining Approaches in Petroleum Industries, by Shastri L. Nimmagadda and Heinz Dreher. Complex geospatial heterogeneous data structures complicate the accessibility and presentation of data in petroleum industries. A data warehousing approach supported by ontologies is described for effective data mining: an ontology-based data warehousing framework with fine-grained multidimensional data structures facilitates the mining and visualization of the data patterns, trends, and correlations hidden under massive volumes of data.
Chapter XII, A Framework for Integrating Ontologies and Pattern-Bases, by Evangelos Kotsifakos, Gerasimos Marketos, and Yannis Theodoridis, proposes the integration of pattern base management systems (PBMS) and ontologies as a solution to the need of many scientific fields for efficient extraction of useful information from large databases and exploitation of the resulting knowledge. The authors use a case study of data mining over scientific (seismological) data to illustrate their proposal.
Book Objective
This book aims to publish original academic work of high scientific quality. The key objective is to give data mining students, practitioners, professionals, professors, and researchers an integral vision of the topic.
The book focuses specifically on areas that explore new methodologies or examine real case studies that are ontology-based. It describes the state of the art, innovative theoretical frameworks, advanced and successful implementations, and the latest empirical research findings in the area of data mining with ontologies.
Audience
The target audience of this book is readers who want to learn how to apply ontology-based data mining to real-world problems; the purpose is to show users how to go from theory and algorithms to real applications. The book is also geared toward students, practitioners, professionals, professors, and researchers with a basic understanding of data mining. The information technology community can extend its knowledge and skills with these new techniques, and people working in knowledge management, such as engineers, managers, and analysts, will also find it useful, since data mining, ontologies, and knowledge management are directly linked.
References
Bernstein, A., Hill, S., Provost, F. (2001). Towards intelligent assistance for the data mining process: An ontology-based approach. CeDER Working Paper IS-02-02, New York University.
Bernstein, A., Provost, F., Hill, S. (2005). Towards intelligent assistance for the data mining process: An ontology-based approach for cost-sensitive classification. IEEE Transactions on Knowledge and Data Engineering, 17(4), 503-518.
Bogorny, V., Engel, P. M., Alvares, L. O. (2005). Towards the reduction of spatial joins for knowledge discovery in geographic databases using geo-ontologies and spatial integrity constraints. In M. Ackermann, B. Berendt, M. Grobelnik, V. Svátek (Eds.), Proceedings of the ECML/PKDD Second Workshop on Knowledge Discovery and Ontologies (pp. 51-58).
Bounif, H., Spaccapietra, S., Pottinger, R. (2006, September 12-15). Requirements ontology and multirepresentation strategy for database schema evolution.
Paper presented at the 2nd VLDB Workshop on Ontologies-Based Techniques for Databases and Information Systems, Seoul, Korea.
Brezany, P., Janciak, I., Woehrer, A., Tjoa, A. M. (2004). GridMiner: A framework for knowledge discovery on the Grid, from a vision to design and implementation. Cracow Grid Workshop. Cracow, Poland: Springer.
Bridewell, W., Sánchez, J. N., Langley, P., Billman, D. (2006). An interactive environment for the modeling and discovery of scientific knowledge. International Journal of Human-Computer Studies, 64, 1009-1014.
Cannataro, M., Comito, C. (2003, May 20-24). A data mining ontology for Grid programming. Paper presented at the 1st Workshop on Semantics in Peer-to-Peer and Grid Computing, Budapest. Retrieved March 2006 from http://www.isi.edu/~stefan/SemPGRID
Cannataro, M., Congiusta, A., Pugliese, A., Talia, D., Trunfio, P. (2004). Distributed data mining on Grids: Services, tools, and applications. IEEE Transactions on Systems, Man and Cybernetics, Part B, 34(6), 2451-2465.
Cimiano, P., Stumme, G., Hotho, A., Tane, J. (2004). Conceptual knowledge processing with formal concept analysis and ontologies. In Proceedings of the Second International Conference on Formal Concept Analysis (ICFCA 04).
Corcho, O., Fernández-López, M., Gómez-Pérez, A. (2003). Methodologies, tools and languages for building ontologies: Where is their meeting point? Data & Knowledge Engineering, 46(1), 41-64. Amsterdam: Elsevier Science Publishers B.V.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (1996). Advances in knowledge discovery and data mining. Menlo Park, California: AAAI Press.
Gómez-Pérez, A., Manzano-Macho, D. (Eds.) (2003). Survey of ontology learning methods and techniques. Deliverable 1.5, OntoWeb Project Documentation. Universidad Politécnica de Madrid. Retrieved November 2006 from http://www.deri.at/fileadmin/documents/deliverables/Ontoweb/D1.5.pdf
Gottgtroy, P., Kasabov, N., MacDonell, S. (2003, December).
An ontology engineering approach for knowledge discovery from data in evolving domains. In Proceedings of Data Mining 2003, Data Mining IV. Boston: WIT Press.
Gottgtroy, P., MacDonell, S., Kasabov, N., Jain, V. (2005). Enhancing data analysis with ontologies and OLAP. Paper presented at Data Mining 2005, Sixth International Conference on Data Mining, Text Mining and their Business Applications, Skiathos, Greece.
Gruber, T. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199-220.
Gruber, T. (2002). What is an ontology? Retrieved November 2006 from http://www-ksl.stanford.edu/kst/what-is-an-ontology.html
Han, J., Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.
Hernández Orallo, J., Ramírez Quintana, M., Ferri Ramírez, C. (2004). Introducción a la Minería de Datos. Madrid: Editorial Pearson Educación SA.
Hotho, A., Staab, S., Stumme, G. (2003). Ontologies improve text document clustering. In Proceedings of the 3rd IEEE Conference on Data Mining, Melbourne, FL (pp. 541-544).
Langley, P. (2000). The computational support of scientific discovery. International Journal of Human-Computer Studies, 53, 393-410.
Langley, P. (2006). Knowledge, data, and search in computational discovery. Invited talk at the International Workshop on Feature Selection for Data Mining: Interfacing Machine Learning and Statistics (FSDM), April 22, 2006, Bethesda, Maryland, in conjunction with the 2006 SIAM Conference on Data Mining (SDM).
Pan, D., Shen, J. Y. (2005). Ontology service-based architecture for continuous knowledge discovery. In Proceedings of the International Conference on Machine Learning and Cybernetics, 4, 2155-2160. IEEE Press.
Pan, D., Pan, Y. (2006, June 21-23). Using an ontology repository to support data mining. In Proceedings of the Sixth World Congress on Intelligent Control and Automation, Dalian, China (pp. 5947-5951).
Rennolls, K. (2005).
An intelligent framework (O-SS-E) for data mining, knowledge discovery and business intelligence. Keynote paper. In Proceedings of the 2nd International Workshop on Philosophies and Methodologies for Knowledge Discovery (PMKD'05), in the DEXA'05 Workshops (pp. 715-719). IEEE Computer Society Press. ISBN 0-7695-2424-9.
Singh, S., Vajirkar, P., Lee, Y. (2003). Context-based data mining using ontologies. In Song, I., Liddle, S. W., Ling, T. W., Scheuermann, P. (Eds.), Proceedings of the 22nd International Conference on Conceptual Modeling. Lecture Notes in Computer Science (vol. 2813, pp. 405-418). Springer.
Spyns, P., Meersman, R., Jarrar, M. (2002). Data modelling versus ontology engineering. SIGMOD Record, Special Issue on Semantic Web, Database Management and Information Systems, 31.
Data grow several-fold every year, yet useful information seems to be shrinking. The field of data mining, which has emerged over the past 20 years, addresses exactly this problem. It is not only an important research area but also one with great potential value in real-world applications.

Data mining and knowledge discovery in databases (Data Mining and Knowledge Discovery in Databases, DMKDD) arose in the 1990s as a frontier of information technology. The field emerged against the background of data and databases growing far faster than people's ability to process and understand them, and it is the product of the convergence of databases, statistics, machine learning, optimization, computing technology, and other disciplines.

Knowledge discovery is the complex process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data mining is the step within knowledge discovery that generates particular patterns via specific algorithms within acceptable computational cost. Knowledge discovery is the full process, comprising data selection, data preprocessing, data transformation, data mining, and pattern evaluation, that finally yields knowledge, and data mining is a key step in it. Because of the importance of data mining to knowledge discovery, most current research on knowledge discovery concentrates on data mining algorithms and applications, and many researchers accordingly use "data mining" and "knowledge discovery" interchangeably, without a strict distinction.

The state of data mining research and practice today resembles that of database research and practice in the 1960s, when application programmers had to build a complete database environment each time they wrote a program. With the development of the relational data model, query processing and optimization techniques, transaction management strategies, and the SQL query language and interfaces, the environment is now entirely different. Over the coming decades, the development of data mining technology may well follow the trajectory of databases: making data mining much easier to use and to develop with.

References:
1. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
2. J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001. (2nd Edition, 2006)
3. M. H. Dunham. Data Mining: Introductory and Advanced Topics. Pearson Education, Inc., 2003. (Chinese translation by Guo Chonghui, Tian Fengzhan, Jin Xiaoming, et al. Tsinghua University Press, 2005.)
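The selection → preprocessing → transformation → mining → evaluation sequence described above can be made concrete with a toy sketch. All step names are illustrative, and the tiny k-means used as the "mining" step is only a stand-in for whatever algorithm a real project would choose:

```python
import numpy as np

def select(data, cols):
    """Data selection: keep only the task-relevant attributes."""
    return data[:, cols]

def preprocess(data):
    """Data preprocessing: drop records with missing values."""
    return data[~np.isnan(data).any(axis=1)]

def transform(data):
    """Data transformation: z-score each attribute."""
    return (data - data.mean(axis=0)) / data.std(axis=0)

def mine(data, k=2, n_iter=50):
    """Data mining step: a tiny k-means (Lloyd's algorithm)."""
    centers = data[np.linspace(0, len(data) - 1, k, dtype=int)]
    for _ in range(n_iter):
        d = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

def evaluate(data, labels, centers):
    """Pattern evaluation: within-cluster sum of squares (lower is better)."""
    return sum(((data[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centers))
```

Chaining the five functions on a raw array mirrors the KDD process in miniature: only the `mine` step does the actual pattern generation, while the surrounding steps carry the rest of the knowledge discovery workload.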
Statistical learning theory (SLT) is a body of statistical theory devoted specifically to learning from finite samples. It establishes a new framework for finite-sample statistical problems, in which the rules of inference not only account for asymptotic performance but also seek the best result obtainable from the limited information actually available. V. Vapnik and his colleagues began this line of work in the 1970s; by the mid-1990s, as the theory matured, and as neural-network methods showed little substantive theoretical progress, SLT began to attract increasingly broad attention. Built on a fairly solid theoretical foundation, it provides a unified framework for the finite-sample learning problem.

On top of SLT, a new general-purpose prediction method has been developed: the support vector machine (SVM). SVMs have already shown performance superior to existing methods in many preliminary studies; they subsume a number of established techniques (polynomial approximation, radial basis function methods, multilayer perceptrons) and promise to resolve problems that were previously hard to solve, such as choosing a neural network's architecture and escaping local minima. SLT and SVMs are becoming the major research focus succeeding neural networks, and can be expected to drive significant advances in data mining and machine learning.

References:
1. V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
2. V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., 1998.
3. B. E. Boser, I. Guyon, V. Vapnik. A training algorithm for optimal margin classifiers. In: D. Haussler (Ed.), Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, 144-152, ACM Press, 1992.
4. C. Cortes, V. Vapnik. Support-vector networks. Machine Learning, 1995, 20, 273-297.
5. C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998, 2(2), 121-167.
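To make the SVM idea concrete, here is a minimal linear soft-margin SVM trained by sub-gradient descent on the regularized hinge loss. This is a sketch under assumptions: the toy data, step size, and regularization constant are chosen for illustration, and the simple descent loop stands in for the quadratic-programming solvers used in the references above.

```python
import numpy as np

# Two well-separated Gaussian blobs as toy training data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

# Minimize  lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b)))
# by sub-gradient descent.
w, b = np.zeros(2), 0.0
lam, lr = 0.01, 0.1
for _ in range(200):
    viol = y * (X @ w + b) < 1            # margin violators
    if viol.any():
        w -= lr * (lam * w - (y[viol, None] * X[viol]).mean(axis=0))
        b += lr * y[viol].mean()
    else:
        w -= lr * lam * w                 # only the regularizer is active

acc = float(((X @ w + b) * y > 0).mean())
print(acc)  # 1.0 on this separable toy set
```

Only the margin violators contribute to the hinge-loss sub-gradient, which is the descent-view analogue of the fact that an SVM solution depends only on its support vectors.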
Independent component analysis (ICA) is a powerful data-analysis tool that has emerged in recent years (Hyvarinen, Karhunen and Oja, 2001; Roberts and Everson, 2001). Comon gave a relatively rigorous mathematical definition of ICA in 1994; the underlying idea was first proposed by Herault and Jutten in 1986. Although ICA is young, it is receiving growing attention both in theory and in application, and has become an active research topic internationally. Its application prospects are especially broad, currently spanning blind source separation, image processing, speech recognition, communications, biomedical signal processing, functional brain imaging, fault diagnosis, feature extraction, financial time-series analysis, and data mining.

ICA is a method for uncovering the hidden factors or components in multivariate statistical data, and can be regarded as an extension of principal component analysis (PCA) and factor analysis. In the blind source separation setting, ICA separates, or approximately separates, the source signals given only the mixed observations, without knowledge of the sources, the noise, or the mixing process.

References
1. Hyvarinen A, Karhunen J, Oja E. (2001). Independent Component Analysis. John Wiley, New York.
2. Roberts S J, Everson R. (2001). Independent Component Analysis: Principles and Practice. Cambridge University Press.
3. Comon P. Independent component analysis, a new concept? Signal Processing, 1994, 36: 287-314.
4. Herault J, Jutten C. Space or time adaptive signal processing by neural network models. International Conference on Neural Networks for Computing. Utah, USA, 1986.
Image source: http://amouraux.webnode.com/research-interests/research-interests-erp-analysis/blind-source-separation-bss-of-erps-using-independent-component-analysis-ica/
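The blind source separation setting can be demonstrated end to end: mix two known signals with an (in practice unknown) matrix, then recover them from the mixtures alone. The sketch below is a bare-bones FastICA (centering, whitening, then a tanh fixed-point iteration with deflation); the signals, mixing matrix, and all constants are illustrative assumptions, not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.vstack([np.sin(2 * t),              # source 1: sinusoid
               np.sign(np.sin(3 * t))])    # source 2: square wave
A = np.array([[1.0, 0.5],                  # mixing matrix, unknown to the
              [0.5, 1.0]])                 # algorithm in a real setting
X = A @ S                                  # observed mixtures only

# Center and whiten the observations.
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / X.shape[1])
Z = (E @ np.diag(d ** -0.5) @ E.T) @ X

# FastICA, deflation scheme with the tanh nonlinearity.
W = np.zeros((2, 2))
for i in range(2):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        g = np.tanh(Z.T @ w)                              # g(w'z)
        w_new = (Z * g).mean(axis=1) - (1 - g**2).mean() * w
        w_new -= W[:i].T @ (W[:i] @ w_new)                # decorrelate
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1) < 1e-12
        w = w_new
        if converged:
            break
    W[i] = w

S_est = W @ Z   # recovered sources, up to order and sign
```

Because ICA is blind to permutation and sign, success is judged by matching each true source to whichever recovered component correlates with it most strongly.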
SVM software: http://www.support-vector-machines.org/SVM_soft.html

SHOGUN is a machine learning toolbox focused on large-scale kernel methods, and especially on support vector machines (SVMs), with an emphasis on bioinformatics. It provides a generic SVM object interfacing to several different SVM implementations, each of which can be combined with any of the many kernels implemented. It can handle weighted linear combinations of sub-kernels, each of which need not operate on the same domain, and an optimal sub-kernel weighting can be learned using multiple kernel learning. Besides two-class SVM classification and regression, it implements a number of linear methods, such as linear discriminant analysis (LDA), the linear programming machine (LPM), and (kernel) perceptrons, as well as algorithms for training hidden Markov models. Input feature objects can be dense, sparse, or strings, of type int/short/double/char, and can be converted into different feature types. Chains of preprocessors (e.g. subtracting the mean) can be attached to each feature object, allowing on-the-fly preprocessing. SHOGUN comes in different flavours: a stand-alone version and interfaces to Matlab, R, Octave, Readline, and Python. This is the R package.
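The weighted sub-kernel combination that SHOGUN describes can be illustrated without SHOGUN itself. The sketch below fixes the two kernel weights by hand (where multiple kernel learning would learn them from data) and plugs the combined Gram matrix into a simple kernel perceptron; the data, kernel parameters, and weights are all assumptions for illustration, and the perceptron stands in for SHOGUN's SVM solvers.

```python
import numpy as np

# Toy two-class data: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.5, 0.5, (25, 2)), rng.normal(-1.5, 0.5, (25, 2))])
y = np.hstack([np.ones(25), -np.ones(25)])

# Two sub-kernels over the same inputs.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-0.5 * sq)          # Gaussian kernel
K_lin = X @ X.T                    # linear kernel

# Fixed convex combination; MKL would learn these weights instead.
K = 0.7 * K_rbf + 0.3 * K_lin

# Kernel perceptron: the learner touches the data only through K.
alpha = np.zeros(len(y))
for _ in range(20):
    for i in range(len(y)):
        if y[i] * ((alpha * y) @ K[:, i]) <= 0:   # misclassified: update
            alpha[i] += 1.0

acc = float((np.sign((alpha * y) @ K) == y).mean())
print(acc)  # 1.0 on this separable toy set
```

Because any positive combination of valid kernels is again a valid kernel, the learner downstream is unchanged; only the Gram matrix it consumes differs, which is exactly what makes sub-kernel weighting composable.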
Reposted from: http://bbs.byr.edu.cn/wForum/disparticle.php?boardName=PR_AIID=3229pos=12

I often recommend books on the TopLanguage discussion group, and often ask its resident experts to help me gather material. Artificial intelligence, machine learning, natural language processing, knowledge discovery (data mining in particular), and information retrieval are surely the most fun branches of CS, and they are all closely interrelated. Here I have gathered some recent learning resources on machine learning and AI.

First, two superb Wikipedia entries. I count as a heavy Wikipedia user: learning something new often starts at Wikipedia, passes through several rounds of Google, and ends at one or a few books.

The first entry is History of Artificial Intelligence. As I wrote on the discussion group:

The article I read today is the best I have come across on Wikipedia so far. "History of Artificial Intelligence" follows the AI timeline, weaving in countless stories of remarkable people, with dramatic twists on a grand scale; fact really is stranger than fiction. AI began in philosophical speculation, then went through a period without any help from psychology (cognitive neuroscience in particular), exploring purely through introspection, generalization from the outward behavior of human thought, and mathematical tools. The most thrilling episode is the automated theorem prover written by Herbert Simon (father of decision theory, Nobel laureate, polymath): it proved twenty-odd theorems of Russell's Principia Mathematica, one of them more elegantly than the book's own proof. Simon's program used heuristic search, since proof in an axiomatic system reduces to a tree search from premises to conclusion (though combinatorial explosion forces heuristic pruning). Simon later wrote GPS (General Problem Solver), which reportedly solved some well-formalized problems such as the Tower of Hanoi. In the end, though, Simon's work touched only one very small facet of human thought, formal logic, and more narrowly deductive reasoning (excluding inductive reasoning and transductive reasoning, commonly known as analogical thinking). Many other mysteries remain: common sense, vision, language above all, and consciousness. Another interesting thread is the argument that AI must rest on a physical body: a body that can sense the physical regularities of the world is itself a powerful source of information, from which humans continually and automatically distill common-sense knowledge (this is the "embodied mind" thesis). By contrast, the fellows who set out to build a common-sense knowledge base by hand are rather naive; knowledge acquisition through perception is a dynamic, continuously self-updating system, whereas a hand-built common-sense base is no different from the old expert-system approach. Of course, this summarizes only a small part that I personally found interesting or novel; everyone notices different things, and the entry also recounts, for example, the rise and fall of neural network theory in considerable detail. So I strongly suggest reading it yourself, and don't forget the links that lead elsewhere.

Incidentally, Xu You plans to find time to translate this entry. It is a very long one, so if the English is heavy going, wait for the translation :)

The second entry is Artificial Intelligence, along with Machine Learning and others. Starting from these entries you can reach a wealth of useful and reliable in-depth references.

Next, some books:
1. Programming Collective Intelligence: a good recent introduction; cultivating interest is the most important step, and opening with a heavyweight tome scares people off :P
2. Peter Norvig's AI: A Modern Approach, 2nd edition (the undisputed classic of the field).
3. The Elements of Statistical Learning: mathematically demanding; better as a reference.
4. Foundations of Statistical Natural Language Processing: the acknowledged NLP classic.
5. Data Mining: Concepts and Techniques: written by a Chinese-American scientist, remarkably accessible.
6. Managing Gigabytes: a good information retrieval book.
7. Information Theory: Inference and Learning Algorithms: a reference; fairly deep.

Related mathematical background (references, not for reading cover to cover):
1. Linear algebra: too many good references to list.
2. Matrix analysis: Matrix Analysis by Roger Horn, the undisputed classic of the field.
3. Probability and statistics: William Feller's An Introduction to Probability Theory and Its Applications. A tremendous book, but too mathematically flavored for machine learning work. Du Lei in the discussion group therefore recommended All of Statistics, saying: for machine learning, statistics matters just as much; All of Statistics is a concise CMU textbook that stresses concepts, simplifies calculation, and trims the statistical content irrelevant to machine learning, making it excellent quick-start material.
4. Optimization: Nonlinear Programming, 2nd edition for nonlinear programming, and Convex Optimization for convex optimization; other references can be found via the optimization entries on Wikipedia. Understanding the technical details of many machine learning methods (SVMs, for example) requires optimization as groundwork.

Wang Ning recommended several books:

Machine Learning, Tom Mitchell, 1997. An old book by a master. Its content does not look deep now, and many chapters stop at a sketch, but it suits beginners well (though not beginners so new they know no algorithms or probability). The decision tree chapter, for instance, is excellent, and since that area has not moved much in recent years, it has not dated. The book is also a grand survey of the preceding decades of machine learning up to 1997, and its bibliography is invaluable. There are Chinese translations and reprints; I do not know whether they are out of print.

Modern Information Retrieval, Ricardo Baeza-Yates et al., 1999. An old book by a master, apparently the first complete treatment of IR. IR has advanced so rapidly that the book is somewhat dated, but it is still worth flipping through as a reference. Ricardo now heads Yahoo Research for Europe and Latin America.

Pattern Classification (2nd edition), Richard O. Duda, Peter E. Hart, David G. Stork. A big tome from around 2001, available as a color reprint. I have not finished it, but anyone who wants to go deep in ML and IR should treat the first three chapters (introduction, Bayesian learning, linear classifiers) as required reading.

There are other classics I have only glanced at and am not qualified to judge. There are also two small paper-collection volumes that cover quite a few frontier topics and fine details, such as index compression; sadly I have forgotten their titles, and they are buried at the bottom of a box, unlikely to see daylight before my next move. (Ah, one comes back to me: Mining the Web: Discovering Knowledge from Hypertext Data.)

A word on a very famous book: Data Mining: Practical Machine Learning Tools and Techniques, written by the authors of Weka. Unfortunately the content is mediocre: the theory is too thin and the practical part is detached from real practice. There is no shortage of DM primers, so this one can be skipped; to learn Weka, the documentation suffices. A second edition is out; I have not read it and cannot comment.

On information retrieval, Du Lei again recommends: for IR these days, read Stanford's Introduction to Information Retrieval, just formally published and thoroughly up to date. Croft, the foremost figure in IR, is also writing a textbook, due out soon and said to be very practical.

For those interested in IR, I strongly recommend Dr. Zhai Chengxiang's summer school course at Peking University; full slides and reading material are here: http://net.pku.edu.cn/~course/cs410/schedule.html

maximzhao recommended one more machine learning book: add Bishop's Pattern Recognition and Machine Learning. No reprint edition, but it can be downloaded online. A classic among classics: Pattern Classification and this book are the two must-reads. Pattern Recognition and Machine Learning is recent (2007), lucid, and hard to put down.

Finally, on AI (decision and judgment in particular), two more interesting books: Simple Heuristics That Make Us Smart and Bounded Rationality: The Adaptive Toolbox. Unlike the statistical machine learning methods of computer science, these two focus on the cognitive strategies humans actually use. Here is the introduction I wrote on the discussion group:

Both were written collectively by the German ABC research group (an interdisciplinary team of computer scientists, cognitive scientists, neuroscientists, economists, mathematicians, and statisticians), and both drew wide attention in the field, especially the first; the second extends the model of human rationality proposed by Herbert Simon (father of decision science, Nobel laureate). Together they put the question of what real human intelligence is squarely on the table. The core idea: our brains simply cannot run massive statistical computations, deploying fancy mathematics to explain and predict the world; instead they face an uncertain world with simple, robust heuristics (the first book introduces two that later became famous, the recognition heuristic and Take the Best). The books do not reject statistical methods: with plenty of data the statistical advantage shows, while with little data statistical methods degrade badly; humans' simple heuristics exploit the regularities of the ecological environment, managing to be both computationally cheap and robust.

On the second book: 1. who Herbert Simon is; 2. what bounded rationality is; 3. what the book is about:

I have always found human decision and judgment a fascinating problem. This book can be read, briefly put, as a more comprehensive and more theoretical version of 《决策与判断》. It systematically presents the heuristics used in human decision and judgment, and their pros and cons: why they are fast, robust approximations to optimal methods under incomplete information, and why in some situations they lead to bad outcomes. (Anyone who has studied machine learning knows the analogues: naive Bayes often does no worse than a Bayesian network while being faster, and higher-degree polynomial interpolation overfits more easily while piecewise low-order spline interpolation proves remarkably robust.)

One example from the book is delightful. Two teams were asked to build a robot that catches a baseball thrown across the field. The first team did a detailed mathematical analysis and built a rather elaborate near-parabolic model (air resistance and the like keep the trajectory from being a true parabola) to compute the landing point and catch the ball correctly. The scheme was costly and the computation takes real time; as everyone knows, signals in biological neural networks travel at no more than about a hundred meters per second, so computational complexity is a scarce resource for organisms, and the scheme, though feasible, was not good. The second team instead interviewed real players about how they actually experience catching a ball, and built this robot: it does nothing for roughly the first half of the ball's flight, starts running once the ball is fairly close, and while running keeps the gaze angle between its eyes and the ball constant, which guarantees that its running path intersects the ball's trajectory; throughout, the robot performs only the crudest trajectory estimation. Notice how, when you catch a ball, your eyes stay fixed on it and you adjust your run by the viewing angle? That is exactly what humans do; that is the power of heuristics.

Compared with the more psychological, more popular 《决策与判断》, this book is more theoretical, with abundant and classic references, and it intersects with both AI and machine learning; there is a fair amount of mathematics. It consists of a dozen or so chapters, each by different authors, paper-like in rigor and free of padding, similar to The Psychology of Problem Solving. Well suited to geeks.

For those who cannot push through the technical details, I also recommend books like 《决策与判断》 (and popular primers such as 《别做正常的傻瓜》); they are enormously helpful for the decisions one makes in daily life. Human decision and judgment rely on many heuristics, and unfortunately a good number of them were shaped to fit the social environment of hundreds of thousands of years ago and no longer suit modern society. Knowing these flaws and blind spots of our own thinking helps greatly in becoming a good decision maker, and it is a genuinely fascinating field in its own right.

(The end)
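The ball-catching strategy in the story above, run so that the gaze angle to the ball stays constant, can be checked with a few lines of simulation. All the numbers below (gravity without air resistance, launch velocity, the fielder's starting position) are made-up illustrative values; the point is only that freezing the height-to-distance ratio steers the fielder to the landing point with no trajectory prediction at all.

```python
g, vx, vy = 9.8, 12.0, 15.0        # gravity and the ball's launch velocity
t_land = 2 * vy / g                # when the ball returns to the ground
x_land = vx * t_land               # where it lands

fielder = 50.0                     # fielder's starting position
t = t_land / 2                     # start tracking halfway through the flight
h = vy * t - 0.5 * g * t * t
c = h / (vx * t - fielder)         # freeze the gaze ratio height/distance

dt = 0.001
while t < t_land:
    t += dt
    h = max(vy * t - 0.5 * g * t * t, 0.0)
    fielder = vx * t - h / c       # run so the ratio stays equal to c

err = abs(fielder - x_land)
print(err)  # ~0.005: the fielder converges on the landing point
```

If the ratio height/distance is held at c, then as the height goes to zero the distance to the ball must go to zero too, so the fielder arrives exactly as the ball does, which is the whole trick.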