
A few parameters of TermDocumentMatrix

11068 views | 2012-6-17 20:29 | Personal category: tm | System category: Research notes | Keywords: scholar | Chinese

Code snippets found online for processing word-segmented Chinese text often build the corpus with nothing more than

c <- Corpus(VectorSource(re))

and then call

TermDocumentMatrix(c)

to compute the term-document matrix.

One example is the blog post 中文文本挖掘小例子及程序 (a small Chinese text mining example with code):

http://blog.sina.com.cn/s/blog_04f7e6c10100pwt2.html
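As a minimal sketch of the corpus-building step, the snippet below joins already segmented words with spaces and checks that a whitespace tokenizer gives them back. The sample words and the names `segmented` and `corpus` are made up for illustration, `VCorpus` is used in place of `Corpus` as an assumption (recent tm versions may return a different corpus class from `Corpus(VectorSource(...))`), and a UTF-8 locale is assumed.

```r
library(tm)

# Hypothetical segmenter output: one character vector of words per document.
segmented <- list(
  c("我", "喜欢", "文本挖掘"),
  c("文本挖掘", "很", "有趣")
)

# Join each document's words with single spaces so that a whitespace
# tokenizer can recover them later.
re <- vapply(segmented, paste, character(1), collapse = " ")

# VCorpus is an assumption here; the snippet quoted above uses Corpus().
corpus <- VCorpus(VectorSource(re))

# Check that whitespace tokenization gives back the segmenter's words.
tokens <- scan_tokenizer(content(corpus[[1]]))
print(tokens)
```

If the tokens printed here do not match the segmenter's words, the term-document matrix built later will not match them either, so this is a cheap check to run first.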

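However, because of the particular nature of Chinese text, setting one or more of the following control parameters can avoid possible analysis errors (see termFreq {tm}). For instance, wordLengths defaults to c(3, Inf), so words shorter than three characters are discarded, and many Chinese words are only one or two characters long.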

removePunctuation

A logical value indicating whether punctuation characters should be removed from doc, a custom function which performs punctuation removal, or a list of arguments for removePunctuation. Defaults to FALSE.

removeNumbers

A logical value indicating whether numbers should be removed from doc or a custom function for number removal. Defaults to FALSE.

stopwords

Either a Boolean value indicating stopword removal using default language specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.

bounds

A list with a tag local whose value must be an integer vector of length 2. Terms that appear less often in doc than the lower bound bounds$local[1] or more often than the upper bound bounds$local[2] are discarded. Defaults to list(local = c(1,Inf)) (i.e., every token will be used).

wordLengths

An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
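To see what these options do to segmented Chinese text, here is a small sketch that builds the matrix twice, once with the defaults and once with adjusted options. The sample sentences are invented, `VCorpus` is an assumption (it routes each document through termFreq), and a UTF-8 locale is assumed. Whether `[[:punct:]]` matches full-width Chinese punctuation can depend on the platform, so only the word-length effect should be relied on here.

```r
library(tm)

# Hypothetical sample: two documents already segmented into
# space-separated words, with punctuation and a number as separate tokens.
re <- c("我 喜欢 文本挖掘 ， 2012 年 开始 学习",
        "文本挖掘 很 有趣 ！")

# VCorpus is an assumption; it processes each document with termFreq().
corpus <- VCorpus(VectorSource(re))

# Defaults: wordLengths = c(3, Inf), so one- and two-character
# words are dropped.
tdm_default <- TermDocumentMatrix(corpus)

# Keep short words, and strip punctuation and numbers.
tdm_short <- TermDocumentMatrix(corpus, control = list(
  wordLengths = c(1, Inf),
  removePunctuation = TRUE,
  removeNumbers = TRUE
))

# Compare the surviving vocabularies.
print(Terms(tdm_default))
print(Terms(tdm_short))
```

Comparing the two term lists shows how much of the vocabulary the default minimum word length removes.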

An example:

# ovid is assumed to be an existing tm corpus
library(tm)
dtm <- TermDocumentMatrix(ovid,
                          control = list(wordLengths = c(1, Inf),  # keep one-character words
                                         removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = FALSE))
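The example above leaves stopwords off. As far as I know tm does not ship a Chinese stopword list, so one option is to pass a custom character vector, as the parameter description allows. The sketch below does that and then inspects the result. The documents and the list `my_stopwords` are hypothetical, `VCorpus` is an assumption, and a UTF-8 locale is assumed.

```r
library(tm)

# Hypothetical segmented documents (space-separated words).
docs <- c("我 喜欢 文本 挖掘",
          "文本 挖掘 很 有趣",
          "我 的 文本")
corpus <- VCorpus(VectorSource(docs))

# Hypothetical custom stopword list for illustration only.
my_stopwords <- c("我", "的", "很")

tdm <- TermDocumentMatrix(corpus, control = list(
  wordLengths = c(1, Inf),   # keep one- and two-character words
  stopwords = my_stopwords   # drop the custom stopwords
))

# Terms are in rows and documents in columns.
m <- as.matrix(tdm)
print(rowSums(m))            # total count of each term

# Terms that occur at least twice across all documents.
frequent <- findFreqTerms(tdm, lowfreq = 2)
print(frequent)
```

Checking `rowSums` on the dense matrix is only practical for small corpora; `findFreqTerms` works on the sparse matrix directly.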

To repost this article, please contact the original author for permission and note that it comes from 王水's ScienceNet blog.
Link: https://m.sciencenet.cn/blog-461456-583133.html


Blog: 老码农分享 (http://blog.sciencenet.cn/u/seawan)
