
Evaluation data for a classification model

2012-5-16 17:36 | Category: weka | System category: Research notes | Keywords: evaluation, model, Chinese text classification

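My earlier experiments all used small training sets. With a larger training set, MyEclipse kept failing with a heap space error, and editing the .ini file did not help (that file presumably sizes the IDE's own JVM, not the JVM of the program being launched).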

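Later I opened the run configuration in MyEclipse and, on the Arguments tab, set the VM arguments to -Xms500m -Xmx1024m. The error went away. The results of the run are below (the classifier is NaiveBayes). They are not very good (68.6% correctly classified) and need improvement. Also, the Detailed Accuracy By Class figures seem inconsistent with the Confusion Matrix, by a wide margin. Why?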

weka.filters.unsupervised.attribute.StringToWordVector in:9804
Number of instances: 9804
Number of attributes: 9302
=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall F-Measure   ROC Area Class
0.794     0.015      0.788     0.794     0.791      0.966    C11-Space
0.5       0.002      0.444     0.5       0.471      0.975    C15-Energy
0.704     0.005      0.264     0.704     0.384      0.981    C16-Electronics
0.68      0.003      0.395     0.68      0.5        0.99     C17-Communication
0.858     0.007      0.953     0.858     0.903      0.982    C19-Computer
0.545     0.006      0.234     0.545     0.327      0.945    C23-Mine
0.86      0.015      0.251     0.86      0.389      0.98     C29-Transport
0.726     0.015      0.8       0.726     0.761      0.951    C3-Art
0.72      0.026      0.799     0.72      0.757      0.946    C31-Enviornment
0.614     0.014      0.838     0.614     0.709      0.945    C32-Agriculture
0.762     0.044      0.773     0.762     0.767      0.924    C34-Economy
0.647     0.003      0.508     0.647     0.569      0.983    C35-Law
0.843     0.019      0.189     0.843     0.309      0.985    C36-Medical
0.892     0.064      0.096     0.892     0.173      0.969    C37-Military
0.432     0.016      0.754     0.432     0.549      0.832    C38-Politics
0.591     0.016      0.84      0.591     0.694      0.911    C39-Sports
0.636     0.008      0.212     0.636     0.318      0.971    C4-Literature
0.644     0.017      0.189     0.644     0.292      0.931    C5-Education
0.568     0.002      0.532     0.568     0.549      0.95     C6-Philosophy
0.584     0.038      0.436     0.584     0.499      0.908    C7-History
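
To make the F-Measure column easier to read, here is a minimal sketch that recomputes it as the harmonic mean of precision and recall for three rows copied from the table above. The function name `f_measure` and the choice of rows are only for illustration. The inputs are the rounded table values, so agreement is only expected to about three decimals.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-Measure), copied from the table above.
rows = {
    "C11-Space": (0.788, 0.794, 0.791),
    "C19-Computer": (0.953, 0.858, 0.903),
    "C37-Military": (0.096, 0.892, 0.173),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f_measure(p, r):.3f}, reported {reported}")
```

C37-Military shows why a high TP rate alone says little: recall is 0.892, but precision is only 0.096, so the F-Measure stays low.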

Correctly Classified Instances        6730               68.6455 %
Incorrectly Classified Instances      3074               31.3545 %
Kappa statistic                          0.6529
Mean absolute error                      0.0313
Root mean squared error                  0.1759
Relative absolute error                 35.2848 %
Root relative squared error             83.4657 %
Total Number of Instances             9804

=== Confusion Matrix ===

    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t   <-- classified as
508    1    3    5   31    2   12    0    2    1   11    0   16   30    0   17    0    0    1    0 |    a = C11-Space
    1   16    0    1    0    3    3    0    1    0    0    0    2    4    0    0    0    1    0    0 |    b = C15-Energy
    2    0   19    1    0    1    2    0    0    0    0    0    1    0    0    1    0    0    0    0 |    c = C16-Electronics
    0    0    3   17    0    0    2    0    0    0    0    0    1    1    0    0    0    1    0    0 |    d = C17-Communication
   59    0   33    9 1164    0    7    3    4    0   10    3   16   33    0    8    3    3    0    2 |    e = C19-Computer
    0    1    0    0    0   18    6    0    1    2    0    0    1    3    0    0    0    1    0    0 |    f = C23-Mine
    1    1    0    0    0    1   49    0    0    1    0    0    1    0    1    1    0    1    0    0 |    g = C29-Transport
    0    0    0    0    0    1    1 537    0    0    3    0   12   27   12   18   44   16    1   68 |    h = C3-Art
   43   14    5    0   17    9   21    0 876   34   63    7   60   20    0   30    2   16    0    0 |    i = C31-Enviornment
    2    0    2    2    0   14   25    0 165 627 112    4   22    4    1    9    1   20    0   11 |    j = C32-Agriculture
    1    0    4    6    2   25   41    3    3   78 1219   12   13   30   74   29    0   15    1   44 |    k = C34-Economy
    0    0    0    0    0    0    0    0    0    0    0   33    0   14    1    0    0    3    0    0 |    l = C35-Law
    0    0    0    0    0    0    0    0    0    0    0    0   43    4    0    1    0    3    0    0 |    m = C36-Medical
    0    1    0    0    0    0    0    0    0    0    0    0    0   66    0    1    0    5    0    1 |    n = C37-Military
    0    1    1    0    0    1    1    4    0    0   74    5    1 299 442    7    4   16   14 154 |    o = C38-Politics
   28    1    1    2    8    1   16   35   45    2   64    1   26 138   28 740    4   41    2   70 |    p = C39-Sports
    0    0    1    0    0    0    1    2    0    0    0    0    0    2    0    1   21    3    0    2 |    q = C4-Literature
    0    0    0    0    0    0    1    0    0    0    1    0    3    5    2    8    1   38    0    0 |    r = C5-Education
    0    0    0    0    0    0    0    1    0    0    1    0    0    0    4    1    1   11   25    0 |    s = C6-Philosophy
    0    0    0    0    0    1    7   86    0    3   19    0    9   11   21    9   18    7    3 272 |    t = C7-History
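
As a cross-check on the question raised earlier, the sketch below recomputes the per-class TP rate and precision, the overall accuracy, and the Kappa statistic from the confusion matrix. It assumes the usual Weka layout: each row is the actual class and each column is the class the instance was classified as. The helper names (`per_class_rates`, `cohen_kappa`) are made up for this illustration. The numbers are copied from the output above.

```python
# Confusion matrix from the Weka output: rows = actual class, columns = classified as.
MATRIX = [
    [508, 1, 3, 5, 31, 2, 12, 0, 2, 1, 11, 0, 16, 30, 0, 17, 0, 0, 1, 0],
    [1, 16, 0, 1, 0, 3, 3, 0, 1, 0, 0, 0, 2, 4, 0, 0, 0, 1, 0, 0],
    [2, 0, 19, 1, 0, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 3, 17, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
    [59, 0, 33, 9, 1164, 0, 7, 3, 4, 0, 10, 3, 16, 33, 0, 8, 3, 3, 0, 2],
    [0, 1, 0, 0, 0, 18, 6, 0, 1, 2, 0, 0, 1, 3, 0, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 1, 49, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 537, 0, 0, 3, 0, 12, 27, 12, 18, 44, 16, 1, 68],
    [43, 14, 5, 0, 17, 9, 21, 0, 876, 34, 63, 7, 60, 20, 0, 30, 2, 16, 0, 0],
    [2, 0, 2, 2, 0, 14, 25, 0, 165, 627, 112, 4, 22, 4, 1, 9, 1, 20, 0, 11],
    [1, 0, 4, 6, 2, 25, 41, 3, 3, 78, 1219, 12, 13, 30, 74, 29, 0, 15, 1, 44],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 33, 0, 14, 1, 0, 0, 3, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 43, 4, 0, 1, 0, 3, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 66, 0, 1, 0, 5, 0, 1],
    [0, 1, 1, 0, 0, 1, 1, 4, 0, 0, 74, 5, 1, 299, 442, 7, 4, 16, 14, 154],
    [28, 1, 1, 2, 8, 1, 16, 35, 45, 2, 64, 1, 26, 138, 28, 740, 4, 41, 2, 70],
    [0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 2, 0, 1, 21, 3, 0, 2],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 3, 5, 2, 8, 1, 38, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 4, 1, 1, 11, 25, 0],
    [0, 0, 0, 0, 0, 1, 7, 86, 0, 3, 19, 0, 9, 11, 21, 9, 18, 7, 3, 272],
]

# "TP Rate" and "Precision" columns as reported under Detailed Accuracy By Class.
REPORTED_TP = [0.794, 0.5, 0.704, 0.68, 0.858, 0.545, 0.86, 0.726, 0.72, 0.614,
               0.762, 0.647, 0.843, 0.892, 0.432, 0.591, 0.636, 0.644, 0.568, 0.584]
REPORTED_PRECISION = [0.788, 0.444, 0.264, 0.395, 0.953, 0.234, 0.251, 0.8, 0.799, 0.838,
                      0.773, 0.508, 0.189, 0.096, 0.754, 0.84, 0.212, 0.189, 0.532, 0.436]


def per_class_rates(matrix):
    """Return (tp_rates, precisions): diagonal / row sum and diagonal / column sum."""
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    tp_rates = [matrix[i][i] / row_sums[i] if row_sums[i] else 0.0 for i in range(n)]
    precisions = [matrix[i][i] / col_sums[i] if col_sums[i] else 0.0 for i in range(n)]
    return tp_rates, precisions


def cohen_kappa(matrix):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    chance = sum(
        sum(matrix[i]) * sum(matrix[r][i] for r in range(n)) for i in range(n)
    ) / (total * total)
    return (observed - chance) / (1 - chance)


tp_rates, precisions = per_class_rates(MATRIX)
total = sum(sum(row) for row in MATRIX)
correct = sum(MATRIX[i][i] for i in range(len(MATRIX)))

print(f"correct {correct} of {total} ({100 * correct / total:.4f} %)")
print(f"kappa {cohen_kappa(MATRIX):.4f}")
print("max TP rate gap:", max(abs(a - b) for a, b in zip(tp_rates, REPORTED_TP)))
print("max precision gap:", max(abs(a - b) for a, b in zip(precisions, REPORTED_PRECISION)))
```

Under this reading, the diagonal sums to 6730 of 9804 instances and the recomputed TP rates and precisions agree with the reported ones up to rounding, so the two sections appear to be consistent. For example, C11-Space has 508 on the diagonal, a row sum of 640 (508/640 ≈ 0.794), and a column sum of 645 (508/645 ≈ 0.788).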

The rows and columns of the Weka output do not line up as plain text, so a screenshot is attached.