「MMSEG」- A Word Identification System for Mandarin Chinese Text

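This post is incomplete; it is only a set of working notes. For me, the two most important parts are "Creating and testing the dictionary file" and "Format of the unigram.txt file".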

What is MMSEG?

See the MMSEG home page: http://technology.chtsai.org/mmseg

Installing MMSEG

Installing, configuring, and using Coreseek + Mmseg for Chinese word segmentation in Sphinx: https://blog.csdn.net/l1028386804/article/details/48897589

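Creating and testing the dictionary file

Run the following command to build the binary dictionary file (the format of unigram.txt is described in a later section):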

# mmseg -u unigram.txt

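This command generates unigram.txt.uni. Rename that file to uni.lib; the test below reads the dictionary from the directory passed to -d.

Run the following commands to test the dictionary: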

# echo "金交所" > whatever.txt

# mmseg -d /usr/local/mmseg3/etc whatever.txt

If the dictionary contains "金交所", the output looks like this:

金交所 /x

Word Splite took: 0 ms.

If the dictionary does not contain "金交所", the output looks like this:

金 /x 交 /x 所 /x

Word Splite took: 0 ms.

Here the word has been split into single characters.
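The difference between the two outputs can be illustrated with a small Python sketch of greedy forward maximum matching. This is a simplification for illustration only: MMSEG itself uses a more elaborate algorithm (chunks of up to three words plus disambiguation rules), and the function name `segment` and the parameter `max_word_len` are hypothetical, not part of mmseg.

```python
def segment(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching.

    At each position take the longest dictionary word; fall back to a
    single character when no dictionary word starts there.
    """
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # single-character fallback
        # Try the longest candidate first, down to two characters.
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


print(segment("金交所", {"金交所"}))  # ['金交所']
print(segment("金交所", set()))       # ['金', '交', '所']
```

With the word in the dictionary it comes back as one token; without it, the fallback emits one token per character, which mirrors the two mmseg outputs shown above.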

Format of the unigram.txt file

The following excerpt is taken from unigram.txt:

阿宝 1

x:1

阿西吧 1

x:1

阿华 1

x:1

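Note that the word and the number 1 that follows it must be separated by a single tab character (ASCII TAB, 0x09), never a space. The following PHP script, words2mmseg.php, can generate the unigram.txt file: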

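<?php
// Convert a plain word list (one word per line) into the unigram.txt
// format consumed by "mmseg -u".
// Usage: php words2mmseg.php -s <source file> -o <target file>
$options = getopt("s:o:");

// Fall back to default file names when an option is omitted.
$sourcefile = isset($options['s']) ? $options['s'] : "words.txt";
$targetfile = isset($options['o']) ? $options['o'] : "mmseg-dict.txt";

convert_file($sourcefile, $targetfile);

function convert_file($sourcefile, $targetfile) {
    $rhandle = fopen($sourcefile, "r");
    if ($rhandle === false) {
        echo "Error: cannot open $sourcefile for reading\n";
        return;
    }
    $whandle = fopen($targetfile, "w");
    if ($whandle === false) {
        echo "Error: cannot open $targetfile for writing\n";
        fclose($rhandle);
        return;
    }
    while (($buffer = fgets($rhandle, 4096)) !== false) {
        $line = trim($buffer, "\r\n\t ");
        // Skip blank lines so they do not become empty dictionary entries.
        if ($line === "") {
            continue;
        }
        // Each entry is "<word><TAB>1" followed by an "x:1" line.
        fwrite($whandle, "$line\t1\r\nx:1\r\n");
    }
    // fgets() returns false both at end of file and on a read error.
    if (!feof($rhandle)) {
        echo "Error: unexpected fgets() fail\n";
    }
    fclose($rhandle);
    fclose($whandle);
}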

Run the script from a shell:

# php words2mmseg.php -s source.txt -o unigram.txt

Here source.txt is the original word list, in the following format:

阿宝

阿西吧

阿华

… (omitted) …

That is, a plain list of words, one per line.
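Because a space in place of the tab is easy to miss by eye, it can help to check the generated file before running `mmseg -u`. The following Python sketch checks only the two-line layout seen in the excerpt (a word, a tab, a number, then an `x:` line). The function name `check_unigram` is hypothetical, and treating the values after the tab and after `x:` as arbitrary digits is an assumption generalized from the excerpt, where both are always 1.

```python
import re

# Assumed layout, generalized from the excerpt:
#   "<word><TAB><number>" followed by "x:<number>"
ENTRY_RE = re.compile(r"[^\t]+\t\d+")
TAG_RE = re.compile(r"x:\d+")


def check_unigram(text):
    """Return (line_number, message) pairs for lines that break the layout."""
    problems = []
    lines = text.splitlines()  # handles both "\n" and "\r\n" endings
    for index in range(0, len(lines), 2):
        if not ENTRY_RE.fullmatch(lines[index]):
            problems.append((index + 1, "expected '<word><TAB><number>'"))
        if index + 1 >= len(lines):
            problems.append((index + 2, "missing 'x:<number>' line"))
        elif not TAG_RE.fullmatch(lines[index + 1]):
            problems.append((index + 2, "expected 'x:<number>'"))
    return problems


good = "阿宝\t1\r\nx:1\r\n阿华\t1\r\nx:1\r\n"
bad = "阿宝 1\r\nx:1\r\n"  # space instead of tab

print(check_unigram(good))  # []
print(check_unigram(bad))   # reports a problem on line 1
```

To check a real file, pass its contents to the function, for example `check_unigram(open("unigram.txt", encoding="utf-8").read())`; the UTF-8 encoding here is an assumption about how the word list was saved.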
