RcppMeCab is an Rcpp wrapper for MeCab, a morphological analyzer and
part-of-speech tagger. It supports native UTF-8 encoding in its C++ code
and works with the CJK (Chinese, Japanese, and Korean) MeCab libraries.
The package takes advantage of the speed Rcpp brings to R computation
to analyze texts faster.
Please see this for easy installation and usage examples in Korean.
RcppMeCab builds MeCab from source at install time. The MeCab variant
is selected by the MECAB_LANG environment variable:
| MECAB_LANG | Backend | Version | Source |
|---|---|---|---|
| ko (default) | mecab-ko-msvc | 0.999 | Pusnow/mecab-ko-msvc |
| ja | MeCab | 0.996 | taku910/mecab |
On Linux and macOS, if MeCab is already installed system-wide
(detected via mecab-config), RcppMeCab uses the system
installation regardless of MECAB_LANG.
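
To anticipate which MeCab build an installation is likely to pick up, the rules above can be approximated from R before installing. This is a minimal sketch using only base R. The helper name `guess_mecab_backend()` is hypothetical and not part of RcppMeCab, treating an empty MECAB_LANG as the default is an assumption, and the package's actual build-time logic may differ in detail.

```r
# Hypothetical helper (not part of RcppMeCab): approximates the selection
# rules described above using base R only.
guess_mecab_backend <- function() {
  # Sys.which() returns "" when the program is not found on the PATH.
  has_system_mecab <- nzchar(Sys.which("mecab-config"))
  is_unix <- .Platform$OS.type == "unix"  # covers Linux and macOS

  if (is_unix && has_system_mecab) {
    return("system")  # a system installation wins regardless of MECAB_LANG
  }

  # Otherwise fall back to MECAB_LANG; Korean is the documented default.
  lang <- Sys.getenv("MECAB_LANG", unset = "ko")
  if (!nzchar(lang)) lang <- "ko"  # assumption: empty value means default
  switch(lang,
    ko = "mecab-ko-msvc 0.999",
    ja = "MeCab 0.996",
    stop("unsupported MECAB_LANG: ", lang)
  )
}

guess_mecab_backend()
```

Running this before `install.packages()` gives a quick hint as to whether setting MECAB_LANG will have any effect on the current machine.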
RcppMeCab automatically downloads and builds MeCab from source if it is not already installed on your system. No manual MeCab installation is required.
install.packages("RcppMeCab") # install from CRAN
# or install the development version
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab")

If you already have MeCab installed (e.g. via
brew install mecab on macOS, or
apt install libmecab-dev on Linux), RcppMeCab will use your
system installation.
Set MECAB_LANG before installation to choose the MeCab
variant:
# Korean (default)
install.packages("RcppMeCab", type = "source")
# Japanese
Sys.setenv(MECAB_LANG = "ja")
install.packages("RcppMeCab", type = "source")

A MeCab dictionary is automatically downloaded and installed during package installation:

- Korean (MECAB_LANG=ko, default): mecab-ko-dic (pre-compiled, from mecab-ko-msvc releases)
- Japanese (MECAB_LANG=ja): IPAdic (compiled from source during installation)

The bundled dictionary is stored in the package’s dic/
directory and used automatically — no manual dictionary setup is
required.
You can download and install dictionaries for other languages after
installation using download_dic(). No system-level MeCab
installation is required — dictionary compilation is handled entirely
within R.
download_dic("ja") # download and compile Japanese IPAdic
download_dic("ko") # download Korean mecab-ko-dic
download_dic("zh") # download and compile Chinese mecab-jieba

Dictionaries are stored in the user data directory
(tools::R_user_dir("RcppMeCab", "data")) and persist across
R sessions.
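
To check where downloaded dictionaries end up on a given machine, the user data directory can be inspected with base R alone. The sketch below assumes one subdirectory per language code, which is suggested by the paths shown in this document but is not guaranteed. `tools::R_user_dir()` requires R 4.0.0 or later.

```r
# Locate the per-user dictionary directory described above.
dic_root <- tools::R_user_dir("RcppMeCab", which = "data")
print(dic_root)

# List the language subdirectories, if any dictionaries were downloaded.
# Assumption: each downloaded dictionary lives in its own subdirectory.
installed <- if (dir.exists(dic_root)) {
  basename(list.dirs(dic_root, recursive = FALSE))
} else {
  character(0)
}
print(installed)
```

An empty result simply means that no dictionary has been downloaded for the current user yet.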
Use list_dic() to see all installed dictionaries:
list_dic()
#> lang name path active
#> 1 bundled bundled /path/to/RcppMeCab/dic TRUE
#> 2 ja ipadic ~/.local/share/R/RcppMeCab/ja FALSE
#> 3 ko mecab-ko-dic ~/.local/share/R/RcppMeCab/ko FALSE
#> 4 zh mecab-jieba ~/.local/share/R/RcppMeCab/zh FALSE

This package has pos and posParallel
functions.
pos(sentence) # returns a list
pos(sentence, join = FALSE) # morphemes only (tags as vector names)
pos(sentence, format = "data.frame") # returns a data frame
pos(sentence, user_dic = "path") # with a compiled user dictionary
posParallel(sentence) # parallelized, faster for large inputs

Use the lang parameter to select a dictionary by
language:
pos("東京は日本の首都です。", lang = "ja")
pos("안녕하세요", lang = "ko")
pos("我是中国人。", lang = "zh")

Or set a default with set_dic():
set_dic("ja")
pos("東京は日本の首都です。") # uses Japanese dictionary
set_dic("ko")
pos("안녕하세요") # uses Korean dictionary
set_dic("bundled") # switch back to the build-time dictionary

You can also specify a custom dictionary path directly:
pos("text", sys_dic = "/path/to/custom-dic")
options(mecabSysDic = "/path/to/custom-dic")

- sentence: text to analyze
- join: if TRUE (default), output is morpheme/tag; if FALSE, output is morpheme with tag as attribute
- format: "list" (default) or "data.frame"
- lang: language code ("ja", "ko", or "zh") to select a dictionary installed via download_dic(). Overrides sys_dic when specified.
- sys_dic: directory containing dicrc, sys.dic, etc. Set a default with options(mecabSysDic = "/path/to/dic")
- user_dic: path to a user dictionary compiled by dict_index()

Note: provide full paths for sys_dic and user_dic (no tilde ~/ expansion).
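
Because tilde paths are not expanded, one way to keep `~/` in your own scripts is to expand it in R before calling pos(). The following is a minimal sketch: the dictionary locations are hypothetical placeholders, and the call is guarded so that it only runs when RcppMeCab and both paths are actually present.

```r
# Expand "~" to the home directory before handing paths to RcppMeCab.
# Both locations below are hypothetical placeholders.
sys_dic  <- path.expand("~/mecab/custom-dic")
user_dic <- path.expand("~/mecab/userdic.dic")

# Only call pos() when the package and both paths are available.
if (requireNamespace("RcppMeCab", quietly = TRUE) &&
    dir.exists(sys_dic) && file.exists(user_dic)) {
  RcppMeCab::pos("some text", sys_dic = sys_dic, user_dic = user_dic)
}
```

`normalizePath()` is an alternative when relative paths should also be resolved to absolute ones.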
RcppMeCab provides the dict_index() function to compile
user dictionaries directly from R, without needing the
mecab-dict-index command-line tool.
Prepare your entries as a CSV file (Japanese format, Korean format), then compile:
dict_index(
  dic_csv = "entries.csv",
  out_dic = "userdic.dic",
  dic_dir = "/path/to/mecab-dic"
)
# Then use the compiled dictionary:
pos("some text", user_dic = "userdic.dic")

Junhewk Kim (junhewk.kim@gmail.com), Taku Kudo
Akiru Kato, Patrick Schratz