RcppMeCab


RcppMeCab is an Rcpp wrapper for MeCab, a morphological analyzer and part-of-speech tagger. It handles UTF-8 natively in its C++ code and supports the MeCab library for CJK (Chinese, Japanese, and Korean) text. The analysis runs in compiled C++ code through Rcpp, which speeds up text analysis in R.

Please see this for easy installation and usage examples in Korean.

MeCab backends

RcppMeCab builds MeCab from source at install time. The MeCab variant is selected by the MECAB_LANG environment variable:

| MECAB_LANG   | Backend       | Version | Source               |
|--------------|---------------|---------|----------------------|
| ko (default) | mecab-ko-msvc | 0.999   | Pusnow/mecab-ko-msvc |
| ja           | MeCab         | 0.996   | taku910/mecab        |

On Linux and macOS, if MeCab is already installed system-wide (detected via mecab-config), RcppMeCab uses the system installation regardless of MECAB_LANG.
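To anticipate which path an installation will take, you can check from R whether mecab-config is visible on the PATH. The following is a minimal sketch using only base R. The helper name has_system_mecab is hypothetical and not part of RcppMeCab, and the package's own build script may detect MeCab differently.

```r
# Hypothetical helper (not part of RcppMeCab): reports whether a
# mecab-config executable is visible on the PATH of this R session.
has_system_mecab <- function() {
  path <- unname(Sys.which("mecab-config"))
  !is.na(path) && nzchar(path)
}

if (has_system_mecab()) {
  message("mecab-config found: a system MeCab is likely to be used on Linux/macOS.")
} else {
  # Assumes "ko" is the default when MECAB_LANG is unset, as described above.
  message("mecab-config not found; MECAB_LANG = ",
          Sys.getenv("MECAB_LANG", unset = "ko"))
}
```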

Installation

Linux, macOS, and Windows

RcppMeCab automatically downloads and builds MeCab from source if it is not already installed on your system. No manual MeCab installation is required.

install.packages("RcppMeCab") # install from CRAN

# or install the development version
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab")

If you already have MeCab installed (e.g. via brew install mecab on macOS, or apt install libmecab-dev on Linux), RcppMeCab will use your system installation.

Language selection

Set MECAB_LANG before installation to choose the MeCab variant:

# Korean (default)
install.packages("RcppMeCab", type = "source")

# Japanese
Sys.setenv(MECAB_LANG = "ja")
install.packages("RcppMeCab", type = "source")

Dictionary

A MeCab dictionary is automatically downloaded and installed during package installation. The bundled dictionary is stored in the package's dic/ directory and used automatically, so no manual dictionary setup is required.

Downloading additional dictionaries

You can download and install dictionaries for other languages after installation using download_dic(). No system-level MeCab installation is required — dictionary compilation is handled entirely within R.

download_dic("ja") # download and compile Japanese IPAdic
download_dic("ko") # download Korean mecab-ko-dic
download_dic("zh") # download and compile Chinese mecab-jieba

Dictionaries are stored in the user data directory (tools::R_user_dir("RcppMeCab", "data")) and persist across R sessions.
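To see where that directory is on your machine, you can query it from base R. This is a small sketch that assumes R 4.0.0 or later (where tools::R_user_dir() is available) and that downloaded dictionaries sit in per-language subdirectories, as the paths in the list_dic() output suggest. The variable name dic_root is only illustrative.

```r
# Directory that R assigns to RcppMeCab for persistent user data (R >= 4.0.0).
dic_root <- tools::R_user_dir("RcppMeCab", which = "data")
print(dic_root)

# List any subdirectories found there; nothing is printed if the
# directory does not exist yet (for example, before any download).
if (dir.exists(dic_root)) {
  print(list.dirs(dic_root, full.names = FALSE, recursive = FALSE))
}
```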

Use list_dic() to see all installed dictionaries:

list_dic()
#>      lang         name                              path active
#> 1 bundled      bundled /path/to/RcppMeCab/dic              TRUE
#> 2      ja       ipadic ~/.local/share/R/RcppMeCab/ja      FALSE
#> 3      ko mecab-ko-dic ~/.local/share/R/RcppMeCab/ko      FALSE
#> 4      zh  mecab-jieba ~/.local/share/R/RcppMeCab/zh      FALSE

Usage

The package provides two tagging functions, pos() and posParallel():

pos(sentence)                        # returns a list
pos(sentence, join = FALSE)          # morphemes only (tags as vector names)
pos(sentence, format = "data.frame") # returns a data frame
pos(sentence, user_dic = "path")     # with a compiled user dictionary
posParallel(sentence)                # parallelized, faster for large inputs

Switching languages

Use the lang parameter to select a dictionary by language:

pos("東京は日本の首都です。", lang = "ja")
pos("안녕하세요", lang = "ko")
pos("我是中国人。", lang = "zh")

Or set a default with set_dic():

set_dic("ja")
pos("東京は日本の首都です。") # uses Japanese dictionary
set_dic("ko")
pos("안녕하세요")              # uses Korean dictionary
set_dic("bundled")             # switch back to the build-time dictionary

You can also specify a custom dictionary path directly:

pos("text", sys_dic = "/path/to/custom-dic")  # for a single call
options(mecabSysDic = "/path/to/custom-dic")  # as the default system dictionary

Parameters

Note: provide full paths for sys_dic and user_dic. A leading tilde (~/) is not expanded.
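If your dictionary lives under your home directory, one way to satisfy this is to expand the path in R before passing it on. The sketch below uses base R's path.expand(). The file location is hypothetical, and the call to pos() is guarded so the snippet does nothing when the package or the file is missing.

```r
# Resolve "~" in R first, since sys_dic and user_dic do not expand it.
user_dic_path <- path.expand("~/mecab/userdic.dic")  # hypothetical location

# Only call pos() when the package and the dictionary file are available.
if (requireNamespace("RcppMeCab", quietly = TRUE) && file.exists(user_dic_path)) {
  print(RcppMeCab::pos("some text", user_dic = user_dic_path))
}
```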

Compiling a user dictionary

RcppMeCab provides the dict_index() function to compile user dictionaries directly from R, without needing the mecab-dict-index command-line tool.

Prepare your entries as a CSV file (Japanese format, Korean format), then compile:

dict_index(
  dic_csv = "entries.csv",
  out_dic = "userdic.dic",
  dic_dir = "/path/to/mecab-dic"
)

# Then use the compiled dictionary (passed as a full path):
pos("some text", user_dic = normalizePath("userdic.dic"))

Authors

Junhewk Kim (junhewk.kim@gmail.com), Taku Kudo

Contributors

Akiru Kato, Patrick Schratz