Building the seunjeon library
This document explains how to build the seunjeon library, a morphological analyzer implemented in Scala (Java), and how to update its dictionary. Version 1.5.0, released to the Maven central repository, bundles its own dictionary data. This article summarizes the process of building the library and generating the dictionary data.
# 1. Preparation
Prepare the following development environment.
- IntelliJ - Since the source code is written in Scala and is built with sbt (the Scala build tool), we will work in IntelliJ (Community Edition).
## 1.1. IntelliJ Scala Plugin
Once you have downloaded and installed IntelliJ, set up Scala development.
Click `IntelliJ IDEA > Preferences` in the menu and select `Plugins` from the left menu.
Install the `Scala` plugin.
## 1.2. Repository URL
Open the seunjeon repository page and click the [CLONE] button on the right side of the page to copy the repository address.
Repository address: https://bitbucket.org/eunjeon/seunjeon.git
## 1.3. New Project
After installing IntelliJ, click `Clone Repository`.
Clone Repository:
Version Control: [Git]
URL : _________________
Directory:
Enter the repository address in the URL field and click the [CLONE] button.
The seunjeon repository is downloaded with the following structure.
```txt:Project Structure
.
├── CONTRIBUTORS.md
├── README.md
├── TODOS.md
├── build.sbt
├── elasticsearch
├── mecab-ko-dic -> mecab-ko-dic-2.0.1-20150920
├── project
│   ├── build.properties
│   └── plugins.sbt
├── scripts
└── src
```
# 2. sbt (build tool for Scala)
sbt is a build tool written in Scala that plays the same role as Maven or Gradle.
## 2.1. sbt setup and introduction
In the [Project Structure] above, the file named `build.sbt` corresponds to `build.gradle` in Gradle.
The following three files are involved in the build.
├── build.sbt
...
├── project
│   ├── build.properties
│   └── plugins.sbt
...
* `build.sbt` - describes the build procedure
* `build.properties` - the sbt version (1.0.4)
* `plugins.sbt` - the list of plugins to use. If you open the file, you can see that `"com.eed3si9n:sbt-assembly:0.14.6"` has been added.
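To check which sbt version a project pins without opening the IDE, the value can be read straight from `project/build.properties`. The following is a minimal shell sketch, not part of seunjeon: the function name `sbt_version_of` is made up for illustration, and it assumes the file contains the usual `sbt.version=...` line.

```shell
# Hypothetical helper (not part of seunjeon): print the sbt version
# pinned in a build.properties file.
# Assumes the file contains a line of the form "sbt.version=<version>".
sbt_version_of() {
  sed -n 's/^sbt\.version[[:space:]]*=[[:space:]]*//p' "$1"
}
```

Run from the repository root, `sbt_version_of project/build.properties` should print the version listed above.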
When you open the build file in IntelliJ, an error is shown because sbt has not been configured yet.
Normally, you would install sbt separately and run it from the CLI as follows.
```txt:sbt
sbt clean package
```
* Equivalent to `mvn clean package`.
However, IntelliJ provides the necessary tools and settings within the IDE.
In IntelliJ, click `sbt shell` in the lower right corner.

A prompt like the following appears.
(initializing)
* Installation and initialization take some time.
When initialization finishes, an error like the following appears.
...
[info] Updating {file://seunjeon/project/}seunjeon-build...
[info] Done updating.
//seunjeon/build.sbt:9: error: not found: value useGpg
useGpg := true,
^]
[error] sbt.compiler.EvalException: Type error in expression
[error] sbt.compiler.EvalException: Type error in expression
[error] Use 'last' for the full log.
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?
* `useGpg := true` - it looks like seunjeon added this setting to enable PGP signing when deploying jar files to a Maven repository. The key is not found because the sbt-pgp plugin that defines it is not installed.
Since we only need the build output (the jar), try one of the following two methods.
#### Method 1: Comment out the setting
Comment out `useGpg := true` in `build.sbt`.
```scala:Comment out
publishArtifact in Test := false,
// useGpg := true,
publishTo := {
```
#### Method 2: Add the PGP plugin
Open the `plugins.sbt` file and add the `sbt-pgp` plugin as follows.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
addSbtPlugin("com.jsuereth" % "sbt-pgp" % "1.1.0")
- Added `com.jsuereth:sbt-pgp:1.1.0`.
- Changed `useGpg` to `false`.
After changing the settings with either method, go back to the `sbt shell`, stop it, and start the shell again.
sbt shell:
[info] Loading settings from idea.sbt ...
[info] Loading global plugins from /Users/____/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/____/workspace/ws-scala/seunjeon/project
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Set current project to seunjeon (in build file:/Users/____/workspace/ws-scala/seunjeon/)
[IJ]sbt:seunjeon>
Run the `clean` and `package` commands in the sbt shell, in that order.
(sbt shell) clean command:
> clean
[success] Total time: 1 s, completed Nov 15, 2024 2:44:50 PM
[IJ]sbt:seunjeon>
(sbt shell) package command:
> package
[info] Updating {...}seunjeon...
[info] Done updating.
[info] Compiling 21 Scala sources and 1 Java source to /____/seunjeon/target/scala-2.12/classes ...
[info] Done compiling.
[info] Packaging /____/seunjeon/target/scala-2.12/seunjeon_2.12-1.5.0.jar ...
[info] Done packaging.
[success] Total time: 19 s, completed Nov 15, 2024 2:46:06 PM
[IJ]sbt:seunjeon>
The class files and the jar file are generated in the `target` directory as shown below.
output:
target
├── scala-2.12
│ ├── classes
│ │ └── ...class files
│ ├── resolution-cache
│ └── seunjeon_2.12-1.5.0.jar
However, if you look at the `seunjeon_2.12-1.5.0.jar` file, you will see that it is very small, at about 160 KB.
The jar file registered in the Maven repository is about 24 MB because it contains the dictionary.
java/main/resources/dictionary
+- termDict.dat
+- dictMapper.dat
+- trie.dat
...
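The size difference described above can be checked from a terminal. The sketch below is only an illustration: the helper names `jar_size_kb` and `jar_lacks_dictionary` are hypothetical, and the default threshold of 1024 KB is an arbitrary assumption chosen to sit between the two sizes mentioned above (about 160 KB and about 24 MB).

```shell
# Hypothetical helper: print a file's size in whole kilobytes (rounded down).
jar_size_kb() {
  # tr strips the padding that some wc implementations add around the count
  bytes=$(wc -c < "$1" | tr -d ' ')
  echo $((bytes / 1024))
}

# Hypothetical helper: succeed when the jar is smaller than the given size in KB.
# The default of 1024 KB is an assumption, not a value taken from seunjeon.
jar_lacks_dictionary() {
  [ "$(jar_size_kb "$1")" -lt "${2:-1024}" ]
}
```

For example, `jar_lacks_dictionary target/scala-2.12/seunjeon_2.12-1.5.0.jar && echo "no dictionary inside"` would flag the jar built in the previous step.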
## 2.2. Dictionary build
The `DictBuilder.scala` file already provides the functionality to build the dictionary.
DictBuilder.scala:
def main(args: Array[String]): Unit = {
  clear()
  copyUnkDef()
  copyLeftIdDef()
  copyRightIdDef()
  println("compiling lexicon dictionary...")
  buildLexiconDict()
  println("compiling connection-cost dictionary...")
  buildConnectionCostDict()
  println("complete")
}
- Creates the dictionary files inside `src/main/resources/dictionary`.
IntelliJ provides a UI for running the main method.

If you click the run button, however, it fails as follows.
Exception in thread "main" java.nio.file.NoSuchFileException: /____/seunjeon/mecab-ko-dic/unk.def
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.throwAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
at java.nio.file.Files.copy(Files.java:1274)
at org.bitbucket.eunjeon.seunjeon.DictBuilder$.copyDefFile(DictBuilder.scala:82)
at org.bitbucket.eunjeon.seunjeon.DictBuilder$.copyUnkDef(DictBuilder.scala:68)
at org.bitbucket.eunjeon.seunjeon.DictBuilder$.main(DictBuilder.scala:54)
at org.bitbucket.eunjeon.seunjeon.DictBuilder.main(DictBuilder.scala)
Process finished with exit code 1
`DictBuilder.main` reads the `.csv` and `.def` files needed to build the dictionary from the `mecab-ko-dic` directory.
```txt:Project Structure
.
├── mecab-ko-dic -> mecab-ko-dic-2.0.1-20150920
```
Note that `mecab-ko-dic` is a symbolic link, and the directory it points to does not actually exist.
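A link whose target is missing can be detected from the shell before running the builder. This is a small sketch; the function name `is_dangling_link` is made up for illustration and is not part of the seunjeon scripts.

```shell
# Hypothetical helper: succeed when the path is a symbolic link
# whose target does not exist.
is_dangling_link() {
  # -L tests the link itself; -e follows the link and tests its target
  [ -L "$1" ] && [ ! -e "$1" ]
}
```

From the repository root, `is_dangling_link mecab-ko-dic && echo "dictionary source is missing"` would report the situation described above.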
To register new words in the dictionary and build a new one, you need to download the dictionary data from the https://bitbucket.org/eunjeon/mecab-ko-dic repository.
- mecab-ko-dic - provides the dictionary (the 2018 release is the latest version)
Downloading is already handled by `scripts/download-dict.sh`.
download-dict.sh:
...
wget -O ${DICT_NAME}.tar.gz "https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/${DICT_NAME}.tar.gz"
tar zxvf $DICT_NAME.tar.gz
rm mecab-ko-dic
ln -sf $DICT_NAME mecab-ko-dic
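The last two lines of the script repoint the symbolic link at the newly extracted directory. The sketch below isolates that step so it can be tried in a scratch directory; the function name `relink` and its arguments are hypothetical, and it uses `rm -f` followed by `ln -s` rather than the script's exact `rm` and `ln -sf` commands.

```shell
# Hypothetical helper: point a symbolic link at a new target.
# The old link is removed first, mirroring the rm + ln steps of the script.
relink() {
  target=$1
  link=$2
  rm -f "$link"
  ln -s "$target" "$link"
}
```

After `relink mecab-ko-dic-2.1.1-20180720 mecab-ko-dic`, running `readlink mecab-ko-dic` should print the new directory name.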
You can run it as follows; you need to pass the version of the dictionary you want to download.
Download dictionary:
$ ./scripts/download-dict.sh dictionary-version
Copy the name of the most recent dictionary from the mecab-ko-dic downloads page.
mecab-ko-dic-2.1.1-20180720.tar.gz
mecab-ko-dic-2.1.0-20180716.tar.gz
...
Leave out the `.tar.gz` extension and pass the file name as follows.
Download dictionary:
$ ./scripts/download-dict.sh mecab-ko-dic-2.1.1-20180720
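If the archive name is copied from the downloads page, the extension can be dropped with shell parameter expansion instead of by hand. This is a minimal sketch; `dict_name_from_archive` is a made-up name and not part of the seunjeon scripts.

```shell
# Hypothetical helper: turn a downloaded archive name into the argument
# expected by download-dict.sh by removing a trailing ".tar.gz".
dict_name_from_archive() {
  printf '%s\n' "${1%.tar.gz}"
}
```

It could then be combined with the script as `./scripts/download-dict.sh "$(dict_name_from_archive mecab-ko-dic-2.1.1-20180720.tar.gz)"`.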
Execution result of the download script:
log:
$ ./scripts/download-dict.sh mecab-ko-dic-2.1.1-20180720
Resolving bitbucket.org (bitbucket.org)
HTTP request sent, awaiting response... 302 Found
...
HTTP request sent, awaiting response... 200 OK
Length: 49775061 (47M) [application/x-tar]
Saving to: 'mecab-ko-dic-2.1.1-20180720.tar.gz'
2024-11-15 15:30:47 (7.72 MB/s) - 'mecab-ko-dic-2.1.1-20180720.tar.gz' saved [49775061/49775061]
After the file was downloaded and extracted, the symbolic link was updated as well.
Dictionary structure:
├── mecab-ko-dic -> mecab-ko-dic-2.1.1-20180720
├── mecab-ko-dic-2.1.1-20180720
├── mecab-ko-dic-2.1.1-20180720.tar.gz
- `mecab-ko-dic` now links to the 2018 dictionary directory.
Running `DictBuilder.main` again creates the dictionary files under `src/main/resources/dictionary`.
DictBuilder.main log:
compiling lexicon dictionary...
INFO: csv parsing is completed. (191 ms)
INFO: terms & mapper building is completed. (800 ms)
INFO: added to trie builder (523 ms)
INFO: double-array trie building is completed. (21282 ms)
building LexiconDict OK. (termSize = 816243 mapper size = 774556)
compiling connection-cost dictionary...
INFO: connectionDict loading is completed. (25395 ms)
building connection cost dictionary OK. (rightSize : 3822, leftSize : 2693, size : 10292648)
complete
Dictionary data:
src/main/resources/
├── char.def
└── dictionary
    ├── connection_cost.dat
    ├── dictMapper.dat
    ├── left-id.def
    ├── right-id.def
    ├── termDict.dat
    ├── trie.dat
    └── unk.def
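To confirm that the builder produced every file, the listing above can be turned into a quick check. This is a sketch under stated assumptions: `missing_dict_files` is a hypothetical helper, and its file list is copied from the `dictionary` directory listing above rather than taken from the seunjeon source.

```shell
# Hypothetical helper: print the expected dictionary files that are missing
# under the given resources directory; prints nothing when all are present.
# The file names are copied from the directory listing above.
missing_dict_files() {
  for name in connection_cost.dat dictMapper.dat left-id.def right-id.def \
              termDict.dat trie.dat unk.def; do
    [ -f "$1/dictionary/$name" ] || echo "dictionary/$name"
  done
}
```

Running `missing_dict_files src/main/resources` from the repository root should print nothing after a successful dictionary build.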