Building the seunjeon library

This document explains how to build the seunjeon library, a morphological analyzer implemented in Scala (a JVM language), and how to update its dictionary. Version 1.5.0, released in the Maven central repository, bundles its own dictionary data. This article summarizes the process of building the library and generating that dictionary data.

1. Preparation

Prepare the following development environment.

  • IntelliJ - The source code is written in Scala and built with sbt (the Scala build tool), so we will work in IntelliJ (Community Edition).
  • for example

1.1. IntelliJ Scala Plugin

Once you have downloaded and installed IntelliJ, set up Scala development.

Click IntelliJ IDEA > Preferences from the menu and select Plugins from the left menu.

Install the Scala plugin.

1.2. Repository URL

link to seunjeon repository

Click the [CLONE] button on the right side of the page to copy the repository address.

repository address
https://bitbucket.org/eunjeon/seunjeon.git

1.3. New Project

After installing IntelliJ, click Clone Repository.

Clone Repository
Version Control: [Git] URL : _________________ Directory:

Enter the repository address in the URL field and click the [CLONE] button.

The seunjeon repository is downloaded with the following structure.

```txt:Project Structure
.
├── CONTRIBUTORS.md
├── README.md
├── TODOS.md
├── build.sbt
├── elasticsearch
├── mecab-ko-dic -> mecab-ko-dic-2.0.1-20150920
├── project
│   ├── build.properties
│   └── plugins.sbt
├── scripts
└── src
```
2. sbt (build tool for Scala)

sbt is a build tool written in Scala that plays the same role as Maven or Gradle.

2.1. SBT setup and introduction

In the [Project Structure], the file named `build.sbt` corresponds to `build.gradle` in Gradle. The following three files are involved in the build.

```txt
├── build.sbt
...
├── project
│   ├── build.properties
│   └── plugins.sbt
...
```
  • build.sbt - describes the build procedure
  • build.properties - sbt version (1.0.4)
  • plugins.sbt - list of plugins to use. If you open the file, you'll see that `"com.eed3si9n:sbt-assembly:0.14.6"` has been added.

When you first open the file, IntelliJ reports an error because sbt has not been configured yet. Normally you would install sbt separately and run it from the CLI as follows.

```txt:sbt
sbt clean package
```

  • same as mvn clean package

However, IntelliJ provides the necessary tools and settings inside the IDE. In IntelliJ, click `sbt shell` in the lower right corner.

![sbt shell](https://firebasestorage.googleapis.com/v0/b/fb-backend-test.appspot.com/o/blog%2Fmenu_sbt_shell.png?alt=media&token=37fb2d20-e032-4ae9-8802-acc196085b4f)

You will see a prompt that looks like this.

```txt
(initializing)
```

  • initializing - Installation and initialization take time.

You will then get an error like this.

```txt
...
[info] Updating {file://seunjeon/project/}seunjeon-build...
[info] Done updating.
//seunjeon/build.sbt:9: error: not found: value useGpg
useGpg := true,
^
[error] sbt.compiler.EvalException: Type error in expression
[error] sbt.compiler.EvalException: Type error in expression
[error] Use 'last' for the full log.
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?
```
  • `useGpg := true` - enables PGP signing when deploying to a Maven repository

It looks like seunjeon added this setting to PGP-sign the jar files when publishing them to a Maven repository. Since we only need the build output (the jar), we can try one of the following two methods.

Method 1 Comment out useGpg

Comment out `useGpg := true`.

```scala:build.sbt
publishArtifact in Test := false,
// useGpg := true,
publishTo := {
```
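Method 1 can also be applied from a terminal instead of the editor. The following is a minimal sketch, assuming the setting appears in build.sbt on its own line as `useGpg := true,` (optionally indented). `comment_out_use_gpg` is a hypothetical helper name, not something provided by seunjeon or sbt.

```shell
# Comment out the "useGpg := true" setting in an sbt build file.
# comment_out_use_gpg is a hypothetical helper, not part of seunjeon or sbt.
comment_out_use_gpg() {
  file=$1
  tmp=$(mktemp) || return 1
  # Prefix the matching line with "// " while keeping its indentation.
  # Lines that are already commented out do not match and stay unchanged.
  sed -E 's|^([[:space:]]*)(useGpg[[:space:]]*:=[[:space:]]*true.*)$|\1// \2|' "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return $status
}

# Usage (from the repository root):
#   comment_out_use_gpg build.sbt
```

Writing through a temporary file avoids relying on `sed -i`, whose syntax differs between GNU and BSD/macOS sed.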

Method 2 Add the PGP plugin

Open the plugins.sbt file and add the sbt-pgp plugin like this.

```txt:plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
addSbtPlugin("com.jsuereth" % "sbt-pgp" % "1.1.0")
```
  • added com.jsuereth:sbt-pgp:1.1.0.
  • change useGpg to false.

After modifying the settings either way, go back to the sbt shell, stop it, and start the shell again.

sbt shell
```txt
[info] Loading settings from idea.sbt ...
[info] Loading global plugins from /Users/____/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/____/workspace/ws-scala/seunjeon/project
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Set current project to seunjeon (in build file:/Users/____/workspace/ws-scala/seunjeon/)
[IJ]sbt:seunjeon>
```

Execute the clean and package commands in the sbt shell, one after the other.

(sbt shell) clean command
```txt
> clean
[success] Total time: 1 s, completed Nov 15, 2024 2:44:50 PM
[IJ]sbt:seunjeon>
```
(sbt shell) package command
```txt
> package
[info] Updating {...}seunjeon...
[info] Done updating.
[info] Compiling 21 Scala sources and 1 Java source to /____/seunjeon/target/scala-2.12/classes ...
[info] Done compiling.
[info] Packaging /____/seunjeon/target/scala-2.12/seunjeon_2.12-1.5.0.jar ...
[info] Done packaging.
[success] Total time: 19 s, completed Nov 15, 2024 2:46:06 PM
[IJ]sbt:seunjeon>
```

The class files and the jar file are built in the target directory as shown below.

output
```txt
target
├── scala-2.12
│   ├── classes
│   │   └── ...class files
│   ├── resolution-cache
│   └── seunjeon_2.12-1.5.0.jar
```

However, if you look at the seunjeon_2.12-1.5.0.jar file, you'll see that it is very small, only 160 KB.

The jar file registered in the Maven repository is about 24 MB because it contains a dictionary.

```txt
java/main/resources/dictionary
+- termDict.dat
+- dictMapper.dat
+- trie.dat
...
```
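The size difference between the two jars can be checked from a terminal. This is only a sketch: `jar_size_kb` and `count_dictionary_entries` are hypothetical helper names, and the second one assumes that `unzip` is installed and that the dictionary files sit under a `dictionary/` path inside the jar.

```shell
# Print the size of a file in whole kilobytes (rounded down).
# jar_size_kb is a hypothetical helper name.
jar_size_kb() {
  [ -f "$1" ] || return 1
  # tr strips the padding that some wc implementations add.
  bytes=$(wc -c < "$1" | tr -d ' ')
  echo $((bytes / 1024))
}

# Count the archive entries whose path contains "dictionary/".
# Assumption: the dictionary resources are stored under a dictionary/ path in the jar.
count_dictionary_entries() {
  unzip -l "$1" | grep -c 'dictionary/'
}

# Usage (from the repository root):
#   jar_size_kb target/scala-2.12/seunjeon_2.12-1.5.0.jar
#   count_dictionary_entries target/scala-2.12/seunjeon_2.12-1.5.0.jar
```

A jar built without the dictionary step should report a small size and no dictionary entries.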

2.2. Pre-build

The DictBuilder.scala file already provides the functionality to build the dictionary.

DictBuilder.scala
```scala
def main(args: Array[String]): Unit = {
  clear()
  // copy the .def files used alongside the compiled dictionary
  copyUnkDef()
  copyLeftIdDef()
  copyRightIdDef()
  println("compiling lexicon dictionary...")
  buildLexiconDict()
  println("compiling connection-cost dictionary...")
  buildConnectionCostDict()
  println("complete")
}
```
  • creates the dictionary inside src/main/resources/dictionary.

IntelliJ provides a UI for executing the main method.

running scala file

If you click the run button, it fails like this.

```txt
Exception in thread "main" java.nio.file.NoSuchFileException: /____/seunjeon/mecab-ko-dic/unk.def
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.throwAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
    at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
    at java.nio.file.Files.copy(Files.java:1274)
    at org.bitbucket.eunjeon.seunjeon.DictBuilder$.copyDefFile(DictBuilder.scala:82)
    at org.bitbucket.eunjeon.seunjeon.DictBuilder$.copyUnkDef(DictBuilder.scala:68)
    at org.bitbucket.eunjeon.seunjeon.DictBuilder$.main(DictBuilder.scala:54)
    at org.bitbucket.eunjeon.seunjeon.DictBuilder.main(DictBuilder.scala)

Process finished with exit code 1
```

DictBuilder.main reads the .csv and .def files needed to build the dictionary from the mecab-ko-dic directory.

```txt:Project Structure
.
├── mecab-ko-dic -> mecab-ko-dic-2.0.1-20150920
```

Note that mecab-ko-dic is a symbolic link, and the directory it points to does not exist.
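A dangling link like this can be detected before running DictBuilder. The sketch below uses only the standard `-L` and `-e` file tests; `is_dangling_symlink` is a hypothetical helper name.

```shell
# Succeed only when the path is a symbolic link whose target does not exist.
# is_dangling_symlink is a hypothetical helper name.
is_dangling_symlink() {
  # -L: the path itself is a symbolic link.
  # -e: follows the link, so it fails when the target is missing.
  [ -L "$1" ] && [ ! -e "$1" ]
}

# Usage (from the repository root):
#   if is_dangling_symlink mecab-ko-dic; then
#     echo "mecab-ko-dic points to a missing directory"
#   fi
```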

To register new words (neologisms) in the dictionary and build a new one, you need to download the dictionary data from the https://bitbucket.org/eunjeon/mecab-ko-dic repository.

  • mecab-ko-dic - provides the dictionary (the 2018 release is the latest version)

This functionality is already provided by scripts/download-dict.sh.

download-dict.sh
```sh
...
wget -O ${DICT_NAME}.tar.gz "https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/${DICT_NAME}.tar.gz"
tar zxvf $DICT_NAME.tar.gz
rm mecab-ko-dic
ln -sf $DICT_NAME mecab-ko-dic
```

You can run it as follows, but you need to pass the version of the dictionary you want to download.

download dictionary
$ ./scripts/download-dict.sh dictionary-version

Copy the name of the most recent dictionary from the mecab-ko-dic downloads page.

```txt
mecab-ko-dic-2.1.1-20180720.tar.gz
mecab-ko-dic-2.1.0-20180716.tar.gz
...
```

Leave out the `.tar.gz` extension and enter the filename as follows.

download dictionary
$ ./scripts/download-dict.sh mecab-ko-dic-2.1.1-20180720
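Stripping the extension by hand is easy to get wrong, so it can be done with shell parameter expansion instead. This is a small sketch; `dict_name_from_archive` is a hypothetical helper name and is not part of the seunjeon scripts.

```shell
# Print an archive file name without its trailing ".tar.gz".
# dict_name_from_archive is a hypothetical helper name.
dict_name_from_archive() {
  # ${1%.tar.gz} removes the suffix only when it is present.
  printf '%s\n' "${1%.tar.gz}"
}

# Usage (from the repository root):
#   ./scripts/download-dict.sh "$(dict_name_from_archive mecab-ko-dic-2.1.1-20180720.tar.gz)"
```

A name that already lacks the extension is printed unchanged, so the helper is safe to apply to either form.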

execution result

log
```txt
$ ./scripts/download-dict.sh mecab-ko-dic-2.1.1-20180720
Resolving bitbucket.org (bitbucket.org)
HTTP request sent, awaiting response... 302 Found
...
HTTP request sent, awaiting response... 200 OK
Length: 49775061 (47M) [application/x-tar]
Saving to: 'mecab-ko-dic-2.1.1-20180720.tar.gz'
2024-11-15 15:30:47 (7.72 MB/s) - 'mecab-ko-dic-2.1.1-20180720.tar.gz' saved [49775061/49775061]
```

After the file was downloaded and extracted, the symbolic link was also updated.

Dictionary structure
```txt
├── mecab-ko-dic -> mecab-ko-dic-2.1.1-20180720
├── mecab-ko-dic-2.1.1-20180720
├── mecab-ko-dic-2.1.1-20180720.tar.gz
```
  • mecab-ko-dic now links to the 2018 dictionary directory.

Running DictBuilder.main again creates the dictionary files under src/main/resources/dictionary.

DictBuilder.main log
```txt
compiling lexicon dictionary...
INFO: csv parsing is completed. (191 ms)
INFO: terms & mapper building is completed. (800 ms)
INFO: added to trie builder (523 ms)
INFO: double-array trie building is completed. (21282 ms)
building LexiconDict OK. (termSize = 816243 mapper size = 774556)
compiling connection-cost dictionary...
INFO: connectionDict loading is completed. (25395 ms)
building connection cost dictionary OK. (rightSize : 3822, leftSize : 2693, size : 10292648)
complete
```
Dictionary data
```txt
src/main/resources/
├── char.def
└── dictionary
    ├── connection_cost.dat
    ├── dictMapper.dat
    ├── left-id.def
    ├── right-id.def
    ├── termDict.dat
    ├── trie.dat
    └── unk.def
```
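Before packaging the jar again, it may help to confirm that every file from the listing above was generated. The following sketch checks for the seven file names shown in that listing; `missing_dictionary_files` is a hypothetical helper name, and the default directory is the one this article uses.

```shell
# Print the name of each expected dictionary file that is missing from a directory.
# missing_dictionary_files is a hypothetical helper name.
# The file list is taken from the directory listing in this article.
missing_dictionary_files() {
  dir=${1:-src/main/resources/dictionary}
  for f in connection_cost.dat dictMapper.dat left-id.def right-id.def \
           termDict.dat trie.dat unk.def; do
    [ -f "$dir/$f" ] || echo "$f"
  done
}

# Usage (from the repository root):
#   missing_dictionary_files
# No output means all seven files are present.
```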