This code snippet implements a parser for Japanese kanji annotated with furigana.
Format
Every kanji is wrapped in curly braces ({}). The kanji comes first, followed by a separator (|) and then its furigana.
These tokens are defined in an enum in the script:
You can change these tokens by updating the characters in the enum.
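As a minimal sketch of what such an enum could look like: the name `Token` and its members `Open`, `Close`, and `Separator` are assumptions made for illustration, and so is the `annotate` helper. The names used in the script may differ.

```typescript
// Hypothetical token enum: the names are illustrative, not taken from the script.
enum Token {
  Open = "{",
  Close = "}",
  Separator = "|",
}

// Hypothetical helper that wraps one kanji and its furigana with the configured tokens.
const annotate = (kanji: string, furigana: string): string =>
  `${Token.Open}${kanji}${Token.Separator}${furigana}${Token.Close}`;

console.log(annotate("漢", "かん") + annotate("字", "じ")); // {漢|かん}{字|じ}
```

Keeping the characters in one enum means that switching to a different syntax, for example square brackets, only touches these three values.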
Here are some examples:
{漢|かん}{字|じ}
{感|かん}{情|じょう}
{柱|はしら}
The parser also accepts kana characters anywhere in the word. For example:
{食|た}べる
{飛|と}べる
{切|き}り
{姪|めい}っ{子|こ}
Kanji characters are always required to have furigana ☝️
Dependencies
The script uses the wanakana library to check for valid kanji and kana characters.
Note: This dependency is optional. You may decide to implement isKanji and isKana on your own.
Below is the code extract from wanakana for isKanji and isKana (converted from JavaScript to TypeScript). You could use the code below instead of installing wanakana as a dependency:
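If you prefer to write these checks yourself, the sketch below shows one way to do it with Unicode ranges. It is a simplified stand-in and not the wanakana extract: the range bounds and the helper `everyCharIn` are assumptions, and wanakana may treat edge cases (such as the iteration mark 々) differently.

```typescript
// Inclusive UTF-16 code unit ranges. The bounds are assumptions for illustration.
type CodeRange = readonly [number, number];

const KANJI_RANGE: CodeRange = [0x4e00, 0x9fff]; // CJK Unified Ideographs block
const HIRAGANA_RANGE: CodeRange = [0x3041, 0x3096];
const KATAKANA_RANGE: CodeRange = [0x30a1, 0x30fc]; // also covers ・ and the long vowel mark ー

// True when the input is non-empty and every character falls inside one of the ranges.
const everyCharIn = (input: string, ranges: CodeRange[]): boolean => {
  if (input.length === 0) return false;
  for (let i = 0; i < input.length; i++) {
    const code = input.charCodeAt(i);
    if (!ranges.some(([start, end]) => code >= start && code <= end)) return false;
  }
  return true;
};

const isKanji = (input: string): boolean => everyCharIn(input, [KANJI_RANGE]);
const isKana = (input: string): boolean =>
  everyCharIn(input, [HIRAGANA_RANGE, KATAKANA_RANGE]);

console.log(isKanji("漢字")); // true
console.log(isKana("たべる")); // true
console.log(isKanji("た")); // false
```

This version only looks at characters in the Basic Multilingual Plane, which is enough for common kanji but skips the rarer extension blocks.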
Result
The result contains a list of objects (KanjiWord):
For example, the string かり{気|き}まに{配|くば}にん returns the following array:
You can use _tag to distinguish between kana and kanji:
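To illustrate how `_tag` can drive the distinction, here is a sketch that assumes a shape for `KanjiWord`. Only the `_tag` field comes from the description above; the tag values `"kana"` and `"kanji"`, the `text` and `furigana` fields, the grouping of consecutive kana into one entry, and the `readingOf` helper are all assumptions.

```typescript
// Assumed shape: only `_tag` is known from the text, the rest is hypothetical.
type KanjiWord =
  | { _tag: "kana"; text: string }
  | { _tag: "kanji"; text: string; furigana: string };

// What the result for かり{気|き}まに{配|くば}にん could look like under this assumed shape.
const sample: KanjiWord[] = [
  { _tag: "kana", text: "かり" },
  { _tag: "kanji", text: "気", furigana: "き" },
  { _tag: "kana", text: "まに" },
  { _tag: "kanji", text: "配", furigana: "くば" },
  { _tag: "kana", text: "にん" },
];

// Checking `_tag` narrows the union, so `furigana` is only accessible on kanji entries.
const readingOf = (words: KanjiWord[]): string =>
  words.map((word) => (word._tag === "kanji" ? word.furigana : word.text)).join("");

console.log(readingOf(sample)); // かりきまにくばにん
```

Because `_tag` is a literal type on each variant, TypeScript reports an error if you try to read `furigana` from a kana entry.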
Full script
The script exports a single parser function that accepts a string as input and returns a non-empty list of KanjiWord when successful, or an error otherwise.
Here is the full script:
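As a rough illustration of the behavior described above (not the script itself), a minimal parser could look like the sketch below. The name `parseFurigana`, the `ParseResult` type, the field names, the error messages, and the Unicode ranges are all assumptions. It uses no dependencies and groups consecutive kana into a single entry.

```typescript
// Assumed result shapes: only `_tag` is known from the text above.
type KanjiWord =
  | { _tag: "kana"; text: string }
  | { _tag: "kanji"; text: string; furigana: string };

type ParseResult =
  | { _tag: "success"; words: KanjiWord[] }
  | { _tag: "failure"; error: string };

const inRange = (ch: string, start: number, end: number): boolean => {
  const code = ch.charCodeAt(0);
  return code >= start && code <= end;
};
// Simplified range checks standing in for isKanji and isKana.
const isKanjiChar = (ch: string): boolean => inRange(ch, 0x4e00, 0x9fff);
const isKanaChar = (ch: string): boolean =>
  inRange(ch, 0x3041, 0x3096) || inRange(ch, 0x30a1, 0x30fc);
const allChars = (text: string, pred: (ch: string) => boolean): boolean =>
  text.length > 0 && text.split("").every(pred);

const fail = (error: string): ParseResult => ({ _tag: "failure", error });

const parseFurigana = (input: string): ParseResult => {
  const words: KanjiWord[] = [];
  let i = 0;
  while (i < input.length) {
    const ch = input.charAt(i);
    if (ch === "{") {
      // Annotated kanji: {kanji|furigana}
      const close = input.indexOf("}", i);
      if (close === -1) return fail(`Unclosed "{" at index ${i}`);
      const body = input.slice(i + 1, close);
      const sep = body.indexOf("|");
      if (sep === -1 || sep !== body.lastIndexOf("|")) {
        return fail(`Expected exactly one "|" in the group at index ${i}`);
      }
      const kanji = body.slice(0, sep);
      const furigana = body.slice(sep + 1);
      if (!allChars(kanji, isKanjiChar)) return fail(`Invalid kanji at index ${i}`);
      if (!allChars(furigana, isKanaChar)) return fail(`Invalid furigana at index ${i}`);
      words.push({ _tag: "kanji", text: kanji, furigana });
      i = close + 1;
    } else if (isKanaChar(ch)) {
      // Bare kana: append to the previous kana entry if there is one.
      const last = words[words.length - 1];
      if (last !== undefined && last._tag === "kana") last.text += ch;
      else words.push({ _tag: "kana", text: ch });
      i += 1;
    } else {
      // Anything else, including kanji without furigana, is rejected.
      return fail(`Unexpected character "${ch}" at index ${i}`);
    }
  }
  return words.length > 0 ? { _tag: "success", words } : fail("Empty input");
};

console.log(parseFurigana("{食|た}べる"));
console.log(parseFurigana("食べる")); // failure: the kanji has no furigana
```

Returning a tagged result instead of throwing keeps both outcomes visible in the type, so callers have to check `_tag` before reading `words`.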
Testing
The parser function has been tested on multiple inputs (using vitest):
All tests pass, covering general success and error cases when parsing.
Feel free to use this snippet in your own code 💁🏼‍♂️