CJKV Smaji

Situations

1 Ideographic variation

Unicode set up the Ideographic Variation Database (IVD) to record variation information, but it has shortcomings: the ideographs it covers are far from complete. New glyphs can be registered on demand, but the registration procedure takes a long time, and so does waiting for the result to be published.

All of this makes it inadequate for scholars who need to handle new glyphs immediately. In some cases the de facto encoding used by scholars is a raw image, which is unsuitable for batch processing, efficient data exchange, indexing, and searching. As a result, it is currently not possible to digitize this raw material accurately and build a comprehensive database of ancient literature.

Besides, several questions must be answered for every candidate glyph: Is it already included? Is it a newly found character? Is it a variant of an included character? Including a new character is not an easy task, especially when the number of included characters already approaches the hundreds of thousands.

2 Font

After a new version of Unicode is published and software is updated, programs can process the new characters automatically. But we still lack the glyph information on which fonts are built. Characters can be displayed only once the font files are installed.

Designing and creating a font that contains hundreds of thousands of glyphs requires a lot of manpower.

3 Input method

Encoding, displaying, and ...? Yes, we also need an input method to enter these characters.

Inputting common, everyday characters is a solved problem. Scholars are less fortunate: they still enter characters by drawing, taking photos, and cutting and pasting images, just as their predecessors did.

4 Information distribution

After new glyphs are included and published, developers of operating systems, system software, and word processors adopt the newly published Unicode technical reports and databases, then develop updated versions of their infrastructure and software. Delivery is therefore unlikely to be immediate. The strategy of shipping updated items in one big release implies that we lack a tool to sync data items among all users progressively. Partially updated items are of little use because they are not widely recognized.

Solutions

1 Ideographic variation

Smaji CJKV Glyph Sample

To manage the glyphs, the CJKV sample glyph library has been set up; it contains the corresponding images of the encoded glyphs. To import new glyphs quickly, users can submit an issue to start the inclusion process. New glyphs are first checked by AI to detect whether they duplicate glyphs that are already included, which produces a similarity report against the existing glyphs; they can then be included after manual inspection. Included variants are encoded with variation selectors in descending order to stay compatible with Unicode. New characters are placed in a private use area to avoid future conflicts with Unicode. If Unicode later includes such a character, a new entry is recorded in a mapping database to reflect the association.
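The allocation rules above can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual implementation: it assumes that "descending order" means handing out supplementary variation selectors from U+E01EF downward, and that new characters go into Supplementary Private Use Area-A. The `Registry` class and all of its method names are hypothetical.

```python
# Supplementary variation selectors (VARIATION SELECTOR-17 .. -256).
VS_FIRST = 0xE0100
VS_LAST = 0xE01EF
# Assumption: new characters are placed in Supplementary Private Use Area-A.
PUA_FIRST = 0xF0000
PUA_LAST = 0xFFFFD


class Registry:
    """Hypothetical in-memory registry for variants and new characters."""

    def __init__(self):
        self.selectors_used = {}   # base code point -> set of selectors in use
        self.next_pua = PUA_FIRST  # next free private use code point
        self.pua_to_unicode = {}   # mapping database: PUA -> official code point

    def register_variant(self, base):
        """Return base + the highest variation selector still free for it."""
        used = self.selectors_used.setdefault(base, set())
        for selector in range(VS_LAST, VS_FIRST - 1, -1):
            if selector not in used:
                used.add(selector)
                return chr(base) + chr(selector)
        raise ValueError(f"no free variation selector for U+{base:04X}")

    def register_new_character(self):
        """Allocate the next private use code point for an unencoded character."""
        if self.next_pua > PUA_LAST:
            raise ValueError("private use area exhausted")
        code_point = self.next_pua
        self.next_pua += 1
        return code_point

    def record_unicode_assignment(self, pua, official):
        """Record that Unicode later encoded the character at `official`."""
        self.pua_to_unicode[pua] = official

    def resolve(self, code_point):
        """Map a private use code point to its official one, if known."""
        return self.pua_to_unicode.get(code_point, code_point)
```

Allocating from the top of the selector range downward would leave the low end of the range free for sequences assigned from the bottom up, which is one plausible reading of how descending order keeps compatibility.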

2 Font

With the sample library in place, sample fonts can be produced. We just need to overcome or work around one small problem: the 64k limit on the number of glyphs in a single font.
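One possible workaround for the 64k limit, not necessarily the one this project uses, is to split the glyph set across several font files. The sketch below assumes the limit is 65,535 glyphs per font (glyph indices are 16-bit) and that one slot per font is reserved for the `.notdef` glyph; `split_into_fonts` is a hypothetical helper name.

```python
# Assumption: a single font holds at most 65,535 glyphs (16-bit glyph index).
MAX_GLYPHS_PER_FONT = 65535
# One slot in every font is reserved for the mandatory .notdef glyph.
RESERVED_GLYPHS = 1


def split_into_fonts(code_points, capacity=MAX_GLYPHS_PER_FONT - RESERVED_GLYPHS):
    """Partition code points into sorted chunks that each fit in one font."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ordered = sorted(set(code_points))
    return [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]


# Example: 150,000 private use code points need three font files.
chunks = split_into_fonts(range(0xF0000, 0xF0000 + 150000))
print([len(chunk) for chunk in chunks])
```

Sorting before splitting keeps each font's coverage contiguous, so a font fallback chain can be configured by code point range.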

Another direction to develop is AI-assisted font design. In future development, we will explore community-driven, AI-assisted font co-creation.

3 Input method

Chinese character input methods fall into several categories: phonetic-based, shape-based, hybrid, irrational code, handwriting recognition, and voice recognition. Because there are many homophones, shape-based and irrational-code methods are better suited to inputting a large character set.

Irrational code is suited to handling uncoded characters (those the input method has not yet assigned a code to) and difficult-to-input words. Normally, a more convenient shape-based input method can be used. For a large character set, the logical consistency of the input method is very important, and the designs of Zhengma and Cangjie meet this requirement very well.

Smaji CJKV Glyph Sample

This project collects input method codes. Submissions of input method codes are welcome.
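As a rough illustration of how collected codes could be used, the sketch below stores shape-based codes in a lookup table and returns candidate characters for an exact code or a prefix. The `CodeTable` class is hypothetical, and the handful of Cangjie-style entries are included only as sample data; a real table should come from the collected submissions.

```python
from collections import defaultdict


class CodeTable:
    """Hypothetical table mapping input method codes to candidate characters."""

    def __init__(self):
        self._by_code = defaultdict(list)

    def add(self, code, char):
        """Register `char` under `code`, ignoring duplicate submissions."""
        if char not in self._by_code[code]:
            self._by_code[code].append(char)

    def lookup(self, code):
        """Return the candidates for an exact code, in submission order."""
        return list(self._by_code.get(code, []))

    def complete(self, prefix):
        """Return sorted (code, char) pairs whose code starts with `prefix`."""
        return sorted(
            (code, char)
            for code, chars in self._by_code.items()
            if code.startswith(prefix)
            for char in chars
        )


table = CodeTable()
# Sample entries only; verify against an authoritative code table.
for code, char in [("a", "日"), ("b", "月"), ("ab", "明"),
                   ("d", "木"), ("dd", "林"), ("ddd", "森")]:
    table.add(code, char)

print(table.lookup("ab"))
print(table.complete("d"))
```

Because several characters may share one code, each code maps to a list of candidates rather than a single character.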

4 Information distribution

The basis of communication is consensus: common grammar, common vocabulary, and common encoding. All the information we enter, including codes, fonts, and input method codes, needs to reach consensus among users quickly. We have therefore also built a cross-platform synchronization program to synchronize the corresponding data. In addition, a font generation service is provided to generate custom fonts for text publishing.
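The idea of progressive synchronization can be sketched with revision numbers: every change gets an increasing revision, and a client asks only for records newer than the last revision it has seen. This is a minimal sketch under that assumption and does not describe the actual synchronization program; `Record` and `Store` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str       # e.g. a code point or an input method code
    value: str     # the data item attached to the key
    revision: int  # revision at which this value was written


class Store:
    """Hypothetical store that can export and import incremental changes."""

    def __init__(self):
        self.records = {}
        self.revision = 0

    def put(self, key, value):
        """Write a value locally and stamp it with a new revision."""
        self.revision += 1
        self.records[key] = Record(key, value, self.revision)

    def changes_since(self, revision):
        """Return records newer than `revision`, oldest first."""
        newer = [r for r in self.records.values() if r.revision > revision]
        return sorted(newer, key=lambda r: r.revision)

    def apply(self, changes):
        """Merge incoming records, keeping the newest revision per key."""
        for record in changes:
            current = self.records.get(record.key)
            if current is None or current.revision < record.revision:
                self.records[record.key] = record
            self.revision = max(self.revision, record.revision)


server, client = Store(), Store()
server.put("U+F0000", "glyph image 1")
client.apply(server.changes_since(client.revision))
server.put("U+F0001", "glyph image 2")
# Only the second record is transferred on the next sync.
delta = server.changes_since(client.revision)
client.apply(delta)
```

With this scheme each sync transfers only the delta, so newly included items can reach users without waiting for one big release.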