Scientific Online Resource System

Известия на Съюза на учените – Варна. Серия „Икономически науки” (Izvestia of the Union of Scientists – Varna, Economic Sciences Series)

An unsupervised machine learning model for automatic syllabification of Bulgarian words

Krasen Penchev


There are many definitions of the syllable and much discussion about its role in the structure of spoken languages. Some linguists give it a central place in their theories. Given that native speakers of a language can divide its words into syllables, it can be concluded that the syllable is a structural entity of spoken languages. Automatic syllabification is, at least in theory, applicable to a broad range of problems. Unfortunately, it is not as widely used as one might expect. The main reasons for its low adoption are the small number and the low quality of the available training resources. This report presents a model for unsupervised automatic syllabification. The aim is to design a general-purpose model that addresses these problems in the context of the Bulgarian language. The presented method is not constrained by the volume of the training data or by the field of knowledge the data comes from.


syllabification, machine learning, automatic, unsupervised, model




