Resources

As part of its activities, the Bulgarian Association for Computational Linguistics has created various linguistic resources needed for implementing application products. The resources include reading materials such as lectures and articles, as well as electronic databases of text corpora, phonetic, morphological and syntactic rules, etc. Besides their educational value, the resources open opportunities to take part in future research and development projects on a par with international teams.

Corpus of Bulgarian language texts 
The corpus was collected within the BalkaNet project. It follows the structure of the Brown University Corpus and comprises 1,000,805 words, extracted mainly from electronic texts. Only original Bulgarian texts were included. The corpus consists of 500 text units in 15 categories, each unit approximately 2,000 words long. More information about the corpus is available here.

Frequency Dictionary of Bulgarian word forms
The Frequency Dictionary was compiled from an analysis of about 30 million words of text and comprises approximately 230,000 word forms. The word forms are ordered by frequency, and each occurs at least twice in the corpus. According to the corpus data, the most frequent word in Bulgarian is the preposition "na", with 45.29 occurrences per 1,000 words, followed by the conjunction "i", with 30.81 occurrences per 1,000 words.
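
The way such a list is derived can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the dictionary: the function name frequency_list is hypothetical, tokens are assumed to be already split into word forms, and the sample sentence is toy data rather than corpus material. It counts each word form, drops forms seen fewer than two times, and reports the rate per 1,000 words.

```python
from collections import Counter


def frequency_list(tokens, min_count=2):
    """Return (word_form, count, per_thousand) rows, most frequent first.

    Hypothetical helper: `tokens` is a list of already tokenised word forms.
    Forms occurring fewer than `min_count` times are left out.
    """
    counts = Counter(tokens)
    total = len(tokens)
    rows = [
        (form, n, 1000 * n / total)  # occurrences per 1,000 running words
        for form, n in counts.items()
        if n >= min_count
    ]
    # Sort by descending count; break ties alphabetically for stable output.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Toy input (not real corpus data), split on whitespace for simplicity.
sample = "na i na v na i da na".split()
for form, n, rate in frequency_list(sample):
    print(f"{form}\t{n}\t{rate:.2f}")
```

On a real corpus the tokenisation step would need to handle punctuation, capitalisation and Cyrillic text, which this sketch deliberately leaves out.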

Lectures by John Sinclair, professor in Language studies

Prof. J. Sinclair is the driving force behind the Collins Cobuild Dictionary and a leading scholar in corpus linguistics. He delivered the lectures from 21 to 24 October 2002 as part of a seminar at Sofia University "St. Kliment Ohridski". The lectures focused on the creation, processing and analysis of texts, the lexical item and its semantics, the origin of meaning, questions of speech, and stylistic problems. More information about the lectures is available here.

Lectures by Prof. Max Silberztein

Prof. Max Silberztein is the author of the INTEX system. The lectures, which give a comprehensive presentation of the system's capabilities, were prepared specially for the Master's programme in Computational Linguistics. The full text can be downloaded from here.

Lectures by Prof. Kjetil Rå Hauge

Prof. Kjetil Rå Hauge, a noted philologist from Oslo University and vice president of the Bulgarian Studies Association, gave the lecture "Corpuses and Corpus Linguistics in Norway" within the Master's programme in Computational Linguistics. The full text of the lecture can be found here.

Selected publications

  • Christian Strohmaier, Christoph Ringlstetter, Klaus U. Schulz and Stoyan Mihov, Lexical postcorrection of OCR-results: The web as a dynamic secondary dictionary? Proceedings of the 7th International Conference on Document Analysis and Recognition ICDAR'03.
  • Tinko Tinchev, Stoyan Mihov, Svetla Koeva, Angel Genov, Logic for WordNet. Annuaire Univ. Sofia, Fac. Math. Inf., vol. 95, 2002 (in press).
  • Klaus U. Schulz, Stoyan Mihov, Fast string correction with Levenshtein automata, IJDAR 5 (2002) 1, 67-85.
  • Stoyan Mihov, Denis Maurel, Direct Construction of Minimal Acyclic Subsequential Transducers, Implementation and Application of Automata, S. Yu, A. Păun (Eds.), LNCS 2088, Springer 2001.
  • Svetla Koeva, Rules for end-of-line word hyphenation, Bulgarian language magazine, 2000, book 2.
  • Jan Daciuk, Stoyan Mihov, Bruce Watson and Richard Watson, Incremental Construction of Minimal Acyclic Finite State Automata, Computational Linguistics, Volume 26, Issue 1, March 2000.
  • Svetla Koeva, Grammar Dictionary of the Bulgarian Language. Description of the principles of organization of the linguistic data. Bulgarian language magazine, 1998, book 6.