Resources
As part of its activities, the Bulgarian Association for Computational Linguistics has created various linguistic resources needed for the implementation of application products. The resources include reading materials such as lectures and articles, as well as electronic databases of text corpora and of phonetic, morphological and syntactic rules. Besides their educational value, the resources provide opportunities for participation in future research and development projects on a par with international teams.
Corpus of Bulgarian language texts
The corpus was collected within the BalkaNet project. It is structured according to the standards of the Brown University Corpus and comprises 1000805 words extracted mainly from electronic texts. Only original Bulgarian texts were included in the corpus. It consists of 500 text units belonging to 15 categories, each unit being approximately 2000 words long.
More information about the corpus is available here.
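The corpus figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch in Python, not part of the corpus tooling. The constant and function names are illustrative assumptions, and only the two numbers taken from the description (1000805 words, 500 text units) come from the source.

```python
# Figures quoted in the corpus description; the names are illustrative.
TOTAL_WORDS = 1_000_805  # total size of the corpus in words
TEXT_UNITS = 500         # number of text units in the corpus


def average_unit_length(total_words: int, units: int) -> float:
    """Return the mean number of words per text unit."""
    if units <= 0:
        raise ValueError("units must be a positive integer")
    return total_words / units


if __name__ == "__main__":
    mean_length = average_unit_length(TOTAL_WORDS, TEXT_UNITS)
    # Prints the mean unit length rounded to two decimal places.
    print(f"{mean_length:.2f} words per text unit")
```

The result is consistent with the stated unit length of approximately 2000 words. The description does not say how the 500 units are distributed over the 15 categories, so the sketch makes no assumption about that.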
Frequency Dictionary of Bulgarian word forms
The Frequency Dictionary was compiled from an analysis of texts totalling around 30 million words and comprises approximately 230000 word forms. The word forms are ordered by frequency, and each occurs at least twice in the corpus. According to the corpus data, the most frequent word in the Bulgarian language is the preposition "na", with 45.29 occurrences per 1000 words, followed by the conjunction "i" with 30.81 occurrences per 1000 words.
Lectures by John Sinclair, professor of Language Studies
Prof. J. Sinclair is the conceptual originator of the Collins Cobuild Dictionary and a leading scholar in Corpus Linguistics. The lectures were delivered from 21 to 24 October 2002 as part of a seminar at Sofia University "St. Kliment Ohridski". They focused on the creation, processing and analysis of texts, the lexical item and its semantics, the origin of meaning, questions of speech, and stylistic problems.
More information about the lectures is available here.
Lectures by Prof. Max Silberztein
Prof. Max Silberztein is the author of the INTEX system. The lectures, which comprehensively present the capabilities of the system, were given specially for the Master's programme in Computational Linguistics. The full text can be downloaded from here.
Lectures by Prof. Kjetil Rå Hauge
Prof. Kjetil Rå Hauge, the renowned philologist from Oslo University and vice president of the Bulgarian Studies Association, gave the lecture "Corpuses and Corpus Linguistics in Norway" within the Master's programme in Computational Linguistics. The full text of the lecture can be found here.
Selected publications
- Christian Strohmaier, Christoph Ringlstetter, Klaus U. Schulz and Stoyan Mihov, Lexical postcorrection of OCR-results: The web as a dynamic secondary dictionary? Proceedings of the 7th International Conference on Document Analysis and Recognition, ICDAR'03.
- Tinko Tinchev, Stoyan Mihov, Svetla Koeva, Angel Genov, Logic for WordNet. Annuaire Univ. Sofia, Fac. Math. Inf., vol. 95, 2002 (in print).
- Klaus U. Schulz, Stoyan Mihov, Fast string correction with Levenshtein automata, IJDAR 5 (2002) 1, 67-85. Paper
- Stoyan Mihov, Denis Maurel, Direct Construction of Minimal Acyclic Subsequential Transducers, Implementation and Application of Automata, S. Yu, A. Păun (Eds.), LNCS 2088, Springer 2001. gzipped postscript (140KB)
- Svetla Koeva, Rules for end-of-line word hyphenation, Bulgarian language magazine, 2000, book 2.
- Jan Daciuk, Stoyan Mihov, Bruce Watson and Richard Watson, Incremental Construction of Minimal Acyclic Finite State Automata, Computational Linguistics, Volume 26, Issue 1, March 2000. gzipped postscript (58KB)
- Svetla Koeva, Grammar Dictionary of the Bulgarian Language. Description of the principles of organization of the linguistic data. Bulgarian language magazine, 1998, book 6.