The development of a structured Bulgarian linguistic corpus was one of the stages of the
BalkaNet project. The corpus was built following the standards set by the authors of the Brown University corpus. It consists of 1,000,805 words extracted from texts published chiefly in
electronic form. An important requirement, strictly observed in compiling the corpus, was that it include only original Bulgarian texts. Some exceptions, however, had to be made: excerpts in the romance and western genres
were taken from foreign-language sources translated into Bulgarian, because original Bulgarian texts in these genres were lacking.
It was decided that the corpus should be divided into 500 text
units of approximately 2000 words each, with sentence boundaries observed. The majority of the texts contain more than 2000 words, and only a small number contain fewer than 2000.
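The sampling rule described above can be illustrated with a minimal sketch. It rests on assumptions not stated in the text: that a unit is taken from the start of a source text, that it is closed at the first sentence end at or after the 2000-word mark, and that sentences end at ".", "!" or "?" followed by whitespace. The names `split_sentences` and `sample_unit` are hypothetical and are not taken from the project. Under these assumptions a unit usually runs slightly over 2000 words, and falls short only when the source text itself is shorter.

```python
import re

TARGET_WORDS = 2000  # approximate unit size given in the text


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sample_unit(text, target=TARGET_WORDS):
    """Collect whole sentences from the start of `text` until the running
    word count reaches `target`. The last sentence is never cut, so the
    unit may exceed `target`; a short source yields a shorter unit."""
    unit, count = [], 0
    for sentence in split_sentences(text):
        unit.append(sentence)
        count += len(sentence.split())
        if count >= target:
            break
    return " ".join(unit), count


# Small demonstration with a target of 5 words instead of 2000.
unit, n_words = sample_unit("One two three. Four five six seven. Eight nine.", target=5)
print(n_words)  # 7: the second sentence is kept whole
```

A real pipeline would need a sentence splitter that handles Bulgarian abbreviations and quotation conventions; the regular expression here is only a placeholder.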
The texts were sampled from
15 different text categories, following the model of the Brown Corpus. The number of texts in each category varies: