Data Resources
Home
News
Products
Resources
Research
Education
Information
Contact
Links


Structured Bulgarian Corpus

The development of the Bulgarian structured linguistic corpus was one of the stages of the BalkaNet project. The corpus was created following the standards set by the authors of the Brown University corpus.

The corpus consists of 1,000,805 words extracted from texts published chiefly in electronic form. An important requirement, strictly observed in compiling the corpus, was that it include original Bulgarian texts. Some exceptions, however, had to be made: romance and western text excerpts were taken from foreign-language sources translated into Bulgarian, because original Bulgarian texts in these genres were lacking.

It was decided that the corpus should be divided into 500 text units of approximately 2000 words each, with sentence boundaries being observed. The majority of the texts contain more than 2000 words, and only a small number contain fewer than 2000.
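The sampling rule described above can be illustrated with a minimal sketch. It is not the procedure the corpus authors actually used: the function name cut_text_unit, the naive punctuation-based sentence splitter, and the choice to keep adding whole sentences until the target is reached are all assumptions made for illustration. Under these assumptions a unit usually ends slightly above the target, and a source text that is too short yields a unit below it.

```python
import re

def cut_text_unit(text, target_words=2000):
    """Cut one text unit of roughly target_words words, ending on a sentence boundary.

    Hypothetical helper: whole sentences are added until the running
    word count reaches target_words, so no sentence is ever split.
    """
    # Naive sentence split (an assumption): break after '.', '!' or '?'
    # when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    unit, count = [], 0
    for sentence in sentences:
        if count >= target_words:
            break
        unit.append(sentence)
        count += len(sentence.split())
    return " ".join(unit), count
```

For example, cut_text_unit("one two three. four five six. seven eight nine.", target_words=4) keeps the first two sentences, because the four-word target is only reached after the second sentence ends.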

The texts were sampled from 15 different text categories according to the model of the Brown corpus. The number of texts in each category varies:

 

Category                                     Texts   Category              Texts
A. PRESS: REPORTAGE                             44   J. LEARNED               80
B. PRESS: EDITORIAL                             27   K. FICTION: GENERAL      29
C. PRESS: REVIEWS                               17   L. FICTION: MYSTERY      24
D. RELIGION                                     17   M. FICTION: SCIENCE       6
E. SKILL AND HOBBIES                            36   N. FICTION: ADVENTURE    29
F. POPULAR LORE                                 48   P. FICTION: ROMANCE      29
G. BELLES-LETTRES                               75   R. HUMOR                  9
H. MISCELLANEOUS: GOVERNMENT & HOUSE ORGANS     30
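As a quick consistency check on the figures given on this page, the per-category counts can be summed and compared with the stated totals. The sketch below only restates numbers from the table and the text; the variable names are chosen for illustration.

```python
# Number of texts per category, copied from the table above.
TEXTS_PER_CATEGORY = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 36, "F": 48, "G": 75, "H": 30,
    "J": 80, "K": 29, "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

TOTAL_WORDS = 1_000_805  # corpus size stated in the text

category_count = len(TEXTS_PER_CATEGORY)       # number of text categories
total_texts = sum(TEXTS_PER_CATEGORY.values()) # total number of text units
average_words = TOTAL_WORDS / total_texts      # mean words per text unit

print(category_count, total_texts, round(average_words, 2))
```

The counts add up to 500 texts across 15 categories, and the average of about 2001.6 words per text agrees with the statement that most texts run slightly over 2000 words.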

     
