Zipf Curves for Reuters RCV1
| entity | types | tokens | slope |
| words | 437,130 | 205,379,405 | -1.194 |
| character trigrams | 127,940 | 1,212,265,893 | -1.131 |
| word bigrams | 14,109,062 | 204,578,537 | -0.847 |
| collocates | 53,487,011 | 2,029,768,012 | -0.855 |
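The page does not say how the slope values were obtained. A common approach, assumed here, is to sort the types by frequency, plot log frequency against log rank, and fit a straight line by least squares; under Zipf's law the slope of that line is close to -1. The sketch below follows that assumption. The function names `zipf_slope` and `type_token_counts` are hypothetical, and the fit uses every rank without binning or truncation, so it need not reproduce the figures in the table.

```python
import math


def type_token_counts(counts):
    """Return (types, tokens): distinct entities and total occurrences."""
    return len(counts), sum(counts.values())


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    `counts` maps each entity to its frequency and must hold at least
    two types. This is an assumed method, not the one confirmed by the
    source.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares: slope = covariance(x, y) / variance(x).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Frequencies that fall off exactly as 1/rank, for example 1200, 600, 400 and 300, give a slope of -1 up to floating-point error. A slope flatter than -1, as reported for word bigrams and collocates, means frequency decays more slowly with rank.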
The corpus used is the Reuters RCV1 corpus, generously made available by Reuters for research purposes.
Words are sequences of alphanumeric characters separated by non-alphanumeric characters. Character trigrams are sequences of three characters of any type. Word bigrams are sequences of two words. A collocate consists of an ordered tuple of two words occurring within 5 words of each other. Due to a bug, each collocate was counted twice.
Words, word bigrams and collocates are case insensitive; character trigrams are case sensitive.
In the lists below, the two words of a word bigram or collocate are joined by X, and whitespace in character trigrams is written out as (space) or (tab).
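The definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code that produced the counts: "alphanumeric" is taken to mean ASCII letters and digits, case insensitivity is implemented by lowercasing, and the two words of a collocate are sorted alphabetically, which is only inferred from the examples listed on this page (for instance theXto and aXof). All function names are hypothetical, and the double-counting bug is not reproduced.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def words(text):
    """Lowercased runs of alphanumeric characters."""
    return [w.lower() for w in WORD_RE.findall(text)]


def char_trigrams(text):
    """Every run of three consecutive characters, case preserved."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def word_bigrams(tokens):
    """Adjacent word pairs, joined by X as in the lists on this page."""
    return [a + "X" + b for a, b in zip(tokens, tokens[1:])]


def collocates(tokens, window=5):
    """Pairs of words at most `window` positions apart.

    Assumption: each pair is stored in alphabetical order. Every pair
    is emitted once here, so counts would be half those reported.
    """
    pairs = []
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window]:
            first, second = sorted((left, right))
            pairs.append(first + "X" + second)
    return pairs
```

Feeding the output of any of these functions to `collections.Counter` gives the frequency table for that entity. Under these assumptions, tokenising `U.S.` yields the words `u` and `s`, which is consistent with the bigram uXs appearing among the frequent word bigrams.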
Frequent words
9962126 the
4669299 to
4597386 of
3980699 in
3937837 a
3401789 and
2854820 quot
2494561 said
2210513 on
2146556 s
Rare words
10109
serpro
07228
30360
edfy
zamira
mcelveen
sne
burynsky
boyed
Frequent character trigrams
11753898 th(space)
11461352 (3 spaces)
10577223 he(space)
10194924 the
6447529 ed(space)
6126626 .(2 spaces)
5968829 (space)in
5594041 on(space)
5480050 (space)to
5419308 ing
Rare character trigrams
fn)
YZ0
'Ug
98I
u+G
/T(tab)
#.O
.sr
d.V
R'D
Frequent word bigrams
1009496 inXthe
1006372 ofXthe
392532 saidXthe
380587 forXthe
378852 onXthe
363797 toXthe
354546 uXs
319069 quotXthe
269755 inXa
261691 nXa
Rare word bigrams
productsXsafely
forXdiablo
cardXjavi
comicXeffort
backgroundsXwho
liberaXsaid
diplomatsXrepresentatives
clothingXeven
orXn64
appointmentXwomen
Frequent collocates
7374980 ofXthe
5128198 inXthe
4905362 theXto
4045358 theXthe
3335894 andXthe
3203058 quotXthe
3104726 saidXthe
2759990 aXthe
2557234 onXthe
2300660 aXof
Rare collocates
582Xfond
colleyvilleXdevel
londonXseiji
condensorXkill
grofmanXremoves
341Xzypro
gauntletXpoliticians
forwardXthrifts
aboutXterrie
monthXsaraj