Zipf Curves for Reuters RCV1

entity              types        tokens          slope
words               437,130      205,379,405     -1.194
character trigrams  127,940      1,212,265,893   -1.131
word bigrams        14,109,062   204,578,537     -0.847
collocates          53,487,011   2,029,768,012   -0.855
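
The page does not say how the slope values were obtained. One common way to get such a number, shown here purely as an assumption, is an ordinary least-squares fit of log frequency against log rank. The function name zipf_slope is hypothetical, and the sketch is not claimed to reproduce the figures in the table.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption: the source does not state its fitting method; this is
    one plausible choice, not necessarily the one used for the table.
    """
    # Rank 1 is the most frequent type; zero counts cannot be logged.
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two non-zero frequencies")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# An ideal Zipf distribution (frequency proportional to 1/rank)
# gives a slope of -1.
ideal = [1000.0 / rank for rank in range(1, 51)]
print(round(zipf_slope(ideal), 3))  # -1.0
```

A slope steeper than -1, as for words and character trigrams, means frequency falls off faster with rank than in the ideal case; a shallower one, as for word bigrams and collocates, means it falls off more slowly.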

The corpus used is the Reuters RCV1 corpus, generously made available by Reuters for research purposes.

Words are sequences of alphanumeric characters separated by non-alphanumeric characters. Character trigrams are sequences of three characters of any type. Word bigrams are sequences of two words. A collocate consists of an ordered tuple of two words occurring within 5 words of each other. Due to a bug, each collocate was counted twice. Words, word bigrams and collocates are case insensitive; character trigrams are case sensitive. In the listings below, the two words of a bigram or collocate are joined by X.
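
The four definitions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used to produce the counts: all function names are hypothetical, "alphanumeric" is taken to mean ASCII letters and digits, and each collocate pair is stored in alphabetical order because every collocate listed on this page appears that way.

```python
import re
from collections import Counter

WINDOW = 5  # "within 5 words of each other"


def words(text):
    # Maximal runs of letters and digits, lowercased (case insensitive).
    # Assumption: ASCII only.
    return re.findall(r"[a-z0-9]+", text.lower())


def char_trigrams(text):
    # Every run of three consecutive characters, case preserved.
    return [text[i:i + 3] for i in range(len(text) - 2)]


def word_bigrams(tokens):
    # Adjacent word pairs, joined with "X" as in the listings.
    return [a + "X" + b for a, b in zip(tokens, tokens[1:])]


def collocates(tokens, window=WINDOW):
    # Each word paired with the next `window` words.
    # Assumption: the pair is stored in alphabetical order.
    pairs = []
    for i, a in enumerate(tokens):
        for b in tokens[i + 1:i + 1 + window]:
            first, second = sorted((a, b))
            pairs.append(first + "X" + second)
    return pairs


sample = "The U.S. said on Monday that the deal is on."
tokens = words(sample)
print(Counter(tokens).most_common(3))
print(Counter(word_bigrams(tokens)).most_common(3))
print(Counter(collocates(tokens)).most_common(3))
print(Counter(char_trigrams(sample)).most_common(3))
```

Counted this way, each word contributes up to 5 collocate tokens, so doubling by the bug would give about 10 collocate tokens per word token. The table's ratio of collocate tokens to word tokens is roughly 9.9, which is consistent with that reading.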

Frequent words

9962126 the
4669299 to
4597386 of
3980699 in
3937837 a
3401789 and
2854820 quot
2494561 said
2210513 on
2146556 s

Rare words

10109
serpro
07228
30360
edfy
zamira
mcelveen
sne
burynsky
boyed

Frequent character trigrams

11753898 th(space)
11461352 (3 spaces)
10577223 he(space)
10194924 the
6447529 ed(space)
6126626 .(2 spaces)
5968829 (space)in
5594041 on(space)
5480050 (space)to
5419308 ing

Rare character trigrams

fn)
YZ0
'Ug
98I
u+G
/T(tab)
#.O
.sr
d.V
R'D

Frequent word bigrams

1009496 inXthe
1006372 ofXthe
392532 saidXthe
380587 forXthe
378852 onXthe
363797 toXthe
354546 uXs
319069 quotXthe
269755 inXa
261691 nXa

Rare word bigrams

productsXsafely
forXdiablo
cardXjavi
comicXeffort
backgroundsXwho
liberaXsaid
diplomatsXrepresentatives
clothingXeven
orXn64
appointmentXwomen

Frequent collocates

7374980 ofXthe
5128198 inXthe
4905362 theXto
4045358 theXthe
3335894 andXthe
3203058 quotXthe
3104726 saidXthe
2759990 aXthe
2557234 onXthe
2300660 aXof

Rare collocates

582Xfond
colleyvilleXdevel
londonXseiji
condensorXkill
grofmanXremoves
341Xzypro
gauntletXpoliticians
forwardXthrifts
aboutXterrie
monthXsaraj
