Word frequency data


Most of the information on this website deals with data from the COCA corpus. You might also be interested in the word frequency data from the 14 billion word iWeb corpus.

This site contains what is probably the most accurate word frequency data for English. The data is based on the one billion word Corpus of Contemporary American English (COCA) -- the only corpus of English that is large, up-to-date, and balanced between many genres.

When you purchase the data, you have access to four different datasets, and you can use whichever ones are the most useful for you. Short samples are given below for each of these datasets, and you can also see much more complete samples (every tenth entry), as well as free copies of the top 5,000 entries for each list.

 1  The most basic data shows the frequency of each of the top 60,000 words (lemmas) in each of the eight main genres in the corpus. Unlike word frequency data that is based only on web pages, the COCA data lets you see the frequency across genres, so you can tell whether a word is more informal (e.g. blogs or TV and movie subtitles) or more formal (e.g. academic). The following are a few sample entries at different frequency levels (ranks 1-60,000).

rank lemma PoS freq # texts disp BLOG WEB TV/M SPOK FIC MAG NEWS ACAD
614 describe v 159521 81551 0.94 17718 25573 4270 15796 7866 20583 18065 49650
615 guess v 159454 74761 0.96 21706 15355 57719 28378 23413 5878 5453 1552
616 choice n 159277 82417 0.98 28879 23742 17114 16776 10416 21726 17835 22789
617 source n 158588 74656 0.95 23426 28366 5171 11870 4764 26220 18954 39817
618 mom n 158511 44697 0.95 12884 10313 66877 19934 25394 13766 8432 911
619 soon r 158194 95115 0.98 18711 19451 26696 14647 29098 22532 17933 9126
620 director n 158028 79105 0.94 14248 18521 5554 19383 5197 28196 51981 14948
15024 redhead n 1766 1209 0.90 96 101 432 86 761 167 95 28
15025 despair v 1766 1637 0.95 250 290 127 104 330 343 172 150
15026 pretentious j 1766 1420 0.94 293 384 261 93 252 203 185 95
15027 disservice n 1765 1580 0.94 437 292 53 322 44 206 276 135
15028 childlike j 1765 1510 0.94 209 232 94 106 498 239 214 172
15029 complicit j 1765 1450 0.93 510 330 83 230 99 164 131 218
15030 macaroni n 1765 1170 0.92 84 101 317 125 315 393 412 18
30005 glutamate n 372 159 0.77 40 67 11 19 0 101 8 126
30006 twisty j 372 341 0.89 50 42 26 6 100 104 36 8
30007 lyricism n 372 301 0.87 35 49 5 19 23 67 67 107
30008 peppery j 372 331 0.86 17 16 5 15 68 114 134 3
30009 firebird n 372 69 0.15 15 7 21 1 305 8 11 4
30010 wuss n 372 323 0.90 55 36 188 21 35 28 8 1
30011 strafe v 372 319 0.89 24 53 32 24 79 83 62 15
45003 thawing n 115 102 0.81 5 11 2 10 18 26 21 22
45004 sugarless j 115 97 0.82 12 7 20 1 21 38 12 4
45005 fold-up j 115 107 0.83 5 7 6 5 41 29 20 2
45006 energizing j 115 112 0.84 14 14 2 10 3 44 12 16
45007 endoplasmic j 115 65 0.64 5 36 4 0 0 14 0 56
45008 ejector n 115 93 0.80 10 9 27 3 8 41 7 10
45009 saliency n 115 76 0.69 3 9 0 6 0 1 1 95
60026 exudative j 45 16 0.25 1 8 2 0 1 5 0 28
60027 shakti n 45 21 0.44 25 1 0 0 0 10 3 6
60028 shearling j 45 41 0.73 2 1 1 3 15 18 4 1
60029 sheerly r 45 45 0.77 3 8 1 6 9 8 3 7
60030 short-selling n 45 37 0.67 4 11 1 1 0 10 14 4
60031 phytic j 45 19 0.48 10 16 0 0 0 4 0 15
60032 piedmontese j 45 31 0.68 2 13 1 0 5 6 10 8
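The genre comparison described above can be sketched in a few lines of Python. This is only an illustration, not the site's own tooling: the function names `parse_row` and `top_genre` are hypothetical, the rows are copied from the sample above, and the file is assumed to be whitespace-separated in the column order shown in the header. It compares raw counts, which is only meaningful if the genres are of roughly comparable size.

```python
# Column order taken from the sample header above.
GENRES = ["BLOG", "WEB", "TV/M", "SPOK", "FIC", "MAG", "NEWS", "ACAD"]

# Three rows copied from the sample table.
SAMPLE = """\
614 describe v 159521 81551 0.94 17718 25573 4270 15796 7866 20583 18065 49650
615 guess v 159454 74761 0.96 21706 15355 57719 28378 23413 5878 5453 1552
618 mom n 158511 44697 0.95 12884 10313 66877 19934 25394 13766 8432 911
"""


def parse_row(line):
    """Split one whitespace-separated row into named fields (hypothetical helper)."""
    fields = line.split()
    rank, lemma, pos, freq, texts, disp = fields[:6]
    counts = [int(value) for value in fields[6:14]]
    return {
        "rank": int(rank),
        "lemma": lemma,
        "pos": pos,
        "freq": int(freq),
        "texts": int(texts),
        "disp": float(disp),
        "genres": dict(zip(GENRES, counts)),
    }


def top_genre(row):
    """Return the genre with the highest raw count and its share of all occurrences."""
    genres = row["genres"]
    name = max(genres, key=genres.get)
    return name, genres[name] / sum(genres.values())


rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]
for row in rows:
    name, share = top_genre(row)
    print(f"{row['lemma']:<10} most frequent in {name} ({share:.0%} of occurrences)")
```

In these three rows the eight genre counts add up to the overall frequency, so the share is simply the fraction of a lemma's occurrences that fall in its most frequent genre.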

 2  Another dataset shows the frequency not only in the eight main genres, but also in nearly 100 "sub-genres" (e.g. Magazine-Sports, Newspaper-Finance, Academic-Medical, Web-Reviews, Blogs-Personal, and TV-Comedies).

 3  A third dataset shows the frequency of the word forms of the top 60,000 lemmas:

lemmaRank lemma PoS lemFreq wordFreq word form
13164 rehabilitate v 2286 1033 rehabilitate
13164 rehabilitate v 2286 749 rehabilitated
13164 rehabilitate v 2286 452 rehabilitating
13164 rehabilitate v 2286 52 rehabilitates
13165 subprime j 2286 2079 subprime
13165 subprime j 2286 207 sub-prime
13166 headline v 2285 999 headline
13166 headline v 2285 943 headlined
13166 headline v 2285 343 headlining
13167 blue-collar j 2285 2262 blue-collar
13167 blue-collar j 2285 23 bluecollar
13168 deduce v 2285 1088 deduce
13168 deduce v 2285 965 deduced
13168 deduce v 2285 136 deducing
13168 deduce v 2285 96 deduces
13169 oats n 2284 2284 oats
13170 stand-up j 2284 2284 stand-up
13171 squeak v 2283 894 squeaked
13171 squeak v 2283 593 squeaking
13171 squeak v 2283 583 squeak
13171 squeak v 2283 213 squeaks
13172 naming n 2283 2254 naming
13172 naming n 2283 29 namings
13173 toad n 2283 1483 toad
13173 toad n 2283 800 toads
13174 clockwise r 2282 2282 clockwise
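One way to use the word form data is to work out what share of a lemma's occurrences each form accounts for. The sketch below is an illustration only: `form_shares` is a hypothetical helper, the rows are copied from the sample above, and the file is assumed to be whitespace-separated with exactly six fields per row.

```python
from collections import defaultdict

# Rows copied from the sample table above.
SAMPLE = """\
13164 rehabilitate v 2286 1033 rehabilitate
13164 rehabilitate v 2286 749 rehabilitated
13164 rehabilitate v 2286 452 rehabilitating
13164 rehabilitate v 2286 52 rehabilitates
13165 subprime j 2286 2079 subprime
13165 subprime j 2286 207 sub-prime
"""


def form_shares(text):
    """Map (lemma, PoS) to {word form: wordFreq / lemFreq} (hypothetical helper)."""
    shares = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue
        _rank, lemma, pos, lem_freq, word_freq, form = line.split()
        shares[(lemma, pos)][form] = int(word_freq) / int(lem_freq)
    return dict(shares)


shares = form_shares(SAMPLE)
for (lemma, pos), forms in shares.items():
    for form, share in forms.items():
        print(f"{lemma} ({pos}): {form} {share:.1%}")
```

For the lemmas in this sample the word form frequencies add up to the lemma frequency, so the shares for each lemma sum to 1.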

 4  A final dataset shows the top 219,000 words in the billion word corpus -- each word that occurs at least 20 times and in at least 5 different texts. In this list, the words are not lemmatized (i.e. each form of a word is listed separately from its other forms) and the words are not tagged for part of speech. For each word, the list shows the genres in which it is most common (again, to show +/- formal) and what proportion of its occurrences are capitalized (useful for determining +/- proper noun; see daymond and dentzer below).

word rank word freq # texts % caps BLOG WEB TV/M SPOK FIC MAG NEWS ACAD
100033 datatypes 89 20 0.18 8 74 0 0 0 0 0 7
100034 daymond 89 40 1.00 13 9 0 30 1 16 17 3
100035 deductively 89 68 0.03 4 18 3 0 0 3 1 60
100036 delp 89 25 1.00 2 5 2 0 52 10 12 6
100037 demoed 89 81 0.02 24 18 7 4 2 33 0 0
100038 dentzer 89 40 1.00 0 20 0 46 0 20 2 1
100039 denys 89 53 0.94 2 18 1 0 2 28 14 23
100040 despatch 89 50 0.17 6 38 6 0 17 12 1 9
100041 digged 89 33 0.04 6 69 6 1 2 1 1 0
100042 diigo 89 32 0.98 10 12 0 0 0 0 0 67
100043 dilator 89 25 0.02 0 2 5 1 6 5 0 70
100044 dimitroff 89 43 1.00 4 4 0 0 0 23 54 4
100045 disasterous 89 86 0.01 41 42 1 0 1 0 3 1
100046 disgracefully 89 83 0.12 18 11 7 10 11 18 5 8
100047 dispersant 89 43 0.07 9 25 8 26 0 6 3 12
100048 do-overs 89 72 0.11 18 11 11 15 2 10 12 4
100049 docter 89 48 0.88 7 22 12 12 0 9 27 0
100050 dollarization 89 20 0.10 2 52 0 2 0 3 10 20
100051 doohickey 89 72 0.11 9 8 53 0 9 7 2 0
100052 doozies 89 80 0.01 21 9 13 10 9 14 7 1
100053 dorrie 89 26 1.00 1 2 7 2 64 2 11 0
100054 dort 89 39 0.82 1 6 20 7 5 2 16 29
100055 doster 89 41 1.00 1 7 0 0 0 6 62 13
100056 dowler 89 40 1.00 4 14 0 13 6 17 20 15
100057 drainer 89 71 0.12 3 11 14 4 40 9 3 5
100058 drams 89 56 0.51 12 2 4 1 11 16 5 38
100059 drina 89 42 1.00 1 1 4 14 33 12 19 5
100060 druggy 89 74 0.10 11 3 8 8 17 21 11 1
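The capitalization column can be used as a rough proper noun filter. The following is a minimal sketch, not a method the data itself prescribes: the 0.9 cutoff and the function name `likely_proper_noun` are assumptions chosen for illustration, and the rows are copied from the sample above.

```python
# Rows copied from the sample table above
# (rank, word, freq, # texts, % caps, then the eight genre counts).
SAMPLE = """\
100034 daymond 89 40 1.00 13 9 0 30 1 16 17 3
100035 deductively 89 68 0.03 4 18 3 0 0 3 1 60
100038 dentzer 89 40 1.00 0 20 0 46 0 20 2 1
100039 denys 89 53 0.94 2 18 1 0 2 28 14 23
100058 drams 89 56 0.51 12 2 4 1 11 16 5 38
"""


def likely_proper_noun(line, threshold=0.9):
    """Return (word, flag); flag is True when the capitalized share meets the cutoff.

    The 0.9 default is an arbitrary illustrative cutoff, not a documented value.
    """
    fields = line.split()
    word, caps = fields[1], float(fields[4])
    return word, caps >= threshold


flags = dict(likely_proper_noun(line) for line in SAMPLE.splitlines() if line.strip())
for word, is_proper in flags.items():
    print(f"{word:<12} {'likely proper noun' if is_proper else 'likely common word'}")
```

A fixed cutoff is a blunt tool: a word such as drams, capitalized about half the time in the sample, falls below it, so borderline cases are better reviewed by hand.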