Most of the information at this website deals with data from the COCA corpus. You might also be interested in the word frequency data from the 14 billion word iWeb corpus.
This site contains what is probably the most accurate word frequency data for English. The data is based on the one
billion word Corpus of Contemporary American English (COCA)
-- the only corpus of English that is large, up-to-date, and
balanced between many genres.
When you purchase the data, you have access to four different datasets, and you can
use whichever ones are the most useful for you. Short samples of each dataset are given
below. You can also see much more complete samples (every tenth entry), as well as
free copies of the top 5,000 entries for each list.
1
The most basic data shows the frequency of each of the top 60,000 words (lemmas)
in each of the eight main genres in the corpus. Unlike word frequency data that
is just based on web pages, the COCA data lets you see the frequency across genres, to know if the
word is more informal (e.g. blogs or TV and movie subtitles) or more formal
(e.g. academic). The following are just a few entries of words at different
frequency levels (rank), 1-60,000.
rank | lemma | PoS | freq | # texts | disp | BLOG | WEB | TV/M | SPOK | FIC | MAG | NEWS | ACAD
614 | describe | v | 159521 | 81551 | 0.94 | 17718 | 25573 | 4270 | 15796 | 7866 | 20583 | 18065 | 49650
615 | guess | v | 159454 | 74761 | 0.96 | 21706 | 15355 | 57719 | 28378 | 23413 | 5878 | 5453 | 1552
616 | choice | n | 159277 | 82417 | 0.98 | 28879 | 23742 | 17114 | 16776 | 10416 | 21726 | 17835 | 22789
617 | source | n | 158588 | 74656 | 0.95 | 23426 | 28366 | 5171 | 11870 | 4764 | 26220 | 18954 | 39817
618 | mom | n | 158511 | 44697 | 0.95 | 12884 | 10313 | 66877 | 19934 | 25394 | 13766 | 8432 | 911
619 | soon | r | 158194 | 95115 | 0.98 | 18711 | 19451 | 26696 | 14647 | 29098 | 22532 | 17933 | 9126
620 | director | n | 158028 | 79105 | 0.94 | 14248 | 18521 | 5554 | 19383 | 5197 | 28196 | 51981 | 14948
15024 | redhead | n | 1766 | 1209 | 0.90 | 96 | 101 | 432 | 86 | 761 | 167 | 95 | 28
15025 | despair | v | 1766 | 1637 | 0.95 | 250 | 290 | 127 | 104 | 330 | 343 | 172 | 150
15026 | pretentious | j | 1766 | 1420 | 0.94 | 293 | 384 | 261 | 93 | 252 | 203 | 185 | 95
15027 | disservice | n | 1765 | 1580 | 0.94 | 437 | 292 | 53 | 322 | 44 | 206 | 276 | 135
15028 | childlike | j | 1765 | 1510 | 0.94 | 209 | 232 | 94 | 106 | 498 | 239 | 214 | 172
15029 | complicit | j | 1765 | 1450 | 0.93 | 510 | 330 | 83 | 230 | 99 | 164 | 131 | 218
15030 | macaroni | n | 1765 | 1170 | 0.92 | 84 | 101 | 317 | 125 | 315 | 393 | 412 | 18
rank | lemma | PoS | freq | # texts | disp | BLOG | WEB | TV/M | SPOK | FIC | MAG | NEWS | ACAD
30005 | glutamate | n | 372 | 159 | 0.77 | 40 | 67 | 11 | 19 | 0 | 101 | 8 | 126
30006 | twisty | j | 372 | 341 | 0.89 | 50 | 42 | 26 | 6 | 100 | 104 | 36 | 8
30007 | lyricism | n | 372 | 301 | 0.87 | 35 | 49 | 5 | 19 | 23 | 67 | 67 | 107
30008 | peppery | j | 372 | 331 | 0.86 | 17 | 16 | 5 | 15 | 68 | 114 | 134 | 3
30009 | firebird | n | 372 | 69 | 0.15 | 15 | 7 | 21 | 1 | 305 | 8 | 11 | 4
30010 | wuss | n | 372 | 323 | 0.90 | 55 | 36 | 188 | 21 | 35 | 28 | 8 | 1
30011 | strafe | v | 372 | 319 | 0.89 | 24 | 53 | 32 | 24 | 79 | 83 | 62 | 15
45003 | thawing | n | 115 | 102 | 0.81 | 5 | 11 | 2 | 10 | 18 | 26 | 21 | 22
45004 | sugarless | j | 115 | 97 | 0.82 | 12 | 7 | 20 | 1 | 21 | 38 | 12 | 4
45005 | fold-up | j | 115 | 107 | 0.83 | 5 | 7 | 6 | 5 | 41 | 29 | 20 | 2
45006 | energizing | j | 115 | 112 | 0.84 | 14 | 14 | 2 | 10 | 3 | 44 | 12 | 16
45007 | endoplasmic | j | 115 | 65 | 0.64 | 5 | 36 | 4 | 0 | 0 | 14 | 0 | 56
45008 | ejector | n | 115 | 93 | 0.80 | 10 | 9 | 27 | 3 | 8 | 41 | 7 | 10
45009 | saliency | n | 115 | 76 | 0.69 | 3 | 9 | 0 | 6 | 0 | 1 | 1 | 95
rank | lemma | PoS | freq | # texts | disp | BLOG | WEB | TV/M | SPOK | FIC | MAG | NEWS | ACAD
60026 | exudative | j | 45 | 16 | 0.25 | 1 | 8 | 2 | 0 | 1 | 5 | 0 | 28
60027 | shakti | n | 45 | 21 | 0.44 | 25 | 1 | 0 | 0 | 0 | 10 | 3 | 6
60028 | shearling | j | 45 | 41 | 0.73 | 2 | 1 | 1 | 3 | 15 | 18 | 4 | 1
60029 | sheerly | r | 45 | 45 | 0.77 | 3 | 8 | 1 | 6 | 9 | 8 | 3 | 7
60030 | short-selling | n | 45 | 37 | 0.67 | 4 | 11 | 1 | 1 | 0 | 10 | 14 | 4
60031 | phytic | j | 45 | 19 | 0.48 | 10 | 16 | 0 | 0 | 0 | 4 | 0 | 15
60032 | piedmontese | j | 45 | 31 | 0.68 | 2 | 13 | 1 | 0 | 5 | 6 | 10 | 8
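As a rough illustration of how the genre columns can be read, the following Python sketch computes each genre's share of a lemma's total frequency and reports the genre where the lemma is most common. It assumes each entry is available as one pipe-delimited line in the column order shown in the samples; the actual file format of the purchased data is not described here, and the function names (parse_row, genre_shares, top_genre) are hypothetical. Raw shares are only a fair comparison to the extent that the genres are of similar size.

```python
# Genre columns in the order used by the sample table.
GENRES = ["BLOG", "WEB", "TV/M", "SPOK", "FIC", "MAG", "NEWS", "ACAD"]

# Two entries copied from the sample:
# rank | lemma | PoS | freq | # texts | disp | eight genre counts
SAMPLE_ROWS = [
    "614 | describe | v | 159521 | 81551 | 0.94 | 17718 | 25573 | 4270 | 15796 | 7866 | 20583 | 18065 | 49650",
    "618 | mom | n | 158511 | 44697 | 0.95 | 12884 | 10313 | 66877 | 19934 | 25394 | 13766 | 8432 | 911",
]


def parse_row(line):
    """Split one pipe-delimited entry into a dict with typed fields."""
    cells = [cell.strip() for cell in line.split("|")]
    return {
        "rank": int(cells[0]),
        "lemma": cells[1],
        "pos": cells[2],
        "freq": int(cells[3]),
        "texts": int(cells[4]),
        "disp": float(cells[5]),
        "genres": dict(zip(GENRES, map(int, cells[6:14]))),
    }


def genre_shares(row):
    """Return each genre's share of the lemma's summed genre counts."""
    total = sum(row["genres"].values())
    return {genre: count / total for genre, count in row["genres"].items()}


def top_genre(row):
    """Return the genre with the highest raw count for this lemma."""
    return max(row["genres"], key=row["genres"].get)


rows = [parse_row(line) for line in SAMPLE_ROWS]
for row in rows:
    shares = genre_shares(row)
    best = top_genre(row)
    print(f"{row['lemma']}: most common in {best} ({shares[best]:.0%} of occurrences)")
```

On these two entries, describe leans toward ACAD (more formal) and mom toward TV/M (more informal), which is the kind of contrast the genre columns are meant to expose.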
2
Another dataset shows the frequency not only in the
eight main genres, but also in nearly 100 "sub-genres" (e.g. Magazine-Sports,
Newspaper-Finance, Academic-Medical, Web-Reviews, Blogs-Personal, and
TV-Comedies).
3
A third dataset shows the frequency of the word forms of the
top 60,000 lemmas:
lemmaRank | lemma | PoS | lemFreq | wordFreq | word form
13164 | rehabilitate | v | 2286 | 1033 | rehabilitate
13164 | rehabilitate | v | 2286 | 749 | rehabilitated
13164 | rehabilitate | v | 2286 | 452 | rehabilitating
13164 | rehabilitate | v | 2286 | 52 | rehabilitates
13165 | subprime | j | 2286 | 2079 | subprime
13165 | subprime | j | 2286 | 207 | sub-prime
13166 | headline | v | 2285 | 999 | headline
13166 | headline | v | 2285 | 943 | headlined
13166 | headline | v | 2285 | 343 | headlining
13167 | blue-collar | j | 2285 | 2262 | blue-collar
13167 | blue-collar | j | 2285 | 23 | bluecollar
13168 | deduce | v | 2285 | 1088 | deduce
13168 | deduce | v | 2285 | 965 | deduced
13168 | deduce | v | 2285 | 136 | deducing
13168 | deduce | v | 2285 | 96 | deduces
lemmaRank | lemma | PoS | lemFreq | wordFreq | word form
13169 | oats | n | 2284 | 2284 | oats
13170 | stand-up | j | 2284 | 2284 | stand-up
13171 | squeak | v | 2283 | 894 | squeaked
13171 | squeak | v | 2283 | 593 | squeaking
13171 | squeak | v | 2283 | 583 | squeak
13171 | squeak | v | 2283 | 213 | squeaks
13172 | naming | n | 2283 | 2254 | naming
13172 | naming | n | 2283 | 29 | namings
13173 | toad | n | 2283 | 1483 | toad
13173 | toad | n | 2283 | 800 | toads
13174 | clockwise | r | 2282 | 2282 | clockwise
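In the sample entries, the wordFreq values for a lemma add up to its lemFreq (for rehabilitate, 1033 + 749 + 452 + 52 = 2286). The Python sketch below groups word forms by lemma and checks that relationship, and also picks out the most frequent form of each lemma. The tuple layout mirrors the sample columns, and the function names are hypothetical; it is an illustration on a few sample rows, not a description of how the dataset was built.

```python
from collections import defaultdict

# Entries copied from the sample:
# (lemmaRank, lemma, PoS, lemFreq, wordFreq, word form)
SAMPLE = [
    (13164, "rehabilitate", "v", 2286, 1033, "rehabilitate"),
    (13164, "rehabilitate", "v", 2286, 749, "rehabilitated"),
    (13164, "rehabilitate", "v", 2286, 452, "rehabilitating"),
    (13164, "rehabilitate", "v", 2286, 52, "rehabilitates"),
    (13165, "subprime", "j", 2286, 2079, "subprime"),
    (13165, "subprime", "j", 2286, 207, "sub-prime"),
    (13171, "squeak", "v", 2283, 894, "squeaked"),
    (13171, "squeak", "v", 2283, 593, "squeaking"),
    (13171, "squeak", "v", 2283, 583, "squeak"),
    (13171, "squeak", "v", 2283, 213, "squeaks"),
]


def forms_by_lemma(entries):
    """Group (word form, wordFreq) pairs under their (lemma, PoS, lemFreq) key."""
    grouped = defaultdict(list)
    for _rank, lemma, pos, lem_freq, word_freq, form in entries:
        grouped[(lemma, pos, lem_freq)].append((form, word_freq))
    return grouped


def mismatched_lemmas(entries):
    """Return lemmas whose word-form counts do not sum to lemFreq."""
    return [
        lemma
        for (lemma, _pos, lem_freq), forms in forms_by_lemma(entries).items()
        if sum(freq for _form, freq in forms) != lem_freq
    ]


def most_frequent_form(entries):
    """Map each lemma to its single most frequent word form."""
    return {
        lemma: max(forms, key=lambda pair: pair[1])[0]
        for (lemma, _pos, _lem_freq), forms in forms_by_lemma(entries).items()
    }


print(mismatched_lemmas(SAMPLE))
print(most_frequent_form(SAMPLE))
```

Note that the most frequent form is not always the base form: in the sample, squeaked (894) outranks squeak (583).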
4
A final dataset shows the top 219,000 words in the billion word corpus: each word that occurs at least 20 times
and in at least 5 different texts. In this list, the words are not lemmatized (i.e. each
form of a word is listed separately from other forms) and the words are not
tagged for part of speech. For each word, it shows in which genres it is the
most common (again, to show +/- formal) and what percent of its occurrences are capitalized
(useful for determining +/- proper noun; see daymond and dentzer
below).
word rank | word | freq | # texts | % caps | BLOG | WEB | TV/M | SPOK | FIC | MAG | NEWS | ACAD
100033 | datatypes | 89 | 20 | 0.18 | 8 | 74 | 0 | 0 | 0 | 0 | 0 | 7
100034 | daymond | 89 | 40 | 1.00 | 13 | 9 | 0 | 30 | 1 | 16 | 17 | 3
100035 | deductively | 89 | 68 | 0.03 | 4 | 18 | 3 | 0 | 0 | 3 | 1 | 60
100036 | delp | 89 | 25 | 1.00 | 2 | 5 | 2 | 0 | 52 | 10 | 12 | 6
100037 | demoed | 89 | 81 | 0.02 | 24 | 18 | 7 | 4 | 2 | 33 | 0 | 0
100038 | dentzer | 89 | 40 | 1.00 | 0 | 20 | 0 | 46 | 0 | 20 | 2 | 1
100039 | denys | 89 | 53 | 0.94 | 2 | 18 | 1 | 0 | 2 | 28 | 14 | 23
100040 | despatch | 89 | 50 | 0.17 | 6 | 38 | 6 | 0 | 17 | 12 | 1 | 9
100041 | digged | 89 | 33 | 0.04 | 6 | 69 | 6 | 1 | 2 | 1 | 1 | 0
100042 | diigo | 89 | 32 | 0.98 | 10 | 12 | 0 | 0 | 0 | 0 | 0 | 67
100043 | dilator | 89 | 25 | 0.02 | 0 | 2 | 5 | 1 | 6 | 5 | 0 | 70
100044 | dimitroff | 89 | 43 | 1.00 | 4 | 4 | 0 | 0 | 0 | 23 | 54 | 4
100045 | disasterous | 89 | 86 | 0.01 | 41 | 42 | 1 | 0 | 1 | 0 | 3 | 1
100046 | disgracefully | 89 | 83 | 0.12 | 18 | 11 | 7 | 10 | 11 | 18 | 5 | 8
word rank | word | freq | # texts | % caps | BLOG | WEB | TV/M | SPOK | FIC | MAG | NEWS | ACAD
100047 | dispersant | 89 | 43 | 0.07 | 9 | 25 | 8 | 26 | 0 | 6 | 3 | 12
100048 | do-overs | 89 | 72 | 0.11 | 18 | 11 | 11 | 15 | 2 | 10 | 12 | 4
100049 | docter | 89 | 48 | 0.88 | 7 | 22 | 12 | 12 | 0 | 9 | 27 | 0
100050 | dollarization | 89 | 20 | 0.10 | 2 | 52 | 0 | 2 | 0 | 3 | 10 | 20
100051 | doohickey | 89 | 72 | 0.11 | 9 | 8 | 53 | 0 | 9 | 7 | 2 | 0
100052 | doozies | 89 | 80 | 0.01 | 21 | 9 | 13 | 10 | 9 | 14 | 7 | 1
100053 | dorrie | 89 | 26 | 1.00 | 1 | 2 | 7 | 2 | 64 | 2 | 11 | 0
100054 | dort | 89 | 39 | 0.82 | 1 | 6 | 20 | 7 | 5 | 2 | 16 | 29
100055 | doster | 89 | 41 | 1.00 | 1 | 7 | 0 | 0 | 0 | 6 | 62 | 13
100056 | dowler | 89 | 40 | 1.00 | 4 | 14 | 0 | 13 | 6 | 17 | 20 | 15
100057 | drainer | 89 | 71 | 0.12 | 3 | 11 | 14 | 4 | 40 | 9 | 3 | 5
100058 | drams | 89 | 56 | 0.51 | 12 | 2 | 4 | 1 | 11 | 16 | 5 | 38
100059 | drina | 89 | 42 | 1.00 | 1 | 1 | 4 | 14 | 33 | 12 | 19 | 5
100060 | druggy | 89 | 74 | 0.10 | 11 | 3 | 8 | 8 | 17 | 21 | 11 | 1
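One way to use the capitalization column is to flag words that are almost always capitalized as likely proper nouns. The Python sketch below does this on a few sample entries. The sample values in the "% caps" column run from 0.01 to 1.00, so the sketch treats them as proportions; the 0.90 cutoff and the function name are assumptions chosen for illustration, not part of the dataset.

```python
# Entries copied from the sample: (word, freq, # texts, caps)
# caps is read as a proportion between 0 and 1, as in the sample values.
SAMPLE = [
    ("datatypes", 89, 20, 0.18),
    ("daymond", 89, 40, 1.00),
    ("deductively", 89, 68, 0.03),
    ("dentzer", 89, 40, 1.00),
    ("denys", 89, 53, 0.94),
    ("drams", 89, 56, 0.51),
]

# Hypothetical cutoff: words capitalized at least 90% of the time.
CAPS_THRESHOLD = 0.90


def likely_proper_nouns(entries, threshold=CAPS_THRESHOLD):
    """Return words whose capitalized share meets or exceeds the threshold."""
    return [word for word, _freq, _texts, caps in entries if caps >= threshold]


print(likely_proper_nouns(SAMPLE))
```

A lower cutoff would also pick up borderline cases such as drams (0.51), so the threshold is worth tuning against a hand-checked sample.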