"Phrases in English" FAQ
- on this site n-grams means sequences of n words as
defined here. In this database, n
can be any number in the range 1-8, i.e. from individual words up to
eight-word phrases. Only words and phrases occurring at least three times in the
BNC are included here. Relatively frequent n-grams are typically familiar
building blocks of English; such recurrent n-grams are also known as lexical
bundles, lexical chains or clusters. Shorthand forms like 1-gram, 2-gram,
3-gram etc. specify the value of n; some prefer
unigram, bigram, trigram etc. In information retrieval and
computational linguistics contexts, the term n-gram more frequently means
"sequence of n characters". Here this sense is dubbed
- sets of phrases (n-grams) which are identical except for one word,
dubbed the "wildword" and represented by the wildcard sign *. For
example, at the * of is a
phrase-frame with variants like at the start of, at the end
of, at the heart of
etc. Phrase-frames are useful tools for discovering phraseological patterns.
Guidelines for choosing n-grams or phrase-frames are given in the
tutorials. Parallel to 3-gram etc. this
site uses 3-frame etc. as shorthand for "phrase-frame of three words",
and p-frame is a handy stand-in for phrase-frame.
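The two definitions above can be illustrated with a minimal Python sketch. It is not the code behind this site: the function names, the toy corpus, and the rule that a frame needs at least two variants are assumptions made for illustration. Only the minimum frequency of 3 and the * wildword notation come from the text.

```python
from collections import Counter, defaultdict

MIN_FREQ = 3  # inclusion threshold stated in this FAQ


def count_ngrams(sentences, n, min_freq=MIN_FREQ):
    """Count every run of n consecutive tokens; keep those seen min_freq+ times."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


def phrase_frames(ngram_counts):
    """Group n-grams that differ only in the word at one position."""
    frames = defaultdict(dict)
    for gram, freq in ngram_counts.items():
        for pos in range(len(gram)):
            # Replace the word at this position with the wildword sign.
            frame = gram[:pos] + ("*",) + gram[pos + 1:]
            frames[frame][gram[pos]] = freq
    # Assumption: a frame is only reported if it has at least two variants.
    return {frame: variants for frame, variants in frames.items() if len(variants) > 1}


# Toy corpus (invented): each sentence is a list of tokens.
toy = [
    "at the end of it".split(),
    "at the start of it".split(),
    "at the heart of it".split(),
] * 3

grams = count_ngrams(toy, 4)
frames = phrase_frames(grams)
print(frames[("at", "the", "*", "of")])
```

Here the 4-frame at the * of collects the variants end, start and heart, each with its own frequency.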
- lexical units as identified by the BNC's
parser with POS tags, including "multiword units".
"Fused forms" are split up into morphemes, each tagged as a separate word
token. Orthographic variants of the same lexeme (database / data-base,
realise / realize) appear as different lexical units. Compound nouns
written with white-space instead of hyphens are separated into their
components, so data base is treated as two lexical units.
- multiword units
- phrases that function grammatically as single words, e.g. conjunction
so that or preposition in spite of, receive a single
POS tag, so they are treated here as single words.
To make this obvious in search results they are displayed with underscores
instead of spaces: so_that, in_spite_of. To search for multiword units
you must enter them in a single query field and use underscores, not spaces.
Since spaces separate alternative words to match in queries, the word-form filter in spite of
matches in OR spite OR of.
Lists of multiword units: BNC site
- fused forms
- multiple morphemes written without space in English such as cannot, he'd,
George's are "de-fused" by the parser into can not, he 'd, George 's.
Different POS tags clarify whether 'd stands for had or would
and whether 's comes from is or has, or else represents a possessive.
Lists of fused forms: BNC
- query conditions which focus the matching dataset by "filtering out"
unwanted items. Filtering can be done by word-forms, POS codes and / or
frequency, and multiple forms can be specified to either include or exclude
from the dataset.
- "Words" in the corpus are tagged with one of 57 "Part
Of Speech" codes consisting of three characters; this
list of POS codes explains and gives examples of how these codes are
applied. The PIE database permits searching for specific combinations of POS
codes specified by either choosing from a list or entering directly; wildcards
can be used to match groups of related codes. Occasionally the code UNC
(unclassified) is overused, for example for the ai of ain't,
which is ambiguous but could be assigned manually to the proper form of BE or
- Why do you only support Internet Explorer?
- In this initial phase the
time required to develop and test for multiple browsers would detract from
building the database and user interface. Webmasters report that over
85% of Website visitors use Internet Explorer (IE), and even more have access to IE
on their machine. When this Website is stable
and fully documented I will strive for cross-browser compatibility.
Incidentally, the compact and capable
Opera 7 browser supports most of the IE features on this
site (and starts displaying the result much sooner than IE),
and most functions also work in Netscape versions 7 and higher.
- Why do I see no change in the results pane after editing the query parameters?
- After changing any of the query parameters, click the "Query"
button or press the "Enter" key to start a new query. (The "Next" button in the results
pane continues fetching subsequent chunks of the dataset from your last query.)
- Why do I only see the page heading in the results pane, but no results?
- Depending on the total number of records that match your specifications you may have to wait up to
5 minutes for results, and your browser may even "time out" while you
are waiting. Queries with no word-form or POS filters and with a low minimum
frequency match the largest datasets and are thus the slowest. Some suggestions
to improve performance are...
- Wait up to five minutes for results display before giving up or clicking the
"Query" button again; launching unnecessary or redundant
queries just slows the server down.
- Narrow your search with word-form and / or POS filters.
- Specify a higher minimum frequency to reduce the dataset size. To
study frequent phrases a cutoff frequency of 1000 (or even 100) gives much
- Choose a larger "chunk" size to minimize total waiting time
-- the additional time required to fetch a larger dataset is
negligible, limited only by connection speed.
- Specify alphabetic sort order if that works for your purposes.
- Use the Opera or Mozilla browser, which display results as the browser
receives them (Internet Explorer waits for all the word data to arrive before
displaying it, while the others build the table incrementally and resize it as
- Why does the "random concordances" function take much
longer to match some phrases than others?
- Ironically, the more frequent a phrase and the words in it are, the longer
it takes to compile a random set of concordances. In addition, this feature
takes advantage of a "fulltext" index, which improves efficiency by excluding
"short" words (< 4 letters) and ones which occur in over half the sentences
(e.g. a, and, the, is, are...). Finding these unindexed words takes
longer than finding more salient words.
To improve speed, queries consisting mainly or
entirely of such short and frequent words are run against a randomized
database: the sentences are in scrambled order, but matching
proceeds in the same order for each new round of concordances.
(Nevertheless, the "Re-Query" link finds any additional matches of
your search text.) This pseudo-random approach usually provides satisfactory
results in a fraction of the time required for a truly random search.
You can tell which query method was used by the codes at
the bottom right of the results page (ft 'fulltext index', rt
'randomized text', followed by the number of seconds the query took).
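The difference between the two approaches can be sketched in Python as follows. This is an illustration under stated assumptions, not the site's implementation: the function names, the fixed seed, the plain substring match and the toy corpus are all hypothetical.

```python
import random


def build_randomized(sentences, seed=0):
    """Scramble the sentence order once, ahead of time (assumed fixed seed)."""
    order = list(sentences)
    random.Random(seed).shuffle(order)
    return order


def first_matches(randomized, phrase, k, start=0):
    """Scan the scrambled sentences in their fixed order and stop after k hits.

    Returns the hits and the position where a follow-up query could resume.
    """
    hits = []
    i = start
    while i < len(randomized) and len(hits) < k:
        if phrase in randomized[i]:  # simplification: plain substring match
            hits.append(randomized[i])
        i += 1
    return hits, i


def truly_random(sentences, phrase, k, rng):
    """Find every match first, then sample; this has to scan the whole corpus."""
    all_hits = [s for s in sentences if phrase in s]
    return rng.sample(all_hits, min(k, len(all_hits)))


# Toy corpus (invented): every second sentence contains the phrase "of the".
corpus = [
    f"sentence {i} talks of the thing" if i % 2 == 0 else f"sentence {i} is unrelated"
    for i in range(100)
]
scrambled = build_randomized(corpus)
first, pos = first_matches(scrambled, "of the", 5)
more, _ = first_matches(scrambled, "of the", 5, start=pos)
sampled = truly_random(corpus, "of the", 5, random.Random(1))
```

For a frequent phrase the scan in first_matches stops after a handful of sentences, whereas truly_random must visit all of them. Repeating the same scan returns the same concordances, and resuming from the returned position yields further matches.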
- Why does the "random concordances" function return some irrelevant
- This feature is much faster and more efficient without matching by POS
code. In addition the fulltext index mentioned in the previous entry appears
to be overly inclusive in its matches. Spurious matches are the cost of
the greatly increased speed. Without such optimization PIE could not offer
this feature: several simultaneous searches really bog the server down.
Please let me know via the e-mail link at the bottom of the page if you would like
the option to match POS codes for this functionality (with a
significant speed penalty).
- Why do results show no matches for a phrase that must be in the BNC?
- This question has many possible answers:
- Is your minimum frequency set too high or your maximum too low? Some
phrases are less frequent than you think, and setting a maximum frequency may
exclude some familiar phrases. (The minimum frequency for inclusion in the
database is 3; there is no maximum.)
- Are you looking for phrases that are too long? Try a smaller value
for n and search for a sub-phrase: 4-, 5- and 6-grams are relatively
- Is your query too specific? Try using some wildcards or wildwords to match
a greater number of word forms.
- If you have specified POS tag filters, are they appropriate for the word
forms you want? Try again with no filters or filters with wildcards. If you checked the "exclude"
box, does it make sense?
- If you are an American, did you use the appropriate British
spelling? Orthographic variants (e.g. -ise / -ize) have not been
normalized. If you wish to query for more than one variant, enter both in the
"word form" filter field, separated by a space (normalise normalize),
or else use a wildcard (normali?e).
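The way the filters discussed in this answer interact can be sketched in Python. The sketch assumes that shell-style wildcards (as in Python's fnmatch module) approximate the site's wildcard syntax; the function names, the record layout and the frequencies are invented for illustration. The POS codes shown are BNC codes.

```python
from fnmatch import fnmatchcase


def matches_any(value, patterns):
    """True if value matches any of the space-separated patterns (an OR match)."""
    return any(fnmatchcase(value, p) for p in patterns.split())


def keep(record, words="", pos="", exclude_words=False, min_freq=3, max_freq=None):
    """Decide whether a (word, pos_code, frequency) record passes all filters."""
    word, tag, freq = record
    if freq < min_freq or (max_freq is not None and freq > max_freq):
        return False
    if pos and not matches_any(tag, pos):
        return False
    if words:
        hit = matches_any(word, words)
        # With "exclude" set, matching words are dropped; otherwise only they are kept.
        if hit == exclude_words:
            return False
    return True


# Toy records (frequencies invented).
records = [
    ("normalise", "VVI", 5),
    ("normalize", "VVI", 4),
    ("house", "NN1", 400),
    ("houses", "NN2", 120),
    ("the", "AT0", 60000),
]

both_spellings = [r[0] for r in records if keep(r, words="normali?e")]
nouns = [r[0] for r in records if keep(r, pos="NN?")]
frequent_not_the = [r[0] for r in records if keep(r, words="the", exclude_words=True, min_freq=100)]
```

The first query matches both spelling variants with one wildcard, the second matches a group of related POS codes, and the third shows how a high minimum frequency combined with an exclusion shrinks the dataset.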
- Why are there no phrase frames matching my query even though I find several
variants in the database?
- Phrase frames are sets of variants which are identical except for one
word, e.g. all but the second word are the same. Do the variants you have
observed really differ only in the (ordinally) same word?
- If you specify word form or POS tag filters, leave at least one word
unspecified. (You may specify -*- to force the "wildword" to
appear in a specific position, but that
is redundant if the other words are specified.) If you need to specify
something for each word, use the "Explore N-Grams" page instead.
- If you have specified POS tag filters, are they appropriate for the word
forms you want? Try again with no filters. If you checked the "exclude"
box, does it make sense?
- Examine and possibly lower all the frequency filters.
- Why can't I filter search results by text-type, i.e. domain, genre and
- Search by text-type will be supported in a future release of the site,
presumably by mid-2004.
- Why don't I see the POS-tag for ___ in the drop-down list?
- These lists are not all-inclusive -- that would limit their usefulness. Rather they offer
a number of "super-categories" as examples of using wildcards and numeric ranges.
Please refer to the
list of POS codes for any word-classes not included in the drop-down box.
- Why can't I save results pages with the "Save Page" or "Save Data" buttons?
- These buttons require the ActiveX file system component and work only with
the Windows version of Internet Explorer 5.x and greater. With this browser your security settings will prevent
saving pages unless you have
enabled ActiveX components to run either automatically or after prompting (in which
case you will be nagged for permission each time). It is potentially unsafe to
allow every site to run any desired components on your
computer. The best solution is to add this site to the browser's "Trusted
Sites" list (Tools menu > Internet Options... > Security tab:
click on the "Trusted sites" icon, then the "Sites" button, and add this site
to the list; uncheck "Require server verification...", then click "OK").
On this site ActiveX is used exclusively to save Web pages. Users with security concerns are encouraged to verify this by inspecting the
- Why can't I find common phrases like of course, in spite of?
- Such "multiword units" are treated by the BNC's CLAWS parser as single words.
Enter them in a single word field and replace the spaces with _
(underscore): of_course. Complete list of multiword
- Why can't I find contractions like don't, they're or
possessives like children's, parents'?
- Such "fused forms" are treated by the BNC's CLAWS parser as separate words.
Enter each part in a separate word field: do n't, they 're,
children 's, parents ' . Note that "altered" forms like won't, ain't
are segmented as wo n't, ai n't; the exception can't is
segmented can n't, parallel to cannot > can not. Complete list of fused forms.
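The last two answers can be summed up in a small Python helper that turns a phrase into the values to type into the word fields. It is only a sketch: the function names are hypothetical, the multiword list is a three-item sample taken from this FAQ rather than the complete list, and the splitting rules cover only the forms mentioned above.

```python
import re

# Tiny sample taken from this FAQ; the full list of multiword units is much longer.
MULTIWORD_UNITS = {"of course", "in spite of", "so that"}

# Special cases as described in the FAQ text above.
IRREGULAR = {"cannot": ["can", "not"], "can't": ["can", "n't"]}


def split_fused(token):
    """Split one written word into the parts the parser treats as separate words."""
    if token in IRREGULAR:
        return list(IRREGULAR[token])
    if token.endswith("n't") and len(token) > 3:
        return [token[:-3], "n't"]  # don't > do n't, won't > wo n't
    clitic = re.fullmatch(r"(.+)('(?:s|d|re))", token)
    if clitic:
        return [clitic.group(1), clitic.group(2)]  # they're > they 're
    if token.endswith("'") and len(token) > 1:
        return [token[:-1], "'"]  # parents' > parents '
    return [token]


def query_fields(phrase):
    """Return one string per word field for the given phrase."""
    if phrase in MULTIWORD_UNITS:
        return [phrase.replace(" ", "_")]  # a single field, with underscores
    fields = []
    for token in phrase.split():
        fields.extend(split_fused(token))
    return fields


print(query_fields("in spite of"))
print(query_fields("children's"))
print(query_fields("they won't"))
```

A multiword unit comes back as a single underscore-joined field, while each fused form expands into one field per part.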