Corpus Query Languages

Typing your own queries gives you greater control over what you search for!!

This section covers the two query syntaxes mainly used by today's three major online corpus architectures (Corpus Workbench, BNCweb/CQPweb, Word Sketch Engine).

Corpus Query Language

(ref. Hoffmann et al. (2008))

  • Query languages are used to perform linguistic searches in a corpus.

  • This is particularly important when we want to extract lexico-grammatical patterns.

  • There are two formats that can be used to enter queries in BNCweb: the Simple Query Syntax (SQS) and the Corpus Query Syntax (CQP). CQP syntax is more verbose than SQS.

Simple Query Syntax

  • Easy to search for a particular word form or phrase in the entire corpus.

  • Be aware of the meta-characters (wildcards), which have a special function in the query language (see the overview below). To use them literally, escape them with a preceding backslash (\). E.g., \? will match a literal ?.

  • Note: words and punctuation symbols are treated as separate tokens; searches are case-insensitive by default.

  • Contracted forms are split.

he'll                   he 'll
ain't                   ai n't
they've                 they 've
what's up, Sam?         what 's up \, Sam \?
  • To find patterns, use wildcard expressions.

First type (Simple query wildcards)

In SQS, a wildcard is a punctuation symbol with a special function. For example:

?  stands for a single, arbitrary character
    e.g., s?ng will find 'sing, sang, sung', etc

*  stands for 0 or more arbitrary characters
    e.g., sing* will find 'sing, sings, singer, single', etc

+  stands for 1 or more arbitrary characters
    e.g., sing+ will find 'sings, singer, single' but not 'sing'.

You can also combine multiple wildcards, such as *oo+oo*, to find 'Voodoo, schoolroom', etc.

Wildcards can be used freely among the items of a phrase query, but they only apply to single word tokens and do not match across multiple tokens.

Hence, black*white matches 'black-and-white', but not 'black and white'.

Second type: Separate any number of alternatives with commas, and enclose them in square brackets. E.g.,

 hum[our,or]
 humo[u,]r
 humo[ur,r]

Each of these finds both the British English and the American English spelling. Similarly, ??+[able,ability] will match 'capable, capability, availability', etc.
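The two wildcard types above can be pictured as a translation into ordinary regular expressions. The following is a minimal Python sketch of that idea, not BNCweb's actual implementation; the function names `sqs_to_regex` and `sqs_match` are hypothetical, and the sketch assumes that a pattern must match a whole token and that matching is case-insensitive.

```python
import re

def sqs_to_regex(pattern: str) -> str:
    """Translate one SQS word pattern into an equivalent regex string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Backslash escape: the next character is taken literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "?":
            out.append(".")       # exactly one arbitrary character
        elif ch == "*":
            out.append(".*")      # zero or more arbitrary characters
        elif ch == "+":
            out.append(".+")      # one or more arbitrary characters
        elif ch == "[":
            # [a,b,...] lists alternatives; an empty alternative is allowed.
            end = pattern.index("]", i)
            alternatives = pattern[i + 1:end].split(",")
            out.append("(?:" + "|".join(sqs_to_regex(a) for a in alternatives) + ")")
            i = end
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)

def sqs_match(pattern: str, token: str) -> bool:
    """True if the whole token matches the pattern (case-insensitive)."""
    return re.fullmatch(sqs_to_regex(pattern), token, re.IGNORECASE) is not None

# Wildcards apply to single tokens only, so the three-token sequence
# 'black and white' contains no token that matches black*white.
print(sqs_match("black*white", "black-and-white"))
print(any(sqs_match("black*white", t) for t in ["black", "and", "white"]))
```

Because the pattern is matched against one token at a time, the hyphenated form matches while none of the three separate tokens does.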

Matching POS

  • Search for a word form with a specific POS tag by linking them with an underscore _. Wildcards can be applied to both word form and POS tag.

lights_NN2      -> plural noun 'lights', but not the verb form 'lights'
*ly_AJ0         -> adjectives ending in -ly (e.g., 'daily')
super+_V*       -> verb forms starting with 'super' (e.g., 'supervised')
  • POS can be searched alone.

_PNX            -> any reflexive pronoun
  • Search for simplified POS tags with curly braces.

super+_{V}   -> verb forms starting with 'super'

Simplified tags:

A, ADJ                  adjective 
N, SUBST                noun
V, VERB                 verb
ADV                     adverb
ART                     article 
CONJ                    conjunction
INT, INTERJ             interjection
PREP                    preposition
PRON                    pronoun
$, STOP                 punctuation
UNC                     other / uncertain
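To make the word_POS notation concrete, here is a small Python sketch that checks a single token against a query item. It is only an illustration under stated assumptions: the function names, the alias table and the sample tokens are hypothetical, each token is given as a (word, POS tag, simplified tag) triple, and the simplified tag is stored in its long form (e.g., VERB).

```python
import re

# Short aliases for simplified tags, as listed in the table above.
ALIASES = {"A": "ADJ", "N": "SUBST", "V": "VERB", "INT": "INTERJ", "$": "STOP"}

def wildcard_to_regex(pattern: str) -> str:
    """Translate the ?, * and + wildcards; everything else is literal."""
    table = {"?": ".", "*": ".*", "+": ".+"}
    return "".join(table.get(ch, re.escape(ch)) for ch in pattern)

def matches(query: str, word: str, pos: str, word_class: str) -> bool:
    """Check one token against an item of the form word_POS or word_{CLASS}."""
    word_part, _, tag_part = query.partition("_")
    if word_part and not re.fullmatch(wildcard_to_regex(word_part), word, re.IGNORECASE):
        return False
    if tag_part.startswith("{") and tag_part.endswith("}"):
        wanted = tag_part[1:-1]
        return ALIASES.get(wanted, wanted) == word_class
    if tag_part:
        return re.fullmatch(wildcard_to_regex(tag_part), pos) is not None
    return True

# Hypothetical tokens: (word, POS tag, simplified tag).
tokens = [
    ("lights", "NN2", "SUBST"),
    ("lights", "VVZ", "VERB"),
    ("daily", "AJ0", "ADJ"),
    ("supervised", "VVD", "VERB"),
    ("himself", "PNX", "PRON"),
]
for query in ["lights_NN2", "*ly_AJ0", "super+_V*", "_PNX", "super+_{V}"]:
    print(query, [t for t in tokens if matches(query, *t)])
```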

Headword and lemma queries

  • A HEADWORD is a set of wordforms consisting of a basic uninflected form and its inflectional variants. E.g., the headword WRITE represents the wordforms write, writes, wrote, writing and written.

  • Headwords do not distinguish between different word classes. Thus the headword PLAY covers both the wordforms of the verb (i.e., play, plays, played) and of the noun (i.e., play and plays).

  • In BNCweb, a lemma is defined as the combination of the HEADWORD and the SIMPLIFIED TAG for a given word. So the lemma play_V represents all the wordforms of the verb 'to play'. To search, use

{show}          -> show, shows, showed, shown, showing
{play/V}
{play/N}
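The difference between a headword query and a lemma query can be sketched in a few lines of Python. The mini-lexicon and the function `lemma_query` below are hypothetical and only illustrate the definitions above: {play} filters on the headword alone, while {play/V} additionally filters on the simplified tag.

```python
# Short aliases for simplified tags (see the simplified tag list).
ALIASES = {"A": "ADJ", "N": "SUBST", "V": "VERB"}

# Hypothetical annotated tokens: (word form, headword, simplified tag).
TOKENS = [
    ("play", "play", "SUBST"),
    ("plays", "play", "VERB"),
    ("played", "play", "VERB"),
    ("plays", "play", "SUBST"),
    ("shown", "show", "VERB"),
    ("player", "player", "SUBST"),
]

def lemma_query(query, tokens):
    """Return the word forms matching {headword} or {headword/CLASS}."""
    inner = query.strip("{}")
    headword, _, word_class = inner.partition("/")
    word_class = ALIASES.get(word_class, word_class)
    return [
        word
        for word, hw, cls in tokens
        if hw == headword and (not word_class or cls == word_class)
    ]

print(lemma_query("{play}", TOKENS))    # verb and noun forms alike
print(lemma_query("{play/V}", TOKENS))  # verb forms only
print(lemma_query("{play/N}", TOKENS))  # noun forms only
```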

Querying Word Sequences

  • Queries can consist of multiple words, e.g. 'talk of the town'.

  • All tokens (i.e., words and punctuation symbols) are separated by blanks; again possessives (Peter's) and contracted forms (they've, gonna) must be split:

he will \, wo n't he \?     ->  he will, won't he?
  • Each query item in a sequence can make full use of wildcards, part-of-speech constraints, and headword or lemma searches:

{number/N} of _{A} _NN2     -> numbers of younger men, ...
  • Use + to skip an arbitrary token, or * for an optional token. Combine + and * for larger gaps, e.g. +++** to skip between 3 and 5 tokens.

{eat} * up     -> eat up, ate up, eat it up, eaten all up... 
{eat} + up     -> eat it up, eaten all up,...(but not eat up, ate up)
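The gap notation can be read as a range of token counts: each + skips exactly one token and each * skips zero or one. The Python sketch below illustrates this reading with two hypothetical helper functions; for simplicity it compares plain word forms rather than headwords.

```python
def gap_range(gap: str) -> tuple[int, int]:
    """Return the (minimum, maximum) number of tokens skipped by a run of + and *."""
    if set(gap) - {"+", "*"}:
        raise ValueError("a gap may only contain + and *")
    required = gap.count("+")  # each + skips exactly one token
    optional = gap.count("*")  # each * skips zero or one token
    return required, required + optional

def find_with_gap(tokens, first, gap, last):
    """Return (start, end) index pairs where `first` and `last` are separated by an allowed gap."""
    lo, hi = gap_range(gap)
    hits = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        for skipped in range(lo, hi + 1):
            j = i + 1 + skipped
            if j < len(tokens) and tokens[j] == last:
                hits.append((i, j))
    return hits

print(gap_range("+++**"))
print(find_with_gap("eat it up".split(), "eat", "*", "up"))
print(find_with_gap("eat up".split(), "eat", "+", "up"))
```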

Advanced lexico-grammatical patterns

  • Use regular expression notation (see below) for alternatives, optional elements and repetition within a sequence:

(_{A})?             zero or one adjective
(_{A})*             zero or more adjectives
(_{A})+             one or more adjectives
(_{A}){2,4}         between two and four adjectives
(...|...|...)       matches one of the alternatives indicated by ... 
(...|...|...)*      alternatives with repetition (optional)
(...|...|...)+      alternatives with repetition (non-optional)
(...|...|...){2,4}  between two and four repetitions of the given alternatives (may be mixed in any order)
  • Regular expression notation can be nested to match complex patterns:

the (most _AJ0 | _AJS) {man}

will find 'the biggest men', 'the most attractive man', etc.

  • Complex syntactic patterns can be formed, e.g. for a prepositional phrase:

_{PREP} (_{ART})? ((_{ADV})? _{A})* _{N}

will find "a preposition; followed by an optional article; followed by any number of adjectives (zero or more), each of which may optionally be preceded by an adverb; followed by a noun"
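One way to picture regular expressions over tokens is to encode each token's tag as a bracketed label and run an ordinary regex over the resulting string. The Python sketch below does this for the prepositional phrase pattern; it is an illustration only (not how BNCweb evaluates queries), and the tagged sentence and the function `find_patterns` are hypothetical.

```python
import re

# Hypothetical tagged sentence: (word, simplified tag).
SENTENCE = [
    ("she", "PRON"), ("sat", "VERB"), ("in", "PREP"), ("a", "ART"),
    ("very", "ADV"), ("old", "ADJ"), ("wooden", "ADJ"), ("chair", "SUBST"),
]

# _{PREP} (_{ART})? ((_{ADV})? _{A})* _{N}, with one <TAG> label per token.
PP = re.compile(r"<PREP>(<ART>)?((<ADV>)?<ADJ>)*<SUBST>")

def find_patterns(tagged, pattern):
    """Return the word sequences whose tag sequence matches `pattern`."""
    labels = [f"<{tag}>" for _, tag in tagged]
    encoded = "".join(labels)
    # Map the character offset where each label starts to its token index.
    starts = {}
    offset = 0
    for index, label in enumerate(labels):
        starts[offset] = index
        offset += len(label)
    starts[offset] = len(labels)
    return [
        " ".join(word for word, _ in tagged[starts[m.start()]:starts[m.end()]])
        for m in pattern.finditer(encoded)
    ]

print(find_patterns(SENTENCE, PP))
```

Since every label begins with < and ends with >, a match can only start and end on token boundaries, which is what makes the offset lookup safe.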

XML tags

Proximity queries

  • Special syntax for searching one item within a specified range of another:

kick <<s>> bucket -> kick and bucket in the same sentence 
{kick/V} <<s>> bucket_NN1 (can use POS/lemma constraints) 

day <<3>> night   -> day and night within range of 3 tokens 
day <<5<< night   -> night ... day (within 5 tokens)
day >>5>> night   -> day ... night (within 5 tokens)
  • Only the left element ("target") will be highlighted on the result page. The right element is considered as a "constraint" that must be satisfied.

  • Multiple constraints can be chained:

{day} <<5>> {month} <<5>> {year}

In this case, day must co-occur with month as well as year in a 5-token window; only day will be highlighted on the Query result page.

  • Proximity queries can be nested with parentheses:

{waste/V} <<s>> (time <<3>> money)

Here the verb waste must co-occur with time as well as money in the same sentence, but time and money must be closer together (within a 3-token window). Again, only instances of waste will be highlighted.

  • Proximity queries cannot be combined with lexico-grammatical patterns!
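The logic of a numeric proximity query can be sketched as follows. This is a minimal Python illustration, not BNCweb's implementation: the function `proximity` and its `direction` argument are hypothetical, and the sketch assumes that "within n tokens" means the two positions differ by at most n.

```python
def proximity(tokens, target, constraint, window, direction="both"):
    """Return the indices of `target` that have `constraint` within `window` tokens.

    direction: "both"   corresponds to  target <<n>> constraint
               "after"  corresponds to  target >>n>> constraint (target ... constraint)
               "before" corresponds to  target <<n<< constraint (constraint ... target)
    """
    hits = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        for j, other in enumerate(tokens):
            distance = j - i
            if other != constraint or distance == 0 or abs(distance) > window:
                continue
            if direction == "after" and distance < 0:
                continue
            if direction == "before" and distance > 0:
                continue
            hits.append(i)  # only the target position is reported
            break
    return hits

tokens = "day and night and day".split()
print(proximity(tokens, "day", "night", 3))            # both directions
print(proximity(tokens, "day", "night", 3, "after"))   # day ... night
print(proximity(tokens, "day", "night", 3, "before"))  # night ... day
```

Only the target index is returned, mirroring the fact that only the left element is highlighted; chained constraints would correspond to intersecting the index lists of several such calls.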

Exercise (using BNCWeb)

  1. To boldly split. Traditional prescriptive grammars advise against the use of split infinitives such as the famous to boldly go. Use BNCweb to find out how far actual usage in Present-day English conforms to this prescription.

    • Write a query that matches split infinitives, consulting the BNC tagset to find the appropriate POS tags. How many split infinitives can you find in the BNC?

    • Compare this result to the number of prescriptively correct infinitives (boldly to go or to go boldly). Why can't you just search for the pattern to <infinitive> as a point of comparison?

    • Are split infinitives used more often in spoken than in written English?

    • Can you extend your queries to also find (split) infinitives with complex adverbs, such as to at least consider and to sort of say?

Corpus Query Syntax

The Corpus Query Syntax (also known as CQP query syntax) was developed at the IMS, University of Stuttgart, in the early 1990s. The version of CQP used in Word Sketch Engine is an extension of the original language.

》A more powerful query syntax lets you ask deeper research questions!

  • Instead of wildcards, CQP makes use of Regular Expressions to search for

    • generalizations (e.g., 'all words that begin with super-')

    • patterns (e.g., 'all words that fit the pattern imp_ss_ble')

    • varieties of lexico-grammatical patterns.

Regular Expressions

  • Regular expressions (regex) are a compact notation for describing repetition, optionality and alternatives in sequences of characters (word forms and annotations) or sequences of tokens (e.g., lexico-grammatical patterns).

  • They are widely used in computational linguistics for searching and analyzing textual data. They are harder to learn than the Simple Query Syntax, but more powerful.

*****************
Character classes
*****************
.            any character except newline
\w \d \s     word, digit, whitespace
\W \D \S     not word, digit, whitespace
[abc]        any of a, b, or c
[^abc]       not a, b, or c
[a-g]        character between a & g

*****************
Anchors
*****************
^abc         the beginning of the string.
abc$         the end of the string
\b           word boundary

*****************
Escaped characters
*****************
\. \* \\      escaped special characters
\t \n \r      tab, linefeed, carriage return

*****************
Groups and Lookaround
*****************
(abc)         capture group
\1            backreference to group #1
(?:abc)       non-capturing group
(?=abc)       positive lookahead
(?!abc)       negative lookahead

*****************
Quantifiers and Alternation
*****************

a*            0 or more
a+            1 or more    
a?            0 or 1
a{2}          exactly two
a{2,}         two or more
a{1,5}        between one and five
a+? a{2,}?    match as few as possible
ab|cd         match ab or cd

For example, the query

".{3,}(ness(es)?|it(y|ies)|(tion|ment)s?)"

finds the nominalizations ending in -ness, -ity, -ment and -tion as well as their plural forms -nesses, -ities, -ments and -tions.
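The same expression can be tried out in Python's `re` module, whose syntax agrees with the notation above for this pattern. The sketch uses `fullmatch` because a CQP regex has to match the whole token; the word list is a made-up sample.

```python
import re

NOMINALIZATION = re.compile(r".{3,}(ness(es)?|it(y|ies)|(tion|ment)s?)")

words = ["happiness", "abilities", "movements", "nation",
         "city", "kindnesses", "quickly"]

# 'nation' and 'city' are rejected because .{3,} requires at least
# three characters before the suffix.
matched = [w for w in words if NOMINALIZATION.fullmatch(w)]
print(matched)
```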

  • Note:

    • Regexes in CQP are case-sensitive by default; add the "ignore case" modifier %c so that, e.g., "s.ng"%c also matches 'Song'.

    • regex in CQP is used both at the level of characters and at the level of tokens.

      So super.+ matches a word beginning with super- followed by one or more arbitrary characters, while [pos = "AT0"]?[pos = "AJ."]+[pos = "NN."] matches a sequence of an optional article, one or more adjectives, and a common noun.

      SQS allows regex notation only at the level of tokens, while simpler wildcards are used at the level of characters. E.g., super+ for words beginning with "super".

    • There are usually many equally valid ways of looking for the same information. E.g., the following two regexes, which search for words with more than three vowels in a row, are equivalent:

      ".*(a|e|i|o|u){4,}.*" %c
      ".*[aeiou]{4,}.*" %c
  • There are different dialects of regular expression syntax. CQP implements a version known as POSIX regular expressions (more recent CWB releases use the Perl-compatible PCRE library instead).
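Both points in the note, case sensitivity and the equivalence of the two vowel queries, can be checked with a short Python sketch. Python's `re.IGNORECASE` flag stands in for the %c modifier here; the word list is a made-up sample.

```python
import re

# The two queries from the note, with re.IGNORECASE playing the role of %c.
ALTERNATION = re.compile(r".*(a|e|i|o|u){4,}.*", re.IGNORECASE)
CHAR_CLASS = re.compile(r".*[aeiou]{4,}.*", re.IGNORECASE)

WORDS = ["queueing", "Hawaiian", "onomatopoeia", "beautiful", "AEIOU", "song"]

vowel_runs = [w for w in WORDS if CHAR_CLASS.fullmatch(w)]
same_result = vowel_runs == [w for w in WORDS if ALTERNATION.fullmatch(w)]
print(vowel_runs, same_result)

# Without the flag, the pattern s.ng does not match 'Song'.
print(re.fullmatch(r"s.ng", "Song") is None)
print(re.fullmatch(r"s.ng", "Song", re.IGNORECASE) is not None)
```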

Exercise

a) Write a regex query to find words that follow the orthographic pattern VCCVCCVCCVCC..., i.e., at least four repetitions of a group that is formed by a vowel followed by exactly two consonants. Use character classes to match the consonants and vowels.

Comparisons

  • Matching arbitrary substrings (in comparison with SQS)

Notation    Description                       Wildcard(SQS)
========    ==========================        =============
.           any character (exactly one)        ?
            s.ng -> 'sing','sang',...
.?          0 or 1 character                   [?,]
            f.?ee -> 'fee', 'free',...             
.*          0 or more characters               *
            work.*  -> 'work', 'works'          
.+          1 or more characters               +
            work.+  -> 'works','workshop'
.{n,m}      between n and m characters         
            work.{1,2}  -> 'works', 'worked'   
.{n,}       at least n characters (n or more)  
            work.{4,}  -> 'workings'           ????*
.{0,m}      at most m characters (m or fewer)
            o.{0,2}n  -> 'on', 'own','open'
  • Repetition operators for multi-character substrings:

(...)?          optional substring
                (un)?easy   -> 'easy', 'uneasy'
(...)*          0 or more repetitions of substring
                (anti-)*pop -> 'pop', 'anti-pop', 'anti-anti-pop'
(...)+          1 or more repetitions of substring
                (anti-)+pop -> 'anti-pop', 'anti-anti-pop'
(...){n,m}      between n and m repetitions of substring
                (ha){2,4}
(...){n,}       at least n repetitions of substring
                (ha){3,}
(...|...|...)   alternatives separated by | between parentheses
                ask(s|ed|ing)?  -> 'ask', 'asks', 'asked'...

[abc]           any of a,b,or c
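The repetition operators for substrings behave the same way in Python's `re` module, so the examples in the table can be verified directly. The candidate words below are made up for illustration, and `full_matches` is a hypothetical helper.

```python
import re

def full_matches(regex, candidates):
    """Return the candidates that the regex matches in full."""
    return [c for c in candidates if re.fullmatch(regex, c)]

print(full_matches(r"(un)?easy", ["easy", "uneasy", "ununeasy"]))
print(full_matches(r"(anti-)*pop", ["pop", "anti-pop", "anti-anti-pop"]))
print(full_matches(r"(anti-)+pop", ["pop", "anti-pop", "anti-anti-pop"]))
print(full_matches(r"(ha){2,4}", ["ha", "haha", "hahahaha", "hahahahaha"]))
print(full_matches(r"ask(s|ed|ing)?", ["ask", "asks", "asked", "asking", "askew"]))
```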

Give an example to illustrate why CQP is more powerful.

Attributes

  • Each token in the BNC is annotated with a POS tag, a headword and various other linguistic information, referred to as the ATTRIBUTES of a token.

Token attributes in BNCweb

word        original word form                      Gone
pos         part-of-speech tag                      VVN
hw          headword                                go
class       word class (simplified                  VERB
            pos tag/lemma category)
lemma       combination of headword+word class      go_VERB 
type        type of token (w:word, c:punctuation,   w
            x:missing text)
  • Unlike SQS, a corpus query in CQP allows you to access all these attributes in a consistent way.

      [ attribute = "regular expression" ]

    Such a combination of attribute name and regex is called a constraint, and the complete expression in square brackets, which specifies one or more constraints on a single token, is referred to as a token expression.

  • A token expression consists of an attribute name on the left, an operator (such as =), and on the right, in quotation marks, a pattern (regex) that the attribute value has to match. The entire expression is enclosed in square brackets [...], indicating that it refers to a single token.

      [hw = 'bright']        -> headword 'BRIGHT'
      [pos = 'AJS']          -> adjectives in superlative degree
  • Conditions can be combined by using a Boolean operator.

      [(word = "can"%c) & (pos != "VM0")]

    finds the word can (case-insensitive) tagged as anything but a modal verb (i.e., as a noun or lexical verb)
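A token expression can be thought of as a predicate over the attributes of one token. The Python sketch below models this with small helper functions; the helpers `constraint` and `all_of` and the sample tokens are hypothetical, and tokens are represented as plain dictionaries keyed by attribute name.

```python
import re

def constraint(attribute, pattern, negate=False, ignore_case=False):
    """Build a test for [attribute = "pattern"] (or != when negate is True)."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)

    def test(token):
        matched = regex.fullmatch(token[attribute]) is not None
        return matched != negate

    return test

def all_of(*tests):
    """Combine constraints with Boolean AND (the & operator)."""
    return lambda token: all(test(token) for test in tests)

# Hypothetical tokens with some of the BNCweb attributes.
tokens = [
    {"word": "Can", "pos": "VM0", "hw": "can"},
    {"word": "can", "pos": "NN1", "hw": "can"},
    {"word": "brightest", "pos": "AJS", "hw": "bright"},
]

# [(word = "can"%c) & (pos != "VM0")]
non_modal_can = all_of(
    constraint("word", "can", ignore_case=True),
    constraint("pos", "VM0", negate=True),
)
print([t for t in tokens if non_modal_can(t)])

# [hw = "bright"]
print([t["word"] for t in tokens if constraint("hw", "bright")(t)])
```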

Lexico-grammatical patterns and text structure

Make use of the automatic translation of SQS into CQP in the QUERY HISTORY function !!!

Advanced features of CQP queries

CQP in Word Sketch Engine

Ref.1 Ref.2

  • A query consists of a regular expression over attribute expressions and/or structures.

  • The attributes can be, for example, word and tag.

      [lemma = "bias"] [word = "towards|toward"] []{1,3}[tag= "NN."]
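To show how a sequence of token expressions with a repeated wildcard token []{1,3} is evaluated, here is a minimal backtracking matcher in Python. It is a sketch only: the helper names, the sample sentence and its tags are hypothetical, and it returns the shortest match per start position, which may differ from the matching strategy a real CQP installation is configured to use.

```python
import re

def attr(name, pattern):
    """Test for [name = "pattern"]; the regex must match the whole value."""
    regex = re.compile(pattern)
    return lambda token: regex.fullmatch(token[name]) is not None

def any_token(token):
    """Test for the empty token expression [], which matches any token."""
    return True

def match_at(tokens, pos, items):
    """Return the end index of a match for `items` starting at `pos`, or None."""
    if not items:
        return pos
    test, lo, hi = items[0]
    # Count how many consecutive tokens satisfy this item, up to hi.
    count = 0
    while count < hi and pos + count < len(tokens) and test(tokens[pos + count]):
        count += 1
    # Try the allowed repetition counts, shortest first.
    for n in range(lo, count + 1):
        end = match_at(tokens, pos + n, items[1:])
        if end is not None:
            return end
    return None

def search(tokens, items):
    """Return the matched word sequences, one per start position."""
    results = []
    for start in range(len(tokens)):
        end = match_at(tokens, start, items)
        if end is not None:
            results.append(" ".join(t["word"] for t in tokens[start:end]))
    return results

# [lemma = "bias"] [word = "towards|toward"] []{1,3} [tag = "NN."]
QUERY = [
    (attr("lemma", "bias"), 1, 1),
    (attr("word", "towards|toward"), 1, 1),
    (any_token, 1, 3),
    (attr("tag", "NN."), 1, 1),
]

# Hypothetical annotated sentence.
TOKENS = [
    {"word": "a", "lemma": "a", "tag": "AT0"},
    {"word": "bias", "lemma": "bias", "tag": "NN1"},
    {"word": "towards", "lemma": "towards", "tag": "PRP"},
    {"word": "the", "lemma": "the", "tag": "AT0"},
    {"word": "younger", "lemma": "young", "tag": "AJC"},
    {"word": "men", "lemma": "man", "tag": "NN2"},
]
print(search(TOKENS, QUERY))
```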

Practice: COPENS and WSE.

The BNC tagset can be found here. Also note that the POS tags in the corpus were assigned by an automatic tagger and are not always correct.

The IMS Corpus Workbench and its Corpus Query Processor can be found at http://cwb.sourceforge.net

In computer terminology, a character is a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol. Examples of characters include letters, numerical digits, common punctuation marks (such as "." or "-"), and whitespace.
