Introduction

Corpus linguistics is the branch of linguistics that studies how empirical methods can be applied to the analysis of language.

What is language data?

linguistic data vs corpus data
Forms: signal and symbol
Types: written, spoken, multimodal, etc.
Static vs Dynamic (on the move).

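A more fundamental step is to understand what language is. Unfortunately, there is no agreed answer to this question. We can, however, approach one gradually through the process of working with data.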

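For example, processing language data immediately raises the need to segment it into units. Yet after observing a large amount of actual language use, it becomes easier to see that linguistic units (words, phrases, sentences, and so on) are "linguistic concepts" and not nearly as self-evident as they seem.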

Linguists often take it for granted that there is such a thing as a sentence because the data they are working with normally shows it, data which is in almost all cases written or otherwise recorded languages, even in the case of transcriptions of spoken language.
Sentences are linear arrangements of words to which a syntactic structure can be assigned and which feature, in most Indo-European languages, a finite verb in their main clause.... The notion .... owing itself to the introduction of writing. Writing presupposes standardisation much more than an oral language because a written text must stand for itself, it must be interpretable even in the absence of the writer. (W. Teubert, 2010. Rethinking Corpus Linguistics)
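
To make the point about units concrete, here is a minimal sketch in which the same string yields a different number of "units" depending on the segmentation convention applied. The two toy lexicons and the function name `forward_max_match` are assumptions made for illustration; they do not come from any real segmenter or corpus standard.

```python
def forward_max_match(text, lexicon):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry that matches, else fall back to one character."""
    max_len = max((len(entry) for entry in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


text = "語料庫語言學"

# Two hypothetical lexicons standing in for two segmentation conventions.
fine_lexicon = {"語料", "語言"}
coarse_lexicon = {"語料庫", "語言學"}

by_character = list(text)
fine = forward_max_match(text, fine_lexicon)
coarse = forward_max_match(text, coarse_lexicon)

print(len(by_character), by_character)
print(len(fine), fine)
print(len(coarse), coarse)
```

The three results contain six, four, and two units respectively. None of them is "the" correct count; each reflects a decision about what a unit is.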

What is a corpus?

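A corpus is a digitized collection of language data that carries linguistic annotation. The raw material itself is neutral (theory-independent), but the annotation is necessarily subjective.
A corpus usually comes with tools for searching and analyzing it.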
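
As a rough illustration of what such a search tool does, the sketch below runs a keyword-in-context (KWIC) query over a tiny annotated token list. The sample sentence, the part-of-speech tags, and the function name `kwic` are all hypothetical; real corpus tools offer far richer query languages.

```python
def kwic(tokens, keyword, window=2):
    """Return (left context, node word, right context) for each
    case-insensitive match of keyword in a list of (word, tag) pairs."""
    lines = []
    for i, (word, _tag) in enumerate(tokens):
        if word.lower() == keyword.lower():
            left = " ".join(w for w, _ in tokens[max(0, i - window):i])
            right = " ".join(w for w, _ in tokens[i + 1:i + 1 + window])
            lines.append((left, word, right))
    return lines


# Hypothetical annotated corpus: the words are the raw material,
# the tags are one annotator's (or one tagset's) analysis.
corpus = [
    ("The", "DET"), ("corpus", "NOUN"), ("contains", "VERB"),
    ("texts", "NOUN"), ("and", "CCONJ"), ("the", "DET"),
    ("corpus", "NOUN"), ("is", "AUX"), ("annotated", "VERB"),
]

hits = kwic(corpus, "corpus", window=2)
for left, node, right in hits:
    print(f"{left:>10} [{node}] {right}")
```

The words would be the same under any theory, whereas the tag column would change with the tagset chosen, which is the sense in which annotation is subjective.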

Corpus linguistics (for the present era)

A way of doing linguistics by looking for structures and patterns in the data, but one that needs to match the spirit of the times: big data, crowdsourcing, hack and make, collective intelligence, individual computing, etc.
Methodology for building and analyzing corpora at big-data scale.
- corpus-based/corpus-driven (cf. supervised/unsupervised paradigm in machine learning).
- Sampling size and Inference
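
One way to picture the corpus-based/corpus-driven contrast is the sketch below: the corpus-based function checks a pattern chosen in advance, while the corpus-driven function ranks whatever patterns the data contains. The sample text and both function names are assumptions made for illustration, and bigram counting stands in for much richer analyses.

```python
from collections import Counter


def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))


def corpus_based(tokens, hypothesis):
    """Start from a pattern chosen beforehand and count it in the data."""
    return Counter(bigrams(tokens))[hypothesis]


def corpus_driven(tokens, top_n=2):
    """Start from the data and let the most frequent patterns emerge."""
    return Counter(bigrams(tokens)).most_common(top_n)


# Hypothetical sample text.
tokens = "make a decision then make a plan and take a decision".split()

print(corpus_based(tokens, ("make", "a")))
print(corpus_driven(tokens, top_n=2))
```

With a sample this small, the counts say little about the language as a whole, which is where sample size and inference come in.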
Integration with cognitive neuropsychology
- A wide range of empirical methods have been employed for investigating the relationship between observable behaviour and underlying mental/neural processes.
Integration with digital humanities, history, and the social sciences

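Corpora and empirical methods of language inquiry have become one of the core research approaches in today's linguistics-related research communities. Most language-related research makes use of corpus tools to some degree. In terms of historical context, with the growth of social media and social networks, unstructured text data now far outweighs structured tabular data, which makes the linguistic analysis of text increasingly important in the development of information technology. The need to handle large-scale text data has also given corpus tools an increasingly important role, not only in language and teaching research but also in the social sciences, neuropsychology, and cognitive science.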

