🌟The Door of Truth

鏡子 (Kyoko)

Chapter 182: Hunter's Halloween Micro Blue Moon

Internet Corpus

Japanese frequency lists (tokenisation, lemmatisation and POS tagging by ChaSen)
    lemmas from the Internet corpus
    word forms from the Internet corpus
    POS frequencies from the Internet corpus
Portuguese frequency lists (tokenisation, lemmatisation and POS tagging by TreeTagger)
    lemmas from the Internet corpus
    word forms from the Internet corpus
    POS frequencies from the Internet corpus
Spanish frequency lists (tokenisation, lemmatisation and POS tagging by TreeTagger)
    lemmas from the Internet corpus
    word forms from the Internet corpus
    POS frequencies from the Internet corpus
There is also a frequency list of Georgian produced by Garold Shmaltsel and Givi Nozadze.

The structure of the lists follows the template of the lemmatised BNC lists produced by Adam Kilgarriff, namely:

[word rank] [normalised frequency] [lemma, word form or POS]
Note that the frequency has been normalised to ipm (instances per million words): the number of instances of an individual word or POS tag per million words in the respective corpus. Normalisation makes it possible to compare frequencies in the BNC against those in the Internet corpus. If you want to know the actual number of occurrences of a word listed there, multiply the frequency by the corpus size in millions of words (the size of a corpus is shown at the top of its frequency list). For instance, browser is used about 8556 times in the English Internet Corpus (47.17 × 181.376 ≈ 8556).
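
The conversion described above can be sketched in a few lines of Python. This is only an illustration: it assumes each list line holds three whitespace-separated fields in the order given by the template, the function names are hypothetical, and the rank 1000 in the sample line is made up (only the ipm value 47.17 and the corpus size 181.376 come from the text).

```python
def parse_line(line):
    """Split one '[rank] [ipm] [item]' line into typed fields (assumed layout)."""
    rank, ipm, item = line.split(maxsplit=2)
    return int(rank), float(ipm), item


def ipm_to_count(ipm, corpus_size_millions):
    """Estimate the raw number of occurrences from an ipm value."""
    return ipm * corpus_size_millions


def count_to_ipm(count, corpus_size_words):
    """Normalise a raw count to instances per million words."""
    return count / corpus_size_words * 1_000_000


# The rank here is a placeholder; 47.17 and 181.376 are the figures quoted above.
rank, ipm, item = parse_line("1000 47.17 browser")
print(item, round(ipm_to_count(ipm, 181.376)))
```

Because ipm values are rounded in the lists, the count recovered this way is an estimate rather than an exact figure.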

Finally, we have lists of distributionally similar words for English, German and Russian (words are said to be distributionally similar if they share a significant number of collocates in the corpus). The lists have been produced by Reinhard Rapp using Singular Value Decomposition (SVD).
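
To make "sharing collocates" concrete, here is a minimal sketch that compares words by the cosine of their collocate-count vectors. It is not Rapp's method: the SVD step, which would first reduce the co-occurrence matrix to fewer dimensions, is deliberately left out to keep the example short, and every word and count below is invented for illustration.

```python
import math

# Made-up collocate counts: how often each word co-occurs with each collocate.
collocates = {
    "car":   {"drive": 8, "road": 6, "eat": 0, "sweet": 1},
    "truck": {"drive": 7, "road": 5, "eat": 1, "sweet": 0},
    "apple": {"drive": 0, "road": 1, "eat": 9, "sweet": 7},
    "pear":  {"drive": 1, "road": 0, "eat": 8, "sweet": 6},
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(word):
    """Return the other word whose collocate profile is closest to `word`."""
    others = (w for w in collocates if w != word)
    return max(others, key=lambda w: cosine(collocates[word], collocates[w]))


print(most_similar("car"))
```

With real corpus data the count vectors are very large and sparse, which is one reason a dimensionality reduction such as SVD is applied before comparing them.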

The lists are distributed under the Creative Commons Attribution (CC BY) license.

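● The lists are distributed under the Creative Commons Attribution (CC BY) license

Creative Commons, at least, I know about.

It is always explained at the bottom of every Wikipedia page.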