🌟 The Door of Truth (真理の扉)

Kyoko (鏡子)

I am making a previously private piece public. For some reason, it suddenly leads to Russia.

Is Japanese being left out of the corpora?

Monolingual corpora

English

I-EN, a corpus of about 160 million words. This corpus was compiled automatically from the Internet in 2005 along with other Internet corpora (for Chinese, French, German, Italian, Spanish, Polish and Russian).
I-EN-CC, a corpus of about 160 million words consisting of pages labeled with a Creative Commons (CC BY) License. This means that the collection can be downloaded and reused according to the terms and conditions set by the authors.
The British National Corpus (BNC), a classic collection of samples of modern British English, 100 million words.
The Reuters corpus, a collection of newswires from Reuters for one year from 1996-08-20 to 1997-08-19, 90 million words.
A corpus of British News, a collection of news stories from 2004 from each of the four major British newspapers: Guardian/Observer, Independent, Telegraph and Times, 200 million words.

Russian

The Russian National Corpus, a collection of texts comparable to the BNC in its design; its pilot version has 100 million words (a more detailed description of the project is available in Russian from http://ruscorpora.ru).
Russian Internet Corpus, a corpus of about 90 million words. This corpus was compiled automatically from the Internet in February-April 2005 along with other Internet corpora.
A corpus of Russian newspapers, 78 million words (Izvestia, Trud and Strana.ru).
The Russian Standard, a corpus of modern Russian fiction with manual disambiguation of morphological categories, 1.6 million words.
The interface to Russian corpora is available from http://corpus.leeds.ac.uk/ruscorpora.html

Chinese

Chinese Internet Corpus, a corpus of about 90 million words. This corpus was compiled automatically from the Internet in February-April 2005 along with other Internet corpora.
A fragment of the LDC Chinese Gigaword corpus, 35 million words, tokenised and lemmatised using the NEUCSP tool from NLP Lab, North-Eastern University, China. The selection includes newswires for one year (2001), which makes it comparable to the Reuters corpus.
Guo Jin's Chinese PH corpus, which is based on XINHUA news from 1990; segmentation done by Chris Brew and Julia Hockenmaier, 2.5 million words.
Lancaster Corpus of Mandarin Chinese, a corpus of about 1 million words, which is comparable in its design to Brown and LOB type corpora. Created by Tony McEnery and Richard Xiao, distributed by the European Language Resources Association (Cat. No ELRA-W0039) and the Oxford Text Archive (Cat. No 2474).
The interface to Chinese corpora is available from http://corpus.leeds.ac.uk/query-zh.html

Multilingual aligned corpora

English-Russian, Russian-English fiction: a small parallel corpus of English and Russian fiction from the 19th century (aligned by A. Kretov, Voronezh).
English-German corpus of European Parliament Proceedings; source texts were taken from Phil Köhn's page.
German-English Parallel Corpus "de-news"; also taken from Phil Köhn's page.
English-Japanese corpus of Yomiuri data (it is available in-house only).

Internet corpora

There are few large general corpora of the size of the BNC (100 million words) available. Within the Wacky (Web as Corpus) project we developed a set of procedures for collecting corpora from the Internet and collected large representative corpora for Arabic, Chinese, French, German, Italian, Spanish, Polish and Russian, with the search interface available from http://corpus.leeds.ac.uk/internet.html.
The query interface to all corpora is powered by the IMS Corpus Workbench, but it has been extended to simplify processing of some frequent cases, in particular, querying for lemmas and for exact word forms (all corpora have word, pos and lemma attributes, even if the latter is redundant for Chinese). Other possibilities include calculation of the most significant collocations (using MI, T and log-likelihood scores) and searching for similar contexts in the English, German and Russian corpora.
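
For readers unfamiliar with the three collocation scores named above, here is a minimal Python sketch of their common textbook definitions. It is an assumption that the Leeds interface uses these formulations; the listing does not say how it computes them. The function name `collocation_scores` and its parameters are hypothetical and chosen only for illustration.

```python
import math


def collocation_scores(pair_freq, w1_freq, w2_freq, total):
    """Return (MI, t-score, log-likelihood) for one word pair.

    pair_freq: how often the two words occur together
    w1_freq, w2_freq: how often each word occurs in the corpus
    total: corpus size in tokens
    These are textbook definitions, assumed here for illustration.
    """
    # Co-occurrence count expected if the two words were independent.
    expected = w1_freq * w2_freq / total

    # Mutual information: log ratio of observed to expected co-occurrence.
    mi = math.log2(pair_freq / expected)

    # t-score: difference from expectation, scaled by sqrt of the observed count.
    t_score = (pair_freq - expected) / math.sqrt(pair_freq)

    # Log-likelihood (G2) over the 2x2 contingency table of observed counts.
    observed = [
        [pair_freq, w1_freq - pair_freq],
        [w2_freq - pair_freq, total - w1_freq - w2_freq + pair_freq],
    ]
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                e = row[i] * col[j] / total
                g2 += o * math.log(o / e)
    return mi, t_score, 2 * g2


# Made-up counts: the pair occurs 30 times in a corpus of 1,000,000 tokens.
mi, t, ll = collocation_scores(30, 1000, 500, 1_000_000)
print(f"MI={mi:.2f}  T={t:.2f}  LL={ll:.2f}")
```

MI tends to favour rare pairs, while the t-score and log-likelihood give more weight to frequent ones, which is one reason an interface might offer all three.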

The interface was developed by Serge Sharoff; contact me at s.sharoffleeds.ac.uk, if you have further queries.

Below is my own attempt at a summary.

The main focus is English.

The other Internet corpora are:

Chinese

French

German

Italian

Spanish

Polish

Russian

Large representative corpora have reportedly been collected for them: http://corpus.leeds.ac.uk/internet.html

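English-Russian, Russian-English fiction: a small parallel corpus of English and Russian fiction from the 19th century (aligned by A. Kretov, Voronezh).

● English-German corpus of European Parliament Proceedings.

● The source texts come from Phil Köhn's page.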

With a search interface available from http://corpus.leeds.ac.uk/internet.html, large representative corpora have reportedly been collected for:

Arabic

Chinese

French

German

Italian

Spanish

Polish

Russian

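I found only one piece of information relating to Japan.

English-Japanese corpus of Yomiuri data (it is available in-house only)

◎ The English-Japanese corpus is reportedly available in-house only.

So Japanese was left out after all.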
