What is a corpus?
A corpus is simply a large collection of texts: entire documents, sentences or even sentence chunks. Most corpora are monolingual, like the Hungarian National Corpus. For language learners, a more exciting variety is the aligned bilingual corpus, which contains sentences along with their equivalents in another language.
The CHDICT website allows you to search nearly 3 million Chinese and Hungarian sentence pairs from the open-source OpenSubtitles2016 corpus.
A substitute for example phrases
CHDICT’s entries do not contain example phrases. It would be beyond the authors’ means to create these and have them proofread in both Chinese and Hungarian. Corpus search, however, allows you to instantly browse a large number of real examples and form an impression about a word’s usage. You can see what other words a Chinese word tends to co-occur with, and in what constructions it is typically used.
Extended dictionary
There are over 60 thousand Hungarian words that occur at least three times among the corpus’s 3 million sentences. The Chinese vocabulary is hard to estimate because of the difficulties of Chinese word segmentation, but even the Hungarian numbers show that the corpus’s vocabulary is several times larger than CHDICT’s coverage. If you don’t find a word in CHDICT, chances are good that you can still decipher its meaning by looking in the corpus.
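To make the "at least three times" threshold concrete, here is a minimal sketch of how such a vocabulary count could be produced. It is an illustration only, not the procedure actually used for CHDICT: the function name `frequent_words`, the simple regular-expression tokenizer and the sample sentences are all assumptions, and the sketch counts every inflected form as a separate word.

```python
import re
from collections import Counter

def frequent_words(sentences, min_count=3):
    """Return the set of lowercased word forms seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        # \w+ matches runs of letters and digits, including accented Hungarian letters.
        counts.update(re.findall(r"\w+", sentence.lower()))
    return {word for word, n in counts.items() if n >= min_count}

# Tiny made-up sample; the real corpus has about 3 million sentences.
sample = [
    "Hol van a telefon?",
    "A telefon az asztalon van.",
    "Add ide a telefon töltőjét!",
]
vocabulary = frequent_words(sample)
print(sorted(vocabulary))
```

The same approach does not carry over directly to Chinese, because Chinese text has no spaces between words and must be segmented first.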
Highlighted translations
Every result shows your search term with a strong highlight. When searching for Hungarian words, you will first find the exact form you entered, followed by other inflected forms.
The exciting thing about the corpus search on this site is that you also get highlights in the other language, indicating your search term’s likely translation within the full sentence. The stronger this highlight, the more likely it is that the translation is the right one. Keep in mind, though, that these highlights are generated by an algorithm, so they are not guaranteed to be correct. Their main benefit is that they make it easy to quickly scan the search results for equivalents.
When you’re searching for Chinese terms, there is another thing to look out for. The Chinese text was segmented into words algorithmically, and often your search text is part of a larger unit within a retrieved sentence. For example, if you search for 电话, a retrieved Chinese sentence may contain it as part of a longer, automatically identified unit, 打电话. This is important because the highlighted Hungarian translation refers to the longer fragment, not to your original search term.
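The effect of segmentation on a search can be illustrated with a small sketch. It assumes that a sentence is stored as a list of segmented units; the function name `units_containing` and the sample segmentation are hypothetical, and the site's actual segmenter and data format may well differ.

```python
def units_containing(segmented_sentence, query):
    """Return the segmented units that contain the query text."""
    return [unit for unit in segmented_sentence if query in unit]

# Hypothetical segmentation of one sentence; a real segmenter may split it differently.
sentence = ["我", "给", "你", "打电话"]

matches = units_containing(sentence, "电话")
print(matches)
```

Here the search text 电话 is found only inside the longer unit 打电话, so any translation highlight would be attached to that whole unit.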
Limitations
Although these 3 million Chinese/Hungarian sentence pairs are a genuine treasure trove, it’s important to keep in mind the corpus’s limitations.
- CHDICT entries indicate both the traditional and the simplified characters used to write the headword, but all text in the corpus is in the simplified script.
- Sentence pairs are practically never direct translations of each other. Both the Chinese and the Hungarian most likely derive from a single English original.
- Movie subtitles are often not full sentences, merely sentence chunks.
- Because most movies are produced in the US, the corpus’s content has a strong US cultural bias. Colorado, Charlie and beef jerky are all in there; Hortobágy, Huba and mákos guba are not.
- Movies are overwhelmingly popular culture, so the dialog is typically informal and colloquial. With Hungarian sentences this is further exacerbated by a subtitling tradition that favors an even more slangy register, often with a penchant for the vulgar.
- The creators of the underlying OpenSubtitles corpus aligned the Chinese and Hungarian subtitles with automatic methods, relying largely on timestamps. This sometimes produces false alignments, so don’t be surprised if you occasionally find a pair where the Chinese sentence has nothing to do with the Hungarian, or if there is an extra chunk in one of the two.
- Any digital corpus of this size is bound to have some amount of trash, for instance in the form of English sentences instead of Hungarian, or outright garbage resulting from encoding errors. I discarded nearly half of the original content for such reasons, but even so, some amount of noise still remains.
For all these reasons, you ought to peruse the corpus with a critical attitude and a healthy dose of skepticism. In other words, treat it just as you would every other source that comes your way.
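To see why timestamp-based alignment, mentioned among the limitations above, can go wrong, consider the following simplified sketch. It is not the method used by the creators of OpenSubtitles; it only assumes that each subtitle is a `(start, end, text)` tuple with times in seconds, and pairs each subtitle with the one in the other language that it overlaps most in time. The function names, the `min_overlap` threshold and the sample subtitles are all made up for illustration.

```python
def overlap(a, b):
    """Length in seconds of the time shared by two (start, end, text) subtitles."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def align_by_time(subs_a, subs_b, min_overlap=0.5):
    """Pair each subtitle in subs_a with the subtitle in subs_b it overlaps most."""
    pairs = []
    for sub in subs_a:
        best = max(subs_b, key=lambda other: overlap(sub, other), default=None)
        # Keep the pair only if the two subtitles share enough screen time.
        if best is not None and overlap(sub, best) >= min_overlap:
            pairs.append((sub[2], best[2]))
    return pairs

# Made-up subtitles: the last Chinese line has no Hungarian counterpart nearby.
chinese = [(1.0, 3.0, "你好"), (4.0, 6.0, "我给你打电话"), (9.0, 10.0, "再见")]
hungarian = [(1.2, 2.8, "Szia!"), (4.1, 6.5, "Majd felhívlak."), (20.0, 21.0, "Viszlát!")]

pairs = align_by_time(chinese, hungarian)
print(pairs)
```

A method like this never looks at the text itself, so two subtitles that happen to be on screen at the same time are paired even when one translator split or merged lines differently from the other.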