This page describes a variety of simple clues one can use to determine the language a document is written in with high accuracy.

Characters

You can recognize text in a foreign language by looking for characters specific to that language. This is often more accurate than language recognition software that pays little attention to the characters themselves.

  • ABCDEFGHIJKLMNOPQRSTUVWXYZ (Latin alphabet)
    • and no other - English language, Zulu language, Japanese language in Romaji (see below), Indonesian language, Hawaiian language
    • ÆØÅæøå - Danish language, Norwegian language
    • ÅÄÖåäö - Swedish language
    • àéëï - Dutch language
    • ĉĈĝĜĥĤĵĴŝŜŭŬ - Esperanto
    • àâçéèêîïôœùû - French language
    • ÄÖÜäöüß - German language
    • àèìòù - Italian language
    • ãõçáéíóúâêîôûàèìòù qü (Brazilian) (k, w and y not in native words) - Portuguese language
    • áéíñÑóúü ¡¿ - Spanish language
    • ÀÇÉÈÍÓÒÚàçéèíóòú· - Catalan language
    • ÁÉÍÓÖŐÚÜŰáéíóöőúüű - Hungarian language
    • ĂÎÂŞŢăîâşţ - Romanian language
    • çÇğıİöÖşŞüÜ - Turkish language
    • ąćęłńóśźż - Polish language
    • ČŠŽ
      • and no other - Slovenian language
      • ĆĐ - Bosnian language, Croatian language
      • ÁĎÉĚŇÓŘŤÚŮÝáďéěňóřťúůý - Czech language
    • ả ạ ấ ầ ẩ ẫ ậ ắ ằ ẳ ẵ ặ ẹ ẻ ế ề ể ễ ệ ỉ ĩ ị ỏ ọ ổ ỗ ộ ủ ụ ỷ ỹ ỵ đ – Vietnamese
  • БДЖИЛПУЦЧШ (Cyrillic alphabet)
    • ЙЩЬЮЯ
      • ҐЄЇ - Ukrainian language
      • Ъ - Bulgarian language
    • ЉЊЏ (Vuk Karadzic's reform)
      • ЋЂ - Serbian language
      • ЃЌЅ - Macedonian language
    • Old Church Slavonic
    • In Transnistria, Romanian is written in Cyrillic characters
  • ΔΘΛΨΩαβγδεζηθικλμνξπρςστυφχψω (Greek Alphabet)
    • Greek language
  • אבגדהוזחטיכלמנסעפצקרשת (Hebrew alphabet)
    • and maybe some odd dots and lines - Hebrew language
    • Yiddish
    • Ladino
  • الصفحة الرئيسية - Arabic alphabet
    • Arabic, Persian, Malay (Jawi), Kurdish, Panjabi, Pashto, Sindhi, Urdu, others
  • 日本語勉強 - East Asian Languages
    • and no other - Chinese language
    • with あいうえお Hiragana and/or アイウエオ Katakana - Japanese language
    • with characters like 위키백과에 - Korean language
    • Vietnamese uses Latin alphabet – see above
  • ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏ etc. -- ㄓㄨㄧㄋㄈㄨㄏㄠ (Zhuyin)
    • ㄪㄫㄬ -- not Mandarin
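
The first step in the list above, deciding which script a text uses, can be sketched in Python. This is only an illustration: it assumes that the first word of each character's Unicode name (as reported by the standard unicodedata module) is a good enough label for the script, and the function names script_counts and guess_east_asian are made up for this example.

```python
import unicodedata
from collections import Counter

# First word of the Unicode character name, used here as a rough script label.
SCRIPT_LABELS = ("LATIN", "CYRILLIC", "GREEK", "HEBREW", "ARABIC",
                 "HIRAGANA", "KATAKANA", "HANGUL", "CJK", "BOPOMOFO")


def script_counts(text):
    """Count the letters of each script that occur in the text."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        label = name.split(" ", 1)[0] if name else ""
        if label in SCRIPT_LABELS:
            counts[label] += 1
    return counts


def guess_east_asian(text):
    """Apply the East Asian rules from the list above, in order of specificity."""
    counts = script_counts(text)
    if counts["HANGUL"]:
        return "Korean"
    if counts["HIRAGANA"] or counts["KATAKANA"]:
        return "Japanese"
    if counts["CJK"]:
        return "Chinese"
    return None
```

A text containing only Han characters is reported as Chinese, following the "and no other" rule above. A short Japanese text written entirely in kanji would therefore be misclassified.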

You can also recognise languages (especially those written in the Latin script) by looking for common words and letter patterns, for example:

Artificial languages

Esperanto

  • words: de, la, al
  • additional letters: ĉĈĝĜĥĤĵĴŝŜŭŬ
  • words ending in -o, -a, -oj, -aj, -as

Klingon

  • When written in the Latin alphabet, Klingon has the unusual property that case is distinctive: "q" and "Q" are different letters. As a result, many words look quite strange to people who aren't used to it, for example "yIDoghQo'" and "tlhIngan Hol".
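
The case distinction described above can be turned into a rough check for capital letters that appear inside a word. The sketch below is a heuristic only: the names has_inner_capital and inner_capital_rate are invented for this example, and camelCase identifiers or brand names in other languages trigger it as well.

```python
import re

# A lowercase letter immediately followed by a capital, i.e. a capital
# that is not at the start of the word.
INNER_CAPITAL = re.compile(r"[a-z][A-Z]")


def has_inner_capital(word):
    """True if the word contains a capital letter directly after a lowercase one."""
    return INNER_CAPITAL.search(word) is not None


def inner_capital_rate(text):
    """Fraction of whitespace-separated words that contain an inner capital."""
    words = text.split()
    if not words:
        return 0.0
    return sum(has_inner_capital(w) for w in words) / len(words)
```

A high rate over a whole text is a hint that it may be Klingon. A single word proves nothing.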

Lojban

  • starts with "ni'o" or ".i" (or "i");
  • has many words like "ko'a" "pi'o" etc;
  • all lowercase;
  • usually no punctuation except for dots.

Written with (possibly extended) Latin alphabet

Romance languages

Lots of Latin roots.

French

  • words: de, la, le, du, des, il, et;
  • words ending in -x, especially -aux or -eux;
  • many apostrophised contractions, i.e. words beginning with l' or d'
  • accented letters: à â ç è é ê î ô û, rarely ë ï, but never á í ì ó ò ú, and ù only in the word "où"

Spanish

  • characters: ¿ ¡ (inverted question and exclamation marks), ñ
  • word endings: -o, -a, -ción, -miento, -dad
  • angle quotation marks: « » (though "curly-Q" quotation marks are also used)

Catalan

  • character combination "l·l"
  • word endings: -o, -a, -es, -ció, -tat

Romanian

  • characters: ă â î ş ţ
  • words: şi, de, la, a, ai, ale, alor, cu
  • word endings: -a, -ă, -u, -ul, -ţie (or -ţiune), -ment, -tate
  • Note that Romanian is sometimes written online with no diacritics, making it harder to identify

Portuguese

  • Common one-letter words: a, à, e, é, o
  • Common two-letter words: ao, as, às, da, de, do, os, um
  • Common three-letter words: aos, das, dos, ele, ela, por, que, uma, uns
  • Common endings: -ção, -ções
  • Most singular words end in vowels. Other singular words end in l, m, r, z
  • Plural words end in s

Germanic languages

Dutch

  • letter sequences "ij", "aa";
  • words: het, op, een, voor (and compounds of voor).

German

  • umlauts (ä, ö, ü), Eszett (ß)
  • common words: der, die, das, er, sie, es, ist, und, oder, aber
  • common endings: -en, -er, -ern, -st
  • long compound words
  • many capitalized words in the middle of sentences
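
The common-word clues can be combined into a simple score. The following sketch uses only the French, Dutch and German word lists given above; the names COMMON_WORDS, score_by_common_words and guess_by_common_words are made up for this illustration, and the approach is unreliable on very short texts.

```python
import re

# Function words copied from the lists above; other languages can be added.
COMMON_WORDS = {
    "French": {"de", "la", "le", "du", "des", "il", "et"},
    "Dutch": {"het", "op", "een", "voor"},
    "German": {"der", "die", "das", "er", "sie", "es", "ist", "und", "oder", "aber"},
}


def score_by_common_words(text):
    """Count, per language, how many words of the text are listed common words."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {lang: sum(w in vocab for w in words)
            for lang, vocab in COMMON_WORDS.items()}


def guess_by_common_words(text):
    """Return the language with the highest score, or None if nothing matched."""
    scores = score_by_common_words(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Ties are resolved in favour of whichever language comes first in the dictionary, so in practice this score is better used alongside the character and word-ending clues than on its own.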

Slavic languages

Polish

  • unusual consonant clusters "rz", "szcz", "prz", "trz";
  • words "i", "w";
  • word "się".

Czech

  • Visual abundance of letters "ž,š,ů,ě,ř";
  • words "je", "v".

Japanese in Romaji

  • words: "desu", "masu", "aru", "suru", esp. at end of sentences;
  • letters: nearly 50% vowels (a e i o u);
  • letters: no consonants, except "n" and "h", at end of words
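
The two letter-based clues above can be checked mechanically. This is a minimal sketch that assumes plain ASCII Romaji without macrons; the 0.45 threshold is an assumed cut-off standing in for "nearly 50%", not a calibrated value, and the function names are invented for this example.

```python
import re

VOWELS = set("aeiou")


def vowel_ratio(text):
    """Fraction of the letters in the text that are a, e, i, o or u."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def looks_like_romaji(text, min_vowel_ratio=0.45):
    """Check both clues: many vowels, and words ending only in a vowel, n or h."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return False
    endings_ok = all(w[-1] in VOWELS or w[-1] in "nh" for w in words)
    return endings_ok and vowel_ratio(text) >= min_vowel_ratio
```

Other vowel-rich languages that use only the plain Latin alphabet, such as Hawaiian, may also pass this test, so it should be combined with the common-word check.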

Hungarian

  • words "a", "az", "ez", "egy", "és", "van"

Finnish

  • diacritics used: only ä and ö, but never õ
  • common words: sinä, on
  • common endings: -nen, -ka/-kä, -in
  • common letter combinations: ei, äi
  • unusually high degree of letter duplication, both vowels and consonants
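
The last clue, letter duplication, is easy to measure. The sketch below computes the share of adjacent letter pairs that consist of the same letter twice; the name doubled_letter_rate is made up for this example, and no threshold is suggested because a useful cut-off would have to be calibrated on real text.

```python
def doubled_letter_rate(text):
    """Fraction of adjacent letter pairs in which both letters are identical."""
    text = text.lower()
    pairs = [(a, b) for a, b in zip(text, text[1:])
             if a.isalpha() and b.isalpha()]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

Comparing the rate of an unknown text with the rates of known Finnish and non-Finnish samples is more informative than looking at the absolute number.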

Estonian

  • similar to Finnish, except:
  • diacritics used: ä, ö, ü, õ, š, ž
  • words end in consonants more frequently than in Finnish

Vietnamese

  • Roman characters with many diacritical marks on vowels. See above.
  • Almost all written words are quite short (one syllable).
  • Words beginning with "ng"

Minnan in Pe̍h-oē-jī

  • Many hyphenated words.
  • Roman characters with many diacritical marks on vowels. Unlike Vietnamese each character has at most one such mark.
  • Unusual combining characters, namely a dot above right (resembling ·, always after "o") and a vertical line above (resembling |). The macron (¯) is also common.

Chinese Mandarin

Pinyin

  • See Pinyin;
  • You may notice numbers after words; they represent tones.

Greek

Modern Greek is written in the Greek alphabet in monotonic, polytonic or atonic form, following either Demotic (Triantafilidis) grammar or Katharevousa grammar. Some people write in Greeklish (Greek in Latin script), which may be visual-based, orthographic, phonetic or simply mixed. The only official forms of the Greek language are monotonic and polytonic.

Normal Modern Greek (Greek Monotonic)

  • words "και", "είναι";
  • Each multi-syllable word has one accent/tone mark (oxia): ά έ ή ί ό ύ ώ
  • The only other diacritic ever used is the trema: ϊ/ΐ, ϋ/ΰ, etc.

Ancient or pre-1980s Greek (Greek Polytonic)

  • This is Katharevousa or some mixed form of Demotiki (Triantafilidis' grammar) and Katharevousa;
  • You will notice several accents/tones. Examples: ~ ` and oxia (looks like 'ί);
  • You may also notice these: ΐ, ΰ, ϊ, ϋ etc.

Greek Atonic

  • Was common in some Greek media (television);
  • You will see Greek characters without accents/tones;
  • words: "και, ειναι, αυτο".
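
The three accent systems described above can be told apart roughly by looking at Unicode code points. This sketch rests on two assumptions: that polytonic letters end up in the Greek Extended block (U+1F00 to U+1FFF) after NFC normalization, and that a monotonic accent shows up as a combining acute (U+0301) when a letter is decomposed. The function name classify_greek_accents is invented for this example.

```python
import unicodedata


def classify_greek_accents(text):
    """Return 'polytonic', 'monotonic', 'atonic', or None if there is no Greek."""
    text = unicodedata.normalize("NFC", text)
    has_greek = False
    has_tonos = False
    for ch in text:
        code = ord(ch)
        if 0x1F00 <= code <= 0x1FFF:
            # Greek Extended block: letters with breathings, grave, circumflex etc.
            return "polytonic"
        if 0x0370 <= code <= 0x03FF:
            # Basic Greek block; decompose to look for an acute accent.
            has_greek = True
            if "\u0301" in unicodedata.normalize("NFD", ch):
                has_tonos = True
    if not has_greek:
        return None
    return "monotonic" if has_tonos else "atonic"
```

A very short text consisting only of unaccented one-syllable words is reported as atonic even if it was written in the monotonic system.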

Greek in Greeklish

  • Automated conversion software for Greeklish->Greek conversion exists. If you notice a Greeklish text it may be useful for the Greek el.Ireland Information Guide (after conversion).
  • Keep in mind: in Greeklish more than one character may be used for one letter (for example, "th" for theta).

Orthographic Greeklish

  • words "kai", "einai".

Phonetic Greeklish

  • words "ke", "ine";
  • omega appears as o;
  • ei, oi appear as i;
  • ai appears as e.

Visual-based Greeklish

  • omega (Ω or ω) may appear as W or w;
  • epsilon (E) may appear as "3";
  • alpha (A) may appear as "4";
  • theta (Θ) may appear as "8";
  • upsilon (Y) may appear as "\|/";
  • More than one character may be used for one letter.
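
The digit substitutions listed above suggest a simple detector: look for words that mix Latin letters with the digits 3, 4 or 8. The sketch below is a heuristic under that assumption; the names VISUAL_DIGITS and visual_greeklish_tokens are made up for this example, and ordinary tokens such as "mp3" produce false positives.

```python
import re

# Digit substitutes from the list above and the Greek letters they stand for.
VISUAL_DIGITS = {"3": "ε", "4": "α", "8": "θ"}

_DIGITS = "".join(VISUAL_DIGITS)

# A word that contains at least one Latin letter and at least one substitute digit.
MIXED_TOKEN = re.compile(
    rf"\b(?=\w*[A-Za-z])(?=\w*[{_DIGITS}])\w+\b"
)


def visual_greeklish_tokens(text):
    """Return the words that mix Latin letters with the digits 3, 4 or 8."""
    return MIXED_TOKEN.findall(text)
```

Several such tokens in one text, together with "w" used where a vowel is expected, make visual-based Greeklish a plausible guess.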

Messed-up (Mixed) Greeklish

  • words "kai", "eine";
  • combines principles of phonetic, visual-based and orthographic Greeklish according to the writer's idiosyncrasy;
  • The most commonly used form of Greeklish.

This article is licensed under the GNU Free Documentation License.
It uses material from the Wikipedia article of the same name which can be found here