language_detection

Detect the language from a given text.
To do that it generates a language profile based on N-grams for every file in etc
directory.
Then it generate such language profile for the unknown text and compare the previosly language profiles against the unknown.
Requirements:
Only requirement is a PHP version greater than or equal to 7.1.
> Note: language_detection requires the Multibyte String extension in order to work.
Install via Composer
composer require patrick-schur/language-detection
Or add the following to composer.json
{
"require": {
"patrick-schur/language-detection": "*"
}
}
Basic Usage
Before we can recognize the language from a given text, we have to generate a language profile for each language.
From the beginning it comes with a pre-trained language profile (etc/_langs.json
).<br>
Also you can add new files to etc
or change existing ones.
First we have to generate a language profile.
require_once 'vendor/autoload.php';
use LanguageDetector\Trainer;
$t = new Trainer;
$t->learn();
If we have our language profile, we can classify texts by their language.
To detect the language correctly, the length of the input text should be at least some sentences.
require_once 'vendor/autoload.php';
use LanguageDetector\LanguageDetector;
$ld = new LanguageDetector;
var_dump($ld->detect('Das ist ein deutscher Satz.')); // de
Supported languages:
It supports up to now 73 languages.
If your language not supported, feel free to add your own language files.
- ab (abkhaz)
- af (afrikaans)
- am (amharic)
- ar (arabic)
- az (azerbaijani)
- be (belarusian)
- bg (bulgarian)
- bn (bengali)
- co (corsican)
- cs (czech)
- cy (welsh)
- de (german)
- dk (danish)
- el (greek)
- en (english)
- eo (esperanto)
- es (spanish)
- et (estonian)
- eu (basque)
- fa (persian)
- fi (finnish)
- fj (fijian)
- fo (faroese)
- fr (french)
- ga (irish)
- gd (scottish)
- gl (galician)
- gn (guarani)
- ha (hausa)
- he (hebrew)
- hi (hindi)
- hr (croatian)
- hu (hungarian)
- hy (armenian)
- ia (interlingua)
- ig (igbo)
- io (ido)
- is (icelandic)
- it (italian)
- iu (inuktitut)
- jp (japanese)
- jv (javanese)
- ka (georgian)
- ko (korean)
- ku (kurdish)
- la (latin)
- lg (ganda)
- lo (lao)
- lt (lithuanian)
- lv (latvian)
- mh (marshallese)
- mn (mongolian)
- ms (malay)
- mt (maltese)
- nl (dutch)
- no (norwegian)
- nv (navajo)
- pl (polish)
- pt (portuguese)
- ro (romanian)
- ru (russian)
- sk (slovak)
- sl (slovene)
- so (somali)
- sv (swedish)
- th (thai)
- tr (turkish)
- ty (tahitian)
- ug (uyghur)
- uk (ukrainian)
- uz (uzbek)
- vi (vietnamese)
- zh (chinese)