The EMILLE Corpus (Beta Release Version)
In: http://ota.ox.ac.uk/headers/2460.xml
The collection consists of: thirty million words of monolingual written data from news website articles (Gujarati, Tamil, Hindi, Punjabi); 600,000 words of monolingual spoken data from radio broadcasts (Hindi, Urdu, Punjabi, Bengali, Gujarati); and 120,000 words of parallel data in each of English, Hindi, Urdu, Punjabi, Bengali and Gujarati, drawn from U.K. government leaflets.