Mathematical Problems in Engineering, Volume 2019, 2019-01-10
A Novel Deep Learning Method for Obtaining Bilingual Corpus from Multilingual Website
Research Article
ShaoLin Zhu 1,2,3; Xiao Li 1,2; YaTing Yang 1,2; Lei Wang 1,2; ChengGang Mi 1,2
DOI:10.1155/2019/7495436
Received 2018-04-03; accepted for publication 2018-12-10; published 2018-12-10
Abstract

Machine translation requires a large number of parallel sentence pairs to achieve good translation performance. However, the lack of parallel corpora severely limits machine translation for low-resource language pairs. We propose a novel method that combines continuous word embeddings with deep learning to obtain parallel sentences. Because parallel sentences are highly valuable for low-resource language pairs, we introduce a cross-lingual semantic representation to induce bilingual signals. Our experiments show that the method achieves promising results for low-resource languages without relying on external resources. Finally, we construct a state-of-the-art machine translation system for a low-resource language pair.
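
To make the idea of scoring candidate sentence pairs in a shared embedding space concrete, here is a minimal sketch. It is not the authors' model: the deep learning component is replaced by a plain cosine-similarity threshold over averaged word vectors, and the toy embedding table, the English-Spanish example words, the function names, and the `threshold` value are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy cross-lingual embeddings: words from both languages are
# assumed to already live in one shared 3-dimensional space.
EMBEDDINGS = {
    "cat": [0.90, 0.10, 0.00],
    "gato": [0.88, 0.12, 0.02],
    "sleeps": [0.10, 0.90, 0.10],
    "duerme": [0.12, 0.85, 0.08],
    "rain": [0.00, 0.10, 0.95],
    "lluvia": [0.02, 0.08, 0.90],
}


def sentence_vector(tokens):
    """Average the embeddings of known tokens; return None if none are known."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_pairs(source_sentences, target_sentences, threshold=0.9):
    """For each source sentence, keep its best-scoring target sentence
    if the cosine similarity reaches the (assumed) threshold."""
    pairs = []
    for src in source_sentences:
        src_vec = sentence_vector(src.lower().split())
        if src_vec is None:
            continue
        best_score, best_tgt = -1.0, None
        for tgt in target_sentences:
            tgt_vec = sentence_vector(tgt.lower().split())
            if tgt_vec is None:
                continue
            score = cosine(src_vec, tgt_vec)
            if score > best_score:
                best_score, best_tgt = score, tgt
        if best_tgt is not None and best_score >= threshold:
            pairs.append((src, best_tgt, best_score))
    return pairs


if __name__ == "__main__":
    for src, tgt, score in extract_pairs(
        ["cat sleeps", "rain"], ["lluvia", "gato duerme"]
    ):
        print(f"{src!r} <-> {tgt!r} (cosine {score:.3f})")
```

A learned classifier over richer sentence representations, as the abstract describes, would take the place of the fixed threshold in this sketch; the averaging step is only a stand-in for a cross-lingual semantic representation.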

License

Copyright © 2019 ShaoLin Zhu et al.
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Corresponding author

YaTing Yang. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi, China, cas.cn; Key Laboratory of Speech Language Information Processing of Xinjiang, Urumqi, China. yangyt@ms.xjb.ac.cn

Recommended citation

ShaoLin Zhu, Xiao Li, YaTing Yang, Lei Wang, ChengGang Mi. A Novel Deep Learning Method for Obtaining Bilingual Corpus from Multilingual Website. Mathematical Problems in Engineering, Vol. 2019 (2019).
