Building Compact Entity Embeddings Using Wikidata

Mohamed Lubani; Shahrul Azman Mohd Noah

doi:10.18517/ijaseit.8.4-2.6831

Abstract

Representing natural language sentences has always been a challenge in statistical language modelling. Atomic discrete representations of words make it difficult to represent semantically related sentences. Other sentence components, such as phrases and named-entities, should be recognized and represented as units rather than as individual words. Different senses of an entity should be assigned different representations even when they share identical words. In this paper, we focus on building the vector representations (embeddings) of named-entities from their contexts to facilitate the task of ontology population, where named-entities need to be recognized and disambiguated in natural language text. Given a list of target named-entities, Wikidata is used to compensate for the lack of a labelled corpus by building the contexts of all target named-entities and of all their senses. Description text and semantic relations with other named-entities are considered when building the contexts from Wikidata. To avoid noisy and uninformative features in the embeddings generated from artificially built contexts, we propose a method for building compact entity representations that sharpens entity embeddings by removing irrelevant features and emphasizing the most descriptive ones. An extended version of the Continuous Bag-of-Words (CBOW) model is used to build the joint vector representations of words and named-entities from the Wikidata contexts. Each entity context is then represented by a subset of elements chosen to maximize the chances of keeping the most descriptive features of the target entity. The final entity representations are built by compressing the embeddings of the chosen subset using a deep stacked autoencoder model. Cosine similarity and the t-SNE visualisation technique are used to evaluate the final entity vectors. Results show that semantically related entities are clustered near each other in the vector space. Entities that appear in similar contexts are assigned similar compact vector representations.
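
The cosine-similarity evaluation mentioned in the abstract can be illustrated with a minimal sketch. The entity names, the three-dimensional vectors, and the helper functions below are hypothetical and chosen only for illustration; they are not taken from the paper, whose vectors come from the CBOW and autoencoder pipeline described above.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_entities(target, vectors, k=2):
    """Rank all other entities by cosine similarity to the target entity."""
    scores = [
        (name, cosine_similarity(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    # Highest similarity first; keep the top k neighbours.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy vectors (assumption): two senses of "Paris" and one related city.
entity_vectors = {
    "Paris (city)": np.array([0.9, 0.1, 0.0]),
    "London (city)": np.array([0.8, 0.2, 0.1]),
    "Paris (mythology)": np.array([0.1, 0.9, 0.2]),
}

neighbours = nearest_entities("Paris (city)", entity_vectors, k=2)
```

With these toy values, the city sense of "Paris" ranks "London (city)" above "Paris (mythology)", which is the kind of sense separation the abstract describes: entities that share a surface form but occur in different contexts should end up far apart.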

Keywords

Entity Embeddings; Entity Vector Representations; Named Entity Disambiguation.

Published by INSIGHT - Indonesian Society for Knowledge and Human Development