Generation of a Synthetic Dataset for the Study of Fraud through Deep Learning Techniques

Marco Sánchez, Verónica Olmedo, Carlos Narvaez, Myriam Hernández, Luis Urquiza-Aguiar


Fraud is any deliberate act that uses cunning, deception, or other unfair means to deprive someone of property or money. Fraud-related activities are growing rapidly and cause substantial economic losses every year. Analyzing this phenomenon adequately requires data that evidences fraudulent behavior. However, because such data are scarce and difficult to obtain, generating synthetic data is a viable alternative. We designed two text-generation algorithms to create a synthetic dataset that supports fraud analysis. Both algorithms rely on the Fraud Triangle Theory proposed by Donald R. Cressey; the first uses a Recurrent Neural Network (RNN) and the second a Long Short-Term Memory (LSTM) network. The generated datasets were analyzed from a semantic point of view and scored for readability and grammatical consistency. The evaluation indicates that the proposed data generation architecture performs better in sentence readability with the LSTM algorithm (efficiency greater than 70%) than with the RNN (less than 40%). With LSTM, it was possible to synthesize a comprehensive dataset related to the vertices of the fraud triangle, which will make it easier to investigate fraudulent actions linked to human behavior. In future work, we will present a fraud prediction system based on machine learning techniques.


Fraud triangle theory; machine learning; deep learning; LSTM; RNN.



References

"ACFE Association of Certified Fraud Examiners Capítulo España," Available:, accessed 06-11-2019.

D. Cressey, Other people's money. Montclair, NJ: Patterson Smith, 1973.

S. Y. J. Guan, R. Li and X. Zhang, "A method for generating synthetic electronic medical record text," IEEE/ACM Transactions on Computational Biology and Bioinformatics. Available: 10.1109/tcbb.2019.2948985, accessed 06-11-2019.

R. DeMillo and A. Offutt, "Constraint-based automatic test data generation," IEEE Transactions on Software Engineering, vol. 17, no. 9, pp. 900-910, 1991. Available: 10.1109/32.92910, accessed 06-11-2019.

P. T. M. Mann, O. P. Sangwan and S. Singh, "Automatic goal-oriented test data generation using a genetic algorithm and simulated annealing," in International Conference-Cloud System and Big Data Engineering Confluence. IEEE, 1 2016.

S. Rani and B. Suri, "An approach for test data generation based on genetic algorithm and delete mutation operators," in International Conference on Advances in Computing and Communication Engineering. IEEE, 2015.

T. L. G. Albuquerque and M. Magnor, "Synthetic generation of high-dimensional datasets," IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 12.

P. R. Bing Wang and K. Mueller, "SketchPadN-D: WYDIWYG sculpting and editing in high-dimensional space," IEEE Transactions on Visualization and Computer Graphics, vol. 19, no. 12.

E. W. B. C. Kwon, H. Kim and A. E. J. Choo, H. Park, "AxiSketcher: Interactive nonlinear axis mapping of visualizations through user drawings," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 221–230, 1 2017.

Y. Y. T. R. Liu, B. Fang and P. P. Chan, "Synthetic data generator for classification rules learning," in International Conference on Cloud Computing and Big Data (CCBD). IEEE, 11 2016, pp. 357–361.

B. S. P. Lin, D. J. A. Cipolone, C. R. S. Cox, and R. X. D. Holt, "Development of a synthetic dataset generator for building and testing information discovery systems," in International Conference on Information Technology: New Generations (ITNG). IEEE, 2006, pp. 707–712.

P. L. D. Jeske, R. X. C. Rendon, and B. Samadi, "Synthetic data generation capabilities for testing data mining tools," in MILCOM. IEEE, 2006, pp. 1–6.

M. A. A. M. Pasinato, C. E. Mello and G. Zimbrao, "Generating synthetic data for context-aware recommender systems," in 2013 BRICS Congress on Computational Intelligence and 11th Brazilian Congress on Computational Intelligence. IEEE, 9 2013, pp. 563–567.

D. Garcia and M. Millan, "A prototype of synthetic data generator," in Colombian Computing Congress (CCC). IEEE, 5 2011, pp. 1–6.

M. K. F. Brodkorb, A. Kuijper, and T. V. Landesberger, "A modular rule-based visual interactive creation of tree-shaped geo-located networks," in International Conference on Signal-Image Technology & Internet-Based Systems (SITIS). IEEE, 2016, pp. 397–403.

X. Ying and X. Wu, "Graph generation with prescribed feature constraints," in SIAM International Conference on Data Mining. Philadelphia, PA: Society for Industrial and Applied Mathematics, 4 2009, pp. 966–977.

Y. L. Can Yang, Sixuan Ren and G. H. Houwei Cao, Qihu Yuan, "Personalized channel recommendation deep learning from a switch sequence," IEEE Journals & Magazine. Available:, accessed 07-07-2020.

X. Z. Brian C. Hosler, C. C. Owen Mayer, and M. C. S. James A. Shackleford, "The video authentication and camera identification database: A new database for video forensics," IEEE Journals & Magazine. Available:, accessed 07-07-2020.

Y. Z. Hangxia Zhou, K. Y. Lingfan Yang, Qian Liu, and Y. Du, "Short-term photovoltaic power forecasting based on long short-term memory neural network and attention mechanism," IEEE Journals & Magazine. Available:, accessed 07-07-2020.

M. A. Hamdi Altaheri and G. Muhammad, "Date fruit classification for robotic harvesting in a natural environment using deep learning," IEEE Journals & Magazine. Available:, accessed 07-07-2020.

S. W. S. Ahmadreza Argha, Ji Wu and B. G. Celler, "Blood pressure estimation from beat-by-beat time-domain features of oscillometric waveforms using deep-neural-network classification models," IEEE Journals & Magazine. Available:, accessed 07-07-2020.

C. L. Xun Zhu and D. Ji, "Keyphrase generation with CopyNet and semantic web," IEEE Journals & Magazine. Available:, accessed 07-07-2020.

N. M. Rabiu Abdullahi, "Fraud triangle theory and fraud diamond theory. Understanding the convergent and divergent for future research," Available: 10.6007/IJARAFMS/v5-i4/1823, accessed 06-11-2019.

A. R. Fernández, "Los datos sintéticos, la clave para mejorar la inteligencia artificial," Addison Wesley College, 1997.

P. P. S. Kotsiantis, "Mixture of expert agents for handling imbalanced datasets," Annals of Mathematics, Computing & Teleinformatics, vol. 1, no. 1, pp. 46-55, 2003.

M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, "Applications of deep learning and reinforcement learning to biological data," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, 2018. Available: 10.1109/TNNLS.2018.2790388, accessed 06-11-2019.

D. J. Matich, "Redes neuronales: Conceptos básicos y aplicaciones," Informática Aplicada a la Ingeniería de Procesos – Orientación I.

J. Z. Ruiqin Bai, X. L. Dengao Li, and B. Z. Qiang Wang, "RNN-based demand awareness in smart library using CRFID," IEEE China Communications, vol. 17, no. 5, 2020. Available: 10.23919/JCC.2020.05.021, accessed 11-12-2020.

F. D. D. Ravì, C. Wong, J. A. M. Berthelot, and G. Y. B. Lo, "Deep learning for health informatics," IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 4–21, 2017.

S. D. S. Dhananjay Kumar, "Prediction of depression from EEG signal using long short term memory (LSTM)," IEEE 3rd International Conference on Trends in Electronics and Informatics (ICOEI). Available: 10.1109/ICOEI.2019.8862560, accessed 06-11-2019.

H. K. E. Lundin and E. Jonsson, "A synthetic fraud data generation methodology," accessed 07-07-2020.

AuditNet, "Using key word analysis of an organization's big data for error and fraud detection," url analytics.

Huang, Shaio Yan & Lin, Chi-Chen & Chiu, Ann & Yen, David. (2017). Fraud detection using fraud triangle risk factors. Information Systems Frontiers. 19. 10.1007/s10796-016-9647-9.

Raghunathan, T., Reiter, J. and Rubin, D. (2003). Multiple Imputation for Statistical Disclosure Limitation. Journal of Official Statistics, 19(1), pp.1-16.

Taub, Jennifer & Elliot, Mark & Sakshaug, Joseph. (2020). The Impact of Synthetic Data Generation on Data Utility with Application to the 1991 UK Samples of Anonymised Records. Journal of Transactions on Data Privacy. 13. 1-23.

Golovko, Vladimir & Kroshchanka, A. & Treadwell, D. (2016). The nature of unsupervised learning in deep neural networks: A new understanding and novel approach. Optical Memory and Neural Networks. 25. 127-141. 10.3103/S1060992X16030073.

Zhang, Jianjing & Wang, Peng & Yan, Ruqiang & Gao, Robert. (2018). Long short-term memory for machine remaining life prediction. Journal of Manufacturing Systems. 48. 10.1016/j.jmsy.2018.05.011.




Published by INSIGHT - Indonesian Society for Knowledge and Human Development