This preprint has been submitted for publication in a journal
Preprint / Version 1

Feature Extraction in Topic Modeling Using the Latent Dirichlet Allocation Method in Data Leak Events


Fitur Ekstraksi pada Pemodelan Topik Menggunakan Metode Latent Dirichlet Allocation pada Peristiwa Kebocoran Data

DOI:

https://doi.org/10.21070/ups.3537

Keywords:

Latent Dirichlet Allocation, Bag of Words, TF-IDF, Data Leak

Abstract

This research aims to identify the better feature extraction method for topic modeling of Twitter data about personal data leaks. The issue became a trending topic after the hacker Bjorka spread sensitive data belonging to Indonesian citizens, such as NIK (national identity numbers) and SIM card data. Topic modeling was carried out with the Latent Dirichlet Allocation (LDA) method using two feature extraction methods, Bag of Words (BoW) and TF-IDF, on a dataset of 11,067 tweets from the Twitter platform. With BoW features, the best coherence score was 0.47 with 3 main topics related to the data leak: Kominfo protecting personal data, Johnny G Plate being responsible for the data leak case caused by the hacker Bjorka, and protecting people's personal data through the PDP bill. With TF-IDF features, the best coherence score was also 0.47, with 5 main topics.
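
To make the difference between the two feature extraction methods named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. The toy documents are hypothetical and stand in for preprocessed tweets, and the TF-IDF variant used here (term frequency divided by document length, times the natural log of N over document frequency) is an assumption; common libraries apply different weighting and normalization defaults.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents (illustrative only, not the study's data).
docs = [
    ["data", "leak", "hacker", "data"],
    ["personal", "data", "protection", "bill"],
    ["hacker", "leak", "ministry"],
]


def bag_of_words(tokens):
    """Return raw term counts for one document."""
    return dict(Counter(tokens))


def tf_idf(documents):
    """Return TF-IDF weights per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in documents for term in set(tokens))
    weighted = []
    for tokens in documents:
        counts = Counter(tokens)
        weighted.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted


bow = [bag_of_words(tokens) for tokens in docs]
tfidf = tf_idf(docs)

for i, (b, t) in enumerate(zip(bow, tfidf)):
    print(f"doc {i} BoW:   ", b)
    print(f"doc {i} TF-IDF:", {k: round(v, 3) for k, v in t.items()})
```

Under BoW, every occurrence of a term counts equally, so a frequent word such as "data" dominates. Under TF-IDF, terms that appear in many documents are down-weighted, which changes the input an LDA model sees and can therefore change the number and composition of the topics it finds.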

Posted

2023-10-17