Large language models (LLMs) are all the rage these days, with new ones constantly emerging. Most of these linguistic behemoths, such as ChatGPT and Bard, are trained on text collected from websites, papers, books, and anything else that contains language. As a result, what they produce is a mishmash of everything humans have written.
But what if an LLM were trained on the dark web instead of the familiar web we all know? Researchers at the Korea Advanced Institute of Science and Technology (KAIST), together with data intelligence company S2W, have done just that.
The collaboration has produced a unique AI language model, dubbed DarkBERT, meant to help fight cybercrime.
What is DarkBERT?
Unlike traditional LLMs trained on data from the open internet, DarkBERT was trained on datasets taken exclusively from the dark web. The team used the Tor network, a gateway to the hidden parts of the web, to access this clandestine ecosystem and collect extensive raw data for training.
To address the ethical concerns entangled with dark web content, the team didn’t use the data as is. Instead, the researchers applied a series of measures, including deduplication, data filtering, and pre-processing, to ensure the responsible handling of the sensitive information commonly found on the dark web.
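To make the deduplication and filtering steps more concrete, here is a minimal Python sketch. It is not the researchers' actual pipeline: the function names, the hash-based comparison, and the `min_words` threshold are all assumptions chosen for illustration.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(pages):
    """Keep the first occurrence of each page, compared by a hash of its normalized text."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(normalize(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def filter_pages(pages, min_words=5):
    """Drop pages with too little text to be useful (hypothetical threshold)."""
    return [page for page in pages if len(page.split()) >= min_words]


def preprocess(pages, min_words=5):
    """Deduplicate first, then filter out near-empty pages."""
    return filter_pages(deduplicate(pages), min_words)
```

Calling `preprocess` on a list of crawled page texts returns the pages that survive both steps, in their original order. A real pipeline would likely add language detection and category balancing, which this sketch leaves out.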
During training, DarkBERT was fed two distinct sets of data over a span of 16 days. The pre-processed dataset underwent careful redaction to obscure details such as the names of victim organizations, leaked data specifics, threatening statements, and illegal imagery. Among the collected data, over a thousand pages were categorized as adult entertainment, which reflects the diverse nature of the content found on the dark web.
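Redaction of this kind is often done by replacing sensitive identifiers with placeholder tokens. The Python sketch below illustrates the idea only; the patterns and the `[EMAIL]`, `[ONION]`, and `[IP]` placeholders are assumptions, not the rules the DarkBERT team used.

```python
import re

# Hypothetical patterns and placeholder tokens, chosen for illustration only.
PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b[a-z2-7]{16,56}\.onion\b"), "[ONION]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
]


def redact(text):
    """Replace each matched identifier with its placeholder token."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text
```

For example, `redact("Contact admin@example.com or 10.0.0.1")` returns `"Contact [EMAIL] or [IP]"`. Masking names of victim organizations would need more than regular expressions, for instance a named-entity recognizer, which is beyond this sketch.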
Where does DarkBERT come from?
The new AI tool is built on RoBERTa, a language model architecture that Facebook researchers introduced in 2019. However, while RoBERTa excelled at predicting obscured sections of text in unannotated language samples, DarkBERT carries that ability beyond the conventional web.
As highlighted in the team’s paper, DarkBERT is notably capable of making sense of dark web text, including the complex nuances typical of this hidden digital landscape.
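The "predicting obscured text" objective is known as masked language modeling: some tokens in a sentence are hidden, and the model learns to recover them. The sketch below shows only how such training examples can be built. It is a simplification: the function name and the whitespace tokenization in the usage note are assumptions, and real BERT-style training also sometimes substitutes a random token or leaves the chosen token unchanged.

```python
import random

MASK = "<mask>"  # RoBERTa's mask token


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide each token with probability mask_prob.

    Returns (masked, labels): labels holds the original token at masked
    positions and None elsewhere, so a model is only scored on hidden tokens.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels
```

Given `tokens = "stolen credentials are sold on hidden forums".split()`, calling `mask_tokens(tokens, seed=0)` returns the token list with some entries replaced by `<mask>`, plus the labels a model would be trained to predict at those positions.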
What’s the purpose of DarkBERT?
Despite its seemingly ominous name, DarkBERT is not meant to be used for any evil plans. Rather, it is designed for security and law enforcement applications.
Because it was trained on the dark web, a network of dubious websites where enormous datasets of stolen passwords are frequently discovered, DarkBERT performs better in cybersecurity and cyber threat intelligence (CTI) applications than previous language models. The creators of the model have shown how it can be used to find ransomware leak sites.
Hackers and ransomware groups frequently upload data to the dark web with the intention of selling it. The data typically includes sensitive information such as passwords and bank details. According to the study, DarkBERT can help security investigators automatically recognize these websites. It can also be used to trawl through the vast number of dark web forums and watch for any sharing of illegal material.
DarkBERT may perform “dark web domain-specific tasks” better than other models. However, the researchers recognize that some tasks may need additional fine-tuning because of the lack of publicly available data on the dark web.
Still, DarkBERT points to a future in which training AI models on extremely niche data allows them to be tailored to particular tasks. Unlike the multipurpose ChatGPT and Google Bard, DarkBERT is a specialized tool for thwarting hackers and other malicious actors on the web.
DarkBERT is still in development. Due to the tricky and potentially risky nature of dark web data, the model will not be available for public use anytime soon. However, access can be requested for academic and research purposes.
The dark web is often associated with illicit activities, such as the sale of drugs, weapons, stolen data, and other illegal goods and services. It provides a platform for various criminal activities, including hacking, cyberattacks, and the exchange of sensitive information.