Malicious Web Request Classification using NLP
A classifier for malicious activity in web application firewall log entries.
Key Highlights
- Implemented and trained a custom four-layer BERT model on 45k web application firewall log entries, classifying them into one of nine classes of malicious activities.
- Achieved 85.6% accuracy and 0.90 macro F1-score by using a random forest classifier on top of BERT embeddings.
Objective
The goal of the model is to classify web application firewall log entries into one of nine classes of malicious activities: HOST_Scan, SQL_Injection, Path_Disclosure, Vulnerability_Scan, Leakage_Through_NW, Directory_Indexing, System_Cmd_Execution, Cross_Site_Scripting, and Automatically_Searching_Infor.
Dataset: 45k web application firewall log entries (70% train, 20% validation, 10% test)
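The 70/20/10 split can be produced with two calls to scikit-learn's `train_test_split`; a sketch, where `requests` and `labels` are toy stand-ins for the actual 45k entries:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the 45k log entries and their nine class labels.
requests = [f"GET /item/{i} HTTP/1.1" for i in range(90)]
labels = [i % 9 for i in range(90)]

# First split off 30%, then divide that 30% into validation (20%) and test (10%).
train_X, rest_X, train_y, rest_y = train_test_split(
    requests, labels, test_size=0.30, stratify=labels, random_state=0)
val_X, test_X, val_y, test_y = train_test_split(
    rest_X, rest_y, test_size=1/3, stratify=rest_y, random_state=0)
```

Stratifying on the label keeps the nine class proportions identical across the three splits.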
Input examples:
- HOST_Scan: “GET /boaform/admin/formLogin?username=user&psd=user HTTP/1.0\r\n\r\n”
- Cross_Site_Scripting: “GET /board/board_view?code=%3Cscript%3Eprompt(document.cookie)%3C/script%3E HTTP/1.1\r\nHost: www.college.school\r\nAccept-Encoding: identity\r\nCookie: ID=1094200543; designart_site=lbtbqr99b9n4vr0e2en2p5eoh83idq5i\r\nUser-Agent: python-urllib3/1.26.9\r\n\r\n”
Methodology
The inputs are first pre-processed and tokenized. The resulting tokens are used to train a BERT model, and a random-forest classifier on top of the BERT embeddings performs the final classification.
Pre-processing
The pre-processing includes the following steps:
- convert everything to lowercase
- replace each \r\n with a single space
- decode URL encoding, e.g. convert %40 back to @
- add spaces before and after punctuation and special characters
Before: “GET /board/board_view?code=%3Cscript%3Eprompt(document.cookie)%3C/script%3E HTTP/1.1\r\n”
After: “get / board / board _ view ? code = < script > prompt ( document . cookie ) < / script > http / 1 . 1”
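A minimal Python sketch of these steps, assuming the punctuation rule applies to every non-alphanumeric character (including the underscore):

```python
import re
from urllib.parse import unquote

def preprocess(request: str) -> str:
    """Normalize a raw WAF log entry before tokenization."""
    text = request.lower()                     # 1. lowercase everything
    text = text.replace("\\r\\n", " ")         # 2. CRLF escapes -> single space
    text = unquote(text)                       # 3. decode URL encoding (%3c -> <)
    # 4. put spaces around every non-alphanumeric, non-space character
    text = re.sub(r"([^a-z0-9\s])", r" \1 ", text)
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated spaces
```

For example, `preprocess("GET /a?x=%3Cscript%3E HTTP/1.1\\r\\n")` yields `"get / a ? x = < script > http / 1 . 1"`.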
Tokenization
The pre-processed text is split into subword tokens. Candidate tokenizers: WordPiece (BERT's default) or fastText-style subwords.
BERT Model
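The exact architecture is not spelled out here; with Hugging Face `transformers`, a small four-layer BERT can be instantiated from a custom config (only `num_hidden_layers=4` comes from the text above; the other sizes are illustrative assumptions):

```python
from transformers import BertConfig, BertModel

# Illustrative sizes; only num_hidden_layers=4 is stated in this write-up.
config = BertConfig(
    num_hidden_layers=4,
    hidden_size=256,
    num_attention_heads=4,
    intermediate_size=512,
)
model = BertModel(config)  # randomly initialised, then trained on the log corpus
```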
Random-Forest Classifier
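The second stage fits a random forest on the per-request BERT embeddings. A scikit-learn sketch, with random vectors standing in for the real embeddings (the 256-dimension size and `n_estimators` value are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 256))    # stand-in for BERT embeddings
y_train = rng.integers(0, 9, size=200)   # nine malicious-activity classes

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_train)              # one of the nine class ids per request
```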
Results
- 85.6% accuracy
- 0.90 macro F1-score
t-SNE visualization of the BERT embeddings
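The projection can be reproduced with scikit-learn's `TSNE` (toy embeddings below; in the real plot each point would be coloured by its attack class):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))   # stand-in for the BERT embeddings
proj = TSNE(n_components=2, perplexity=10.0, init="pca",
            random_state=0).fit_transform(emb)
# proj[:, 0], proj[:, 1] give the 2-D coordinates to scatter-plot
```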
Limitations and Future Work
- The main limitation is that the model does not take advantage of any domain knowledge. For example, a log entry containing a <script> tag is likely a cross-site scripting attack.
- Rule-based features engineered from such domain knowledge could be concatenated to the BERT embeddings to improve performance.
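One way to realise this: compute binary rule-based indicators and concatenate them with the embedding vector. The rules and helper names below are hypothetical examples, not part of the trained model:

```python
import numpy as np

# Hypothetical domain-knowledge rules; each maps a raw request to 0/1.
RULES = [
    lambda s: "<script" in s.lower(),       # cross-site scripting hint
    lambda s: "union select" in s.lower(),  # SQL-injection hint
    lambda s: "../" in s,                   # path-traversal hint
]

def rule_features(request: str) -> np.ndarray:
    return np.array([float(rule(request)) for rule in RULES])

def augment(embedding: np.ndarray, request: str) -> np.ndarray:
    """Concatenate the BERT embedding with the rule-based features."""
    return np.concatenate([embedding, rule_features(request)])
```

For instance, `augment(emb, "GET /a?x=<script>alert(1)</script> HTTP/1.1")` appends `[1., 0., 0.]` to the embedding, and the random forest is then trained on the augmented vectors.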