Skip to content

Focused on gathering and analyzing security vulnerabilities across various languages. Aimed at being a large dataset for security vulnerabilities from many data sources to eventually feed a model to.

License

Notifications You must be signed in to change notification settings

Rubix982/VulnData

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

VulnData - A Security Vulnerability Dataset Collector

VulnData is a project aimed at gathering an extensive dataset of security vulnerabilities across various programming languages. This project will scrape data from various sources, consolidate it, and present it in a unified format to train a language model that can help identify potential security flaws in code. The goal is to build a comprehensive resource for security professionals, researchers, and developers who want to understand the vulnerabilities in their codebase.

Project Overview

The primary objective of VulnData is to:

  • Scrape data: Collect security vulnerability data from various trusted sources.
  • Data consolidation: Merge, process, and format the collected data into a structured and usable form.
  • Vulnerability mapping: Categorize vulnerabilities by programming language, vulnerability type, and common exploitation methods.
  • Training a model: Use the data to train a model to identify patterns and generate feedback for developers to improve the security of their code.

Core Features

  • Web Scraping: Automates the process of scraping vulnerability data from multiple sources, including CVE databases, blog posts, and security advisories.
  • Data Collection: Collects security-related information and maps vulnerabilities to their corresponding programming languages and severity levels.
  • Vulnerability Classification: Classifies vulnerabilities based on their type (e.g., SQL injection, buffer overflow, XSS) and the severity of the issue.
  • Dataset Formatting: Consolidates the scraped data into a structured, machine-readable format (JSON, CSV, etc.) for easy model training.
  • Security Model Training: After collecting sufficient data, this project will leverage the dataset to train a machine learning model focused on identifying and categorizing code vulnerabilities.

Installation

Follow the steps below to install and run VulnData locally.

Prerequisites

  • Rust

Clone the repository

git clone https://github.com/yourusername/vulndata.git
cd vulndata

Build the Project

To install dependencies and build the project, follow the instructions for your chosen language.

For Rust:

cargo build --release

Running the Project

Once built, run the project with the following command:

./vulndata --scrape

This will begin scraping data, consolidating it into the specified format, and initiate training the model (once sufficient data has been collected).

Project Structure

  • /scrapers: Contains code for scraping data from various security sources (e.g., CVE databases, blog posts, security advisories).
  • /data: The collected raw data before processing.
  • /models: Scripts for training and saving machine learning models.
  • /scripts: Utility scripts for cleaning and processing the data.
  • /output: Processed datasets in structured formats (JSON, CSV).
  • /docs: Documentation on data sources, security classifications, and model training procedures.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Future Plans

  • Additional Language Support: Extend the dataset to include vulnerabilities in more programming languages.
  • More Advanced Models: Train a more advanced model to identify vulnerabilities directly in code and suggest fixes.
  • Web Interface: Build a user-friendly web interface to query the vulnerability dataset and get real-time analysis on code snippets.

About

Focused on gathering and analyzing security vulnerabilities across various languages. Aimed at being a large dataset for security vulnerabilities from many data sources to eventually feed a model to.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published