Dataset package#
This package exposes Datasets of various Samples, both primary (Common Criteria, FIPS) and auxiliary (CVEs, CPEs, …).
This documentation does not provide a full API reference for all members of the dataset package. Instead, it concentrates on the datasets that are immediately exposed to users, namely CCDataset, FIPSDataset, and their abstract base class Dataset.
Tip
The examples related to this package can be found in the common criteria notebook and the fips notebook.
CCDataset#
- class sec_certs.dataset.dataset.Dataset(certs={}, root_dir=PosixPath('/this/is/dummy/nonexisting/path'), name=None, description='', state=None, auxiliary_datasets=None)#
Base class for datasets of certificates from the CC and FIPS 140 schemes. Lays out the public functions, the processing pipeline, and common operations on the dataset and its certs.
- class DatasetInternalState(meta_sources_parsed: 'bool' = False, artifacts_downloaded: 'bool' = False, pdfs_converted: 'bool' = False, auxiliary_datasets_processed: 'bool' = False, certs_analyzed: 'bool' = False)#
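The flags of DatasetInternalState can be read as a checklist of processing stages. The sketch below is only an illustration: InternalStateSketch is a local stand-in that mirrors the documented field names (it is not the library class), and pending_stages is a hypothetical helper. The stage ordering is copied from the order of the constructor arguments, which is an assumption about how the pipeline progresses.

```python
from dataclasses import dataclass
from typing import List

# Flag names copied from the documented DatasetInternalState signature.
# Assumption: their order reflects the order of the processing stages.
STAGES = (
    "meta_sources_parsed",
    "artifacts_downloaded",
    "pdfs_converted",
    "auxiliary_datasets_processed",
    "certs_analyzed",
)


@dataclass
class InternalStateSketch:
    """Local stand-in with the same boolean fields as DatasetInternalState."""

    meta_sources_parsed: bool = False
    artifacts_downloaded: bool = False
    pdfs_converted: bool = False
    auxiliary_datasets_processed: bool = False
    certs_analyzed: bool = False


def pending_stages(state) -> List[str]:
    """Return the names of the stage flags that are still False."""
    return [name for name in STAGES if not getattr(state, name)]
```

Because the helper only reads attributes by name, it should work on any object that exposes these five flags, for example the state of a loaded dataset.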
- analyze_certificates()#
- Does two things:
Extracts data from certificates (keywords, etc.)
Computes various heuristics on the certificates.
- property auxiliary_datasets_dir#
Path to directory with auxiliary datasets.
- property certs_dir#
Returns directory that holds files associated with certificates
- compute_cpe_heuristics()#
Computes matching CPEs for the certificates.
Computes CVEs for the certificates, given their CPE matches.
- convert_all_pdfs(fresh=True)#
Converts all pdf artifacts to txt, given the certification scheme.
- copy_dataset(new_root_dir)#
Copies all dataset files to new_root_dir and adjusts all paths internally. Keeps the artifacts in the original location.
- Parameters:
new_root_dir (str | Path) – path to the directory where the new dataset shall be stored
- download_all_artifacts(fresh=True)#
Downloads all artifacts related to certification in the given scheme.
- enrich_automated_cpes_with_manual_labels()#
Prior to CVE matching, it is wise to expand the database of automatic CPE matches with those that were manually assigned.
- classmethod from_json(input_path, is_compressed=False)#
Loads a ComplexSerializableType from JSON.
- Parameters:
input_path (str | Path) – path to load the file from
is_compressed (bool) – if True, will decompress .gz first, defaults to False
- Returns:
T – the deserialized object
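A minimal sketch of calling from_json with the documented is_compressed flag. The helper load_dataset is hypothetical, and deciding on compression from a ".gz" file suffix is an assumption of this example, not documented library behaviour.

```python
from pathlib import Path


def load_dataset(dataset_cls, input_path):
    """Call dataset_cls.from_json, enabling decompression for *.gz files."""
    input_path = Path(input_path)
    # Assumption: compressed dumps are recognisable by a ".gz" suffix.
    is_compressed = input_path.suffix == ".gz"
    return dataset_cls.from_json(input_path, is_compressed=is_compressed)
```

With the package installed, a call could look like `load_dataset(CCDataset, "dataset.json")`, where the file name is a placeholder.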
- classmethod from_web(archive_url, snapshot_url, progress_bar_desc, path=None, auxiliary_datasets=False, artifacts=False)#
Fetches the fresh dataset snapshot from sec-certs.org.
Optionally stores it at the given path (a directory) and also downloads auxiliary datasets and artifacts (PDFs).
Note
Note that including the auxiliary datasets adds several gigabytes and including artifacts adds tens of gigabytes.
- Parameters:
archive_url – The URL of the full dataset archive.
snapshot_url – The URL of the full dataset snapshot.
progress_bar_desc – Description of the download progress bar.
path – Path to a directory where to store the dataset, or None if it should not be stored.
auxiliary_datasets – Whether to also download auxiliary datasets (CVE, CPE, CPEMatch datasets).
artifacts – Whether to also download artifacts (i.e. PDFs).
- get_keywords_df(var)#
Get dataframe of keyword hits for attribute (var) that is member of PdfData class.
- move_dataset(new_root_dir)#
Moves all dataset files to new_root_dir and adjusts all paths internally. Deletes the artifacts from the original location.
- Parameters:
new_root_dir (str | Path) – path to the directory where the new dataset shall be stored
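The difference between copy_dataset and move_dataset is whether the artifacts remain at the original location. The hypothetical wrapper below makes that choice explicit with a keep_original flag; it only dispatches to the two documented methods and adds no behaviour of its own.

```python
def relocate(dset, new_root_dir, keep_original=True):
    """Relocate a dataset, keeping or deleting the files at the old location."""
    if keep_original:
        # Artifacts stay in the original location.
        dset.copy_dataset(new_root_dir)
    else:
        # Artifacts are deleted from the original location.
        dset.move_dataset(new_root_dir)
```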
- abstract process_auxiliary_datasets(download_fresh=False)#
Processes all auxiliary datasets (CPE, CVE, …) that are required during computation.
- property root_dir#
Directory that will hold the serialized dataset files.
- update_with_certs(certs)#
Enriches the dataset with certs.
- Parameters:
certs (List[Certificate]) – new certs to include in the dataset
- property web_dir#
Path to certification-artifacts posted on web.
- class sec_certs.dataset.CCDataset(certs={}, root_dir=PosixPath('/this/is/dummy/nonexisting/path'), name=None, description='', state=None, auxiliary_datasets=None)#
Class that holds CCCertificate objects. Serializable into json, pandas, dictionary. Provides basic certificate manipulations and dataset transformations. Contains many private methods that perform internal operations; feel free to use them.
- property active_csv_tuples#
Returns List[Tuple[str, Path]], where the first element is the name of a CSV file and the second element is its Path. The files correspond to the CSV files downloaded from the CC website that list all active certificates.
- property active_html_tuples#
Returns List[Tuple[str, Path]], where the first element is the name of an HTML file and the second element is its Path. The files correspond to the HTML files parsed from the CC website that list all active certificates.
- property archived_csv_tuples#
Returns List[Tuple[str, Path]], where the first element is the name of a CSV file and the second element is its Path. The files correspond to the CSV files downloaded from the CC website that list all archived certificates.
- property archived_html_tuples#
Returns List[Tuple[str, Path]], where the first element is the name of an HTML file and the second element is its Path. The files correspond to the HTML files parsed from the CC website that list all archived certificates.
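Since the four *_tuples properties share the same (name, path) shape, they can be processed uniformly. The sketch below shows a hypothetical helper, missing_files, that reports which of the listed files are not present on disk; it relies only on the documented tuple layout.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def missing_files(name_path_tuples: Iterable[Tuple[str, Path]]) -> List[str]:
    """Return the names whose associated path does not exist on disk."""
    return [name for name, path in name_path_tuples if not Path(path).exists()]
```

Assuming dset is a CCDataset, a call such as `missing_files(dset.active_csv_tuples + dset.archived_csv_tuples)` would list the CSV files that have not been downloaded yet.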
- property certificates_dir#
Returns directory that holds files associated with the certificates
- property certificates_pdf_dir#
Returns directory that holds PDFs associated with certificates
- property certificates_txt_dir#
Returns directory that holds TXTs associated with certificates
- classmethod from_web_latest(path=None, auxiliary_datasets=False, artifacts=False)#
Fetches the fresh snapshot of CCDataset from sec-certs.org.
Optionally stores it at the given path (a directory) and also downloads auxiliary datasets and artifacts (PDFs).
Note
Note that including the auxiliary datasets adds several gigabytes and including artifacts adds tens of gigabytes.
- Parameters:
path – Path to a directory where to store the dataset, or None if it should not be stored.
auxiliary_datasets – Whether to also download auxiliary datasets (CVE, CPE, CPEMatch datasets).
artifacts – Whether to also download artifacts (i.e. PDFs).
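A minimal sketch of fetching a snapshot without the heavy extras mentioned in the note. The helper fetch_light_snapshot is hypothetical; it only forwards the three documented parameters of from_web_latest, so it should apply to any dataset class that offers this classmethod (CCDataset and FIPSDataset in this package).

```python
def fetch_light_snapshot(dataset_cls, path=None):
    """Fetch the latest snapshot, skipping auxiliary datasets and PDF artifacts."""
    return dataset_cls.from_web_latest(
        path=path,
        auxiliary_datasets=False,  # avoids the additional gigabytes of CVE/CPE data
        artifacts=False,  # avoids the tens of gigabytes of PDFs
    )
```

With the package installed and network access available, usage could be `dset = fetch_light_snapshot(CCDataset)` after `from sec_certs.dataset import CCDataset`.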
- get_certs_from_web(to_download=True, keep_metadata=True, get_active=True, get_archived=True)#
Downloads the CSV and HTML files that hold lists of certificates from the Common Criteria website. Parses these files, constructs CCCertificate objects, and fills the dataset with them.
- Parameters:
to_download (bool) – If CSV and HTML files shall be downloaded (or existing files utilized), defaults to True
keep_metadata (bool) – If CSV and HTML files shall be kept on disk after download, defaults to True
get_active (bool) – If active certificates shall be parsed, defaults to True
get_archived (bool) – If archived certificates shall be parsed, defaults to True
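Building a dataset from scratch rather than from a snapshot means chaining the processing methods documented on this page. The sketch below is one possible ordering, taken from the order of the DatasetInternalState flags; this ordering is an assumption, and the library may require or prefer a different one. The function name run_cc_pipeline is hypothetical, and the fresh argument is passed through unchanged because its exact semantics are not described here.

```python
def run_cc_pipeline(dset, fresh=True):
    """Run the documented processing steps on a dataset, in an assumed order."""
    dset.get_certs_from_web()  # parse the certificate lists (meta sources)
    dset.download_all_artifacts(fresh=fresh)  # fetch the PDF artifacts
    dset.convert_all_pdfs(fresh=fresh)  # convert PDFs to TXT
    dset.process_auxiliary_datasets()  # CPE, CVE and CC-specific datasets
    dset.analyze_certificates()  # keyword extraction and heuristics
    return dset
```

Note that the download step transfers a large amount of data, so running this on a full dataset can take a long time.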
- property mu_dataset_dir#
Returns directory that holds dataset of maintenance updates
- property mu_dataset_path#
Returns a path to the dataset of maintenance updates
- property pp_dataset_path#
Returns a path to the dataset of Protection Profiles
- process_auxiliary_datasets(download_fresh=False)#
Processes all auxiliary datasets needed during computation. On top of base-class processing, CC handles protection profiles, maintenance updates and schemes.
- process_maintenance_updates(to_download=True)#
Downloads or loads from json a dataset of maintenance updates. Runs analysis on that dataset if it’s not completed.
- Returns:
CCDatasetMaintenanceUpdates – the resulting dataset of maintenance updates
- process_protection_profiles(to_download=True, keep_metadata=True)#
Downloads new snapshot of dataset with processed protection profiles (if it doesn’t exist) and links PPs with certificates within self. Assigns PPs to all certificates, based on name and fname match.
- Parameters:
to_download (bool) – If dataset should be downloaded or fetched from json, defaults to True
keep_metadata (bool) – If json related to the PP dataset should be kept on drive, defaults to True
- Raises:
RuntimeError – When building of PPDataset fails
- process_schemes(to_download=True, only_schemes=None)#
Downloads or loads from json a dataset of CC scheme data.
- property reports_dir#
Returns directory that holds files associated with certification reports
- property reports_pdf_dir#
Returns directory that holds PDFs associated with certification reports
- property reports_txt_dir#
Returns directory that holds TXTs associated with certification reports
- property scheme_dataset_path#
Returns a path to the scheme dataset
- property targets_dir#
Returns directory that holds files associated with security targets
- property targets_pdf_dir#
Returns directory that holds PDFs associated with security targets
- property targets_txt_dir#
Returns directory that holds TXTs associated with security targets
- to_pandas()#
Return self serialized into pandas DataFrame
FIPSDataset#
- class sec_certs.dataset.FIPSDataset(certs={}, root_dir=PosixPath('/this/is/dummy/nonexisting/path'), name=None, description='', state=None, auxiliary_datasets=None)#
Class for processing of FIPSCertificate samples. Inherits from ComplexSerializableType and base abstract Dataset class.
- classmethod from_web_latest(path=None, auxiliary_datasets=False, artifacts=False)#
Fetches the fresh snapshot of FIPSDataset from sec-certs.org.
Optionally stores it at the given path (a directory) and also downloads auxiliary datasets and artifacts (PDFs).
Note
Note that including the auxiliary datasets adds several gigabytes and including artifacts adds tens of gigabytes.
- Parameters:
path – Path to a directory where to store the dataset, or None if it should not be stored.
auxiliary_datasets – Whether to also download auxiliary datasets (CVE, CPE, CPEMatch datasets).
artifacts – Whether to also download artifacts (i.e. PDFs).
- process_auxiliary_datasets(download_fresh=False)#
Processes all auxiliary datasets (CPE, CVE, …) that are required during computation.