Dataset package#
This package exposes Datasets of various Samples, both primary (Common Criteria, FIPS) and auxiliary (CVEs, CPEs, …).
This documentation doesn’t provide a full API reference for all members of the dataset
package. Instead, it concentrates on the Dataset classes that are immediately exposed to users.
Namely, we focus on CCDataset, FIPSDataset, ProtectionProfileDataset, and their abstract base class Dataset.
Tip
The examples related to this package can be found in the common criteria notebook, the protection profile notebook, and the fips notebook.
Base Dataset#
- class sec_certs.dataset.dataset.Dataset(certs=None, root_dir=None, name=None, description='', state=None, aux_handlers=None)#
Base class for datasets of certificates from the CC and FIPS 140 schemes. Lays out the public functions, the processing pipeline, and common operations on the dataset and its certs.
- class DatasetInternalState(meta_sources_parsed: 'bool' = False, artifacts_downloaded: 'bool' = False, pdfs_converted: 'bool' = False, auxiliary_datasets_processed: 'bool' = False, certs_analyzed: 'bool' = False)#
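The state flags above can be mirrored in a small standalone dataclass to see how a processing script might work out which steps remain. This is a minimal sketch, not the library's code: the field names come from the signature above, while `PipelineState` and `pending_stages` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class PipelineState:
    """Hypothetical stand-in mirroring the flags of Dataset.DatasetInternalState."""

    meta_sources_parsed: bool = False
    artifacts_downloaded: bool = False
    pdfs_converted: bool = False
    auxiliary_datasets_processed: bool = False
    certs_analyzed: bool = False


def pending_stages(state: PipelineState) -> List[str]:
    """Return the names of the flags that are still False, in declaration order."""
    return [f.name for f in fields(state) if not getattr(state, f.name)]


# Two stages are marked as done, so three remain pending.
state = PipelineState(meta_sources_parsed=True, artifacts_downloaded=True)
print(pending_stages(state))
```

Keeping the flags in one object makes it cheap to resume an interrupted run, since a script can skip any stage whose flag is already set.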
- analyze_certificates()#
Does two things:
  - Extracts data from certificates (keywords, etc.)
  - Computes various heuristics on the certificates.
- property auxiliary_datasets_dir#
Path to directory with auxiliary datasets.
- property certs_dir#
Returns directory that holds files associated with certificates
- convert_all_pdfs(fresh=True)#
Converts all pdf artifacts to txt, given the certification scheme.
- copy_dataset(new_root_dir)#
Copies all dataset files to new_root_dir and adjusts all paths internally. Keeps the artifacts from the original location.
- Parameters:
new_root_dir (str | Path) – path to directory where the new dataset shall be stored.
- download_all_artifacts(fresh=True)#
Downloads all artifacts related to certification in the given scheme.
- classmethod from_json(input_path, is_compressed=False)#
Will load ComplexSerializableType from json.
- Parameters:
input_path (str | Path) – path to load the file from
is_compressed (bool) – if True, will decompress .gz first, defaults to False
- Returns:
T – the deserialized object
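To illustrate what the is_compressed flag implies, here is a standalone sketch using only the standard library. It is not the sec_certs implementation: `load_json` is a hypothetical helper, and it assumes that a compressed file is a gzip-compressed JSON document.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Any, Union


def load_json(input_path: Union[str, Path], is_compressed: bool = False) -> Any:
    """Hypothetical helper: read JSON, decompressing gzip first when asked."""
    path = Path(input_path)
    if is_compressed:
        # gzip.open in text mode decompresses transparently while reading.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            return json.load(handle)
    with path.open("r", encoding="utf-8") as handle:
        return json.load(handle)


# Round-trip a tiny document through both code paths.
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "dataset.json"
    packed = Path(tmp) / "dataset.json.gz"
    payload = {"name": "example", "certs": []}
    plain.write_text(json.dumps(payload), encoding="utf-8")
    with gzip.open(packed, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)
    loaded_plain = load_json(plain)
    loaded_packed = load_json(packed, is_compressed=True)
```

Both calls should yield the same Python object; the flag only changes how the bytes are read from disk.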
- classmethod from_web(archive_url=None, snapshot_url=None, progress_bar_desc=None, path=None, auxiliary_datasets=False, artifacts=False)#
Fetches the fresh dataset snapshot from sec-certs.org.
Optionally stores it at the given path (a directory) and also downloads auxiliary datasets and artifacts (PDFs).
Note
Including the auxiliary datasets adds several gigabytes, and including artifacts adds tens of gigabytes.
- Parameters:
archive_url – The URL of the full dataset archive. If None provided, defaults to cls.FULL_ARCHIVE_URL.
snapshot_url – The URL of the full dataset snapshot. If None provided, defaults to cls.SNAPSHOT_URL.
progress_bar_desc – Description of the download progress bar. If None, will pick reasonable default.
path – Path to a directory where to store the dataset, or None if it should not be stored.
auxiliary_datasets – Whether to also download auxiliary datasets (CVE, CPE, CPEMatch datasets).
artifacts – Whether to also download artifacts (i.e. PDFs).
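A typical call might look like the sketch below. It relies only on the signature documented above; the wrapper name `fetch_cc_snapshot` is illustrative and not part of the package. The import is placed inside the function so that defining it triggers no download, and it assumes the sec_certs package is installed.

```python
from pathlib import Path
from typing import Optional, Union


def fetch_cc_snapshot(path: Optional[Union[str, Path]] = None):
    """Illustrative wrapper: fetch the CC dataset snapshot without the heavy extras."""
    # Imported lazily; requires the sec_certs package to be installed.
    from sec_certs.dataset import CCDataset

    # auxiliary_datasets and artifacts stay False to avoid the multi-gigabyte
    # downloads mentioned in the note above.
    return CCDataset.from_web(path=path, auxiliary_datasets=False, artifacts=False)
```

Calling `fetch_cc_snapshot()` with no argument leaves path as None, so per the parameter description the dataset is not stored on disk; passing a directory asks for it to be stored there.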
- get_certs_by_name(name)#
Returns list of certificates that match given name.
- get_keywords_df(var)#
Get dataframe of keyword hits for attribute (var) that is member of PdfData class.
- property is_backed#
Returns whether the dataset is backed by a directory.
- move_dataset(new_root_dir)#
Moves all dataset files to new_root_dir and adjusts all paths internally. Deletes the artifacts from the original location.
- Parameters:
new_root_dir (str | Path) – path to directory where the new dataset shall be stored.
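The difference between copy_dataset and move_dataset can be pictured with plain filesystem operations. The sketch below is a standalone analogy built on shutil, not the library's implementation, and it does not model the internal path adjustments; all directory and file names are made up for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    original = base / "original"
    (original / "certs").mkdir(parents=True)
    (original / "dataset.json").write_text("{}", encoding="utf-8")

    # Copy: the new location is populated and the original stays in place.
    copied = base / "copied"
    shutil.copytree(original, copied)
    copy_keeps_original = original.exists() and (copied / "dataset.json").exists()

    # Move: the new location is populated and the original is gone.
    moved = base / "moved"
    shutil.move(str(original), str(moved))
    move_removes_original = not original.exists() and (moved / "dataset.json").exists()

print(copy_keeps_original, move_removes_original)
```

Copying is the safer choice when the original location is still referenced elsewhere, at the cost of duplicating the artifacts on disk.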
- process_auxiliary_datasets(download_fresh=False, **kwargs)#
Processes all auxiliary datasets (CPE, CVE, …) that are required during computation.
- property root_dir#
Directory that will hold the serialized dataset files.
- update_with_certs(certs)#
Enriches the dataset with certs.
- Parameters:
certs (List[Certificate]) – new certs to include into the dataset.
- property web_dir#
Path to certification-artifacts posted on web.
CCDataset#
- class sec_certs.dataset.CCDataset(certs=None, root_dir=None, name=None, description='', state=None, aux_handlers=None)#
Class that holds sec_certs.sample.cc.CCCertificate samples. Serializable into JSON, pandas, and dictionary. Provides basic certificate manipulations and dataset transformations. Contains many private methods that perform internal operations; feel free to exploit them.
The dataset directory looks like this:
├── auxiliary_datasets
│   ├── cpe_dataset.json
│   ├── cve_dataset.json
│   ├── cpe_match.json
│   ├── cc_scheme.json
│   ├── protection_profiles
│   │   ├── reports
│   │   │   ├── pdf
│   │   │   └── txt
│   │   ├── pps
│   │   │   ├── pdf
│   │   │   └── txt
│   │   └── dataset.json
│   └── maintenances
│       ├── certs
│       │   ├── reports
│       │   │   ├── pdf
│       │   │   └── txt
│       │   └── targets
│       │       ├── pdf
│       │       └── txt
│       └── maintenance_updates.json
├── certs
│   ├── reports
│   │   ├── pdf
│   │   └── txt
│   ├── targets
│   │   ├── pdf
│   │   └── txt
│   └── certificates
│       ├── pdf
│       └── txt
└── dataset.json
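The layout above maps naturally onto pathlib paths. The following sketch assumes that directory properties such as reports_pdf_dir and targets_txt_dir resolve to the correspondingly named folders under certs. That mapping is inferred from the tree rather than taken from the library's code, and `cc_layout` is a hypothetical helper, not part of the package.

```python
from pathlib import Path
from typing import Dict, Union


def cc_layout(root_dir: Union[str, Path]) -> Dict[str, Path]:
    """Hypothetical helper: build the artifact paths implied by the directory tree."""
    root = Path(root_dir)
    certs = root / "certs"
    layout = {
        "dataset_json": root / "dataset.json",
        "auxiliary_datasets_dir": root / "auxiliary_datasets",
        "certs_dir": certs,
    }
    # Each document type has a pdf folder and a txt folder for the converted text.
    for kind in ("reports", "targets", "certificates"):
        layout[f"{kind}_dir"] = certs / kind
        layout[f"{kind}_pdf_dir"] = certs / kind / "pdf"
        layout[f"{kind}_txt_dir"] = certs / kind / "txt"
    return layout


layout = cc_layout("cc_dataset")  # "cc_dataset" is an arbitrary example root
print(layout["reports_pdf_dir"])
```

Deriving every path from a single root is what makes it possible to relocate a dataset by changing only that root.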
- property active_csv_tuples#
Returns a list of Tuple[str, Path], where the first element is the name of the csv file and the second element is its Path. The files correspond to csv files downloaded from the CC website that list all active certificates.
- property active_html_tuples#
Returns a list of Tuple[str, Path], where the first element is the name of the html file and the second element is its Path. The files correspond to html files parsed from the CC website that list all active certificates.
- property archived_csv_tuples#
Returns a list of Tuple[str, Path], where the first element is the name of the csv file and the second element is its Path. The files correspond to csv files downloaded from the CC website that list all archived certificates.
- property archived_html_tuples#
Returns a list of Tuple[str, Path], where the first element is the name of the html file and the second element is its Path. The files correspond to html files parsed from the CC website that list all archived certificates.
- property certificates_dir#
Returns directory that holds files associated with the certificates
- property certificates_pdf_dir#
Returns directory that holds PDFs associated with certificates
- property certificates_txt_dir#
Returns directory that holds TXTs associated with certificates
- get_certs_from_web(to_download=True, keep_metadata=True, get_active=True, get_archived=True)#
Downloads CSV and HTML files that hold lists of certificates from the Common Criteria website. Parses these files, constructs CCCertificate objects, and fills the dataset with them.
- Parameters:
to_download (bool) – If CSV and HTML files shall be downloaded (or existing files utilized), defaults to True
keep_metadata (bool) – If CSV and HTML files shall be kept on disk after download, defaults to True
get_active (bool) – If active certificates shall be parsed, defaults to True
get_archived (bool) – If archived certificates shall be parsed, defaults to True
- process_auxiliary_datasets(download_fresh=False, skip_schemes=False, **kwargs)#
Processes all auxiliary datasets (CPE, CVE, …) that are required during computation.
- property reports_dir#
Returns directory that holds files associated with certification reports
- property reports_pdf_dir#
Returns directory that holds PDFs associated with certification reports
- property reports_txt_dir#
Returns directory that holds TXTs associated with certification reports
- property targets_dir#
Returns directory that holds files associated with security targets
- property targets_pdf_dir#
Returns directory that holds PDFs associated with security targets
- property targets_txt_dir#
Returns directory that holds TXTs associated with security targets
- to_pandas()#
Returns self serialized into a pandas DataFrame.
ProtectionProfileDataset#
- class sec_certs.dataset.ProtectionProfileDataset(certs=None, root_dir=None, name=None, description='', state=None, aux_handlers=None)#
Class for processing sec_certs.sample.protection_profile.ProtectionProfile samples. Inherits from ComplexSerializableType and the abstract base Dataset class.
The dataset directory looks like this:
├── reports
│   ├── pdf
│   └── txt
├── pps
│   ├── pdf
│   └── txt
└── dataset.json
- extract_data()#
Extracts pdf metadata and keywords from converted text documents.
- get_certs_from_web(to_download=True, keep_metadata=True, get_active=True, get_archived=True, get_collaborative=True)#
Fetches list of protection profiles together with metadata from commoncriteriaportal.org
- get_pp_by_pp_link(pp_link)#
Given the URL of a PP pdf, retrieves the ProtectionProfile object in the dataset with that link, if one exists.
- property pps_dir#
Path to actual protection profiles.
- property pps_pdf_dir#
Path to pdfs of protection profiles
- property pps_txt_dir#
Path to txts of protection profiles.
- process_auxiliary_datasets(**kwargs)#
Dummy method to adhere to the Dataset interface. The ProtectionProfile dataset currently has no auxiliary datasets, so this just sets the state auxiliary_datasets_processed = True.
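The pattern described here, an override that does no work and only records completion, can be sketched as follows. The classes are simplified hypothetical stand-ins, not the real Dataset hierarchy.

```python
from abc import ABC, abstractmethod


class State:
    """Simplified stand-in holding just the flag discussed above."""

    def __init__(self) -> None:
        self.auxiliary_datasets_processed = False


class BaseDataset(ABC):
    """Hypothetical abstract base that requires every dataset to implement the step."""

    def __init__(self) -> None:
        self.state = State()

    @abstractmethod
    def process_auxiliary_datasets(self, **kwargs) -> None:
        """Process whatever auxiliary data the concrete dataset needs."""


class NoAuxDataset(BaseDataset):
    """Dataset without auxiliary data: the step only marks itself as done."""

    def process_auxiliary_datasets(self, **kwargs) -> None:
        # Nothing to download or parse; record completion so callers that
        # check the flag see the step as finished.
        self.state.auxiliary_datasets_processed = True


dset = NoAuxDataset()
dset.process_auxiliary_datasets(download_fresh=True)  # extra kwargs are accepted and ignored
print(dset.state.auxiliary_datasets_processed)
```

Accepting `**kwargs` lets callers invoke every dataset type in the same way, even when a given type has nothing to process.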
- property reports_dir#
Path to protection profile reports.
- property reports_pdf_dir#
Path to pdfs of protection profile reports.
- property reports_txt_dir#
Path to txts of protection profile reports.
- property web_dir#
Path to directory with html sources downloaded from commoncriteriaportal.org
FIPSDataset#
- class sec_certs.dataset.FIPSDataset(certs=None, root_dir=None, name=None, description='', state=None, aux_handlers=None)#
Class for processing sec_certs.sample.fips.FIPSCertificate samples. Inherits from ComplexSerializableType and the abstract base Dataset class.
The dataset directory looks like this:
├── auxiliary_datasets
│   ├── cpe_dataset.json
│   ├── cve_dataset.json
│   ├── cpe_match.json
│   └── algorithms.json
├── certs
│   └── targets
│       ├── pdf
│       └── txt
└── dataset.json