2. Data

2.1. Properties

2.1.1. datasets

(array of objects) datasets can contain several dataset-objects. A dataset-object is composed of several fields.

2.1.1.1. name

(string) The name that identifies the dataset. By default, names follow a source-year pattern (the source, a hyphen, and the year). Example:

"name": "mawi-2015"

2.1.1.2. availability

(string) Establishes how a normal Internet user can access the dataset. Please check carefully whether the dataset's accessibility fits any of the following default labels (values):

  • "public"
  • "public_on_demand"
  • "private"
  • "lost_source" when the paper provides the source/link of the dataset but it is no longer valid.

Example:

"availability": "public"

2.1.1.3. format (optional)

(string) Indicates whether the dataset contains packet or flow descriptions. The default options are therefore "packet" and "flow". Example:

"format": "flow"

2.1.1.4. types (optional)

(array of strings) Indicates whether the dataset has been pre-filtered and contains only some types of data, based on protocols, versions, etc. Treat labels (values) as filter keys (e.g., if "ipv4" is used, there is no need to add "tcp" or "udp" too). Please check whether the dataset type fits any of the following default labels (values):

  • "ip"
  • "ipv4"
  • "ipv6"
  • "tcp"
  • "http"
  • "udp"
  • "icmp"
  • "dns"
  • "tls"
  • "ipsec"

Note

The most general label should be used when all of its subsets apply. For example, ["ipv4", "ipv6"] is the same as ["ip"], so ["ip"] should be written.

Example:

"types": ["ipv4"]

2.1.1.5. generation (optional)

(string) Describes how the dataset was generated. Please check carefully whether the dataset generation fits any of the following default labels (values):

  • "captured" when the dataset has been directly captured from network sensors.
  • "synthetic" when the dataset has been generated by algorithms for artificial traffic generation. This includes data captured in the network if the packets were algorithmically generated.
  • "altered_captured" when the dataset is modeled/based on real captures, but manipulated to fulfill some specific criteria (e.g., increase the presence of certain attacks). Also includes datasets generated by running the actual application and capturing its traffic.
  • "mixed" whenever real captures or capture-based traffic is mixed with synthetic traffic.

Note

There is a slight overlap between "synthetic" and "altered_captured". Common sense should usually be enough to disambiguate between them for each paper. If this is not the case for a particular paper, a consensus among experts is necessary.

Example:

"generation": "captured"

2.1.1.6. generation_year

(number or array of numbers) The year the dataset was captured or generated. Example:

"generation_year": 2015

2.1.1.7. covered_period (optional)

(string) Gives an approximate impression of the time span covered by the dataset as used in the analysis. As a criterion, if the covered period is less than two times a unit, select the next smaller unit: for example, if the dataset covers 90 minutes, covered_period should be "minutes"; if it covers 120 minutes, covered_period should be "hours". Please check carefully whether the covered period fits any of the following default labels (values):

  • "minutes"
  • "hours"
  • "days"
  • "weeks"
  • "months"
  • "years"

Example:

"covered_period": "hours"

2.1.1.8. details (optional)

(array of strings) Records special characteristics of the dataset that are worth considering in meta-analysis. Please check carefully whether any of the following default labels (values) are applicable:

  • "raw" when data is presented as it came directly from sensors or generators, with no shape/format transformation. Includes both packet captures (e.g., tcpdump) and flow records (e.g., NetFlow).
  • "preprocessed" when data has been transformed/mapped during a preprocessing step. The preprocessing must have changed the data format in some way, for example by transforming it into structured vectors. Data that has only been filtered is still "raw".
  • "no_payload" when payload has been removed from data. Payload removal does not make data preprocessed.

When no relevant details exist, write "none".

Example:

"details": ["raw", "no_payload"]

2.1.1.9. subsets

(array of strings) The dataset might consist of diverse subsets. Here we specify which subsets have been used during the analysis. If the paper does not clearly give the subsets proper names, the default nomenclature refers to the date if possible (format: hh-dd-mm-yyyy). If there are no relevant subsets, write "none", and if the subsets are not specified (or if it is not clear whether subsets are used or not), write "missing".

Note

You can also use this field when a dataset has been divided into constant time pieces (for example, when a one-hour-long dataset was divided into 60 one-minute-long datasets).

Example:

"subsets": ["03-11-2014", "30-06-2015", "27-12-2016"]

2.1.1.10. anonymized (optional)

(boolean) Whether the dataset is anonymized or not.

Example:

"anonymized": true

2.2. JSON example (data, complete)

"data": {
  "datasets": [
    {
      "name": "mawi-2015",
      "availability": "public",
      "format": "packet",
      "types": ["ip"],
      "generation": "captured",
      "generation_year": 2015,
      "covered_period": "minutes",
      "details": ["raw","no_payload"],
      "subsets": ["01-01-2015","15-04-2015","31-07-2015"]
    },
    {
      "name": "kddcup-1999",
      "availability": "public",
      "format": "packet",
      "types": ["ipv4"],
      "generation": "altered_captured",
      "generation_year": 1999,
      "covered_period": "missing",
      "details": ["preprocessed"],
      "subsets": ["original","original_10_percent","corrected"],
      "anonymized": true
    }
  ]
}
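
A dataset-object of the shape described in section 2.1.1 can be checked mechanically. The following is a minimal Python sketch under stated assumptions, not an official validator: the names `check_dataset`, `REQUIRED`, and `DEFAULT_LABELS` are hypothetical; fields not marked "(optional)" are assumed to be required; and values outside the default labels are only reported as remarks, since the text calls them defaults rather than a closed set.

```python
import json

# Assumption: fields not marked "(optional)" are required.
REQUIRED = ("name", "availability", "generation_year", "subsets")

# Default labels copied from section 2.1.1.
DEFAULT_LABELS = {
    "availability": {"public", "public_on_demand", "private", "lost_source"},
    "format": {"packet", "flow"},
    "generation": {"captured", "synthetic", "altered_captured", "mixed"},
    "covered_period": {"minutes", "hours", "days", "weeks", "months", "years"},
    "types": {"ip", "ipv4", "ipv6", "tcp", "http", "udp", "icmp", "dns",
              "tls", "ipsec"},
}


def check_dataset(dataset):
    """Return a list of remarks; an empty list means nothing was flagged."""
    remarks = []
    for field in REQUIRED:
        if field not in dataset:
            remarks.append(f"missing required field: {field}")
    if "types" in dataset and not isinstance(dataset["types"], list):
        remarks.append("types should be an array of strings")
    for field, labels in DEFAULT_LABELS.items():
        if field not in dataset:
            continue
        value = dataset[field]
        values = value if isinstance(value, list) else [value]
        for v in values:
            if v not in labels:
                remarks.append(f"non-default label in {field}: {v}")
    return remarks


example = json.loads("""
{
  "name": "mawi-2015",
  "availability": "public",
  "format": "packet",
  "types": ["ip"],
  "generation": "captured",
  "generation_year": 2015,
  "covered_period": "minutes",
  "details": ["raw", "no_payload"],
  "subsets": ["01-01-2015", "15-04-2015", "31-07-2015"]
}
""")
print(check_dataset(example))  # []
```

Returning remarks instead of raising an error keeps the check usable for entries that deliberately use non-default values such as "missing".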