3. Preprocessing

3.1. Properties

3.1.1. performed_feature_selection

(boolean) It states if a feature selection process is carried out in the paper to select the suitable set of features for the analysis. Example:

"performed_feature_selection": true

3.1.2. packet_analysis_oriented

(boolean) It states if, after the preprocessing phase, the data to analyze is intended to be explored packet by packet (e.g., by methods that perform deep packet inspection). Example:

"packet_analysis_oriented": false

3.1.3. flow_analysis_oriented

(boolean) It states if, after the preprocessing phase, the data to analyze is intended to be explored flow by flow. Example:

"flow_analysis_oriented": true

3.1.4. flow_aggregation_analysis_oriented

(boolean) It states if, after the preprocessing phase, the data to analyze has been aggregated according to either features or flows. Therefore, the final analysis will not explore flows or packets, but usually networks as a whole by studying the aggregated values (e.g., time series showing the use of network resources). Example:

"flow_aggregation_analysis_oriented": false

3.1.5. tools

(array of objects) Here we describe the tools used for the preprocessing (data extraction, feature generation and transformations).

Example:

"tools": [
    {
        "name": "tshark",
        "detail": "v2.0.0",
        "availability": "public"
    },
    {
        "name": "own_python_scripts",
        "detail": "none",
        "availability": "private"
    },
    {
        "name": "own_perl_scripts",
        "detail": "none",
        "availability": "private"
    }
]

3.1.5.1. name

(string) We use the following keys for the nomenclature:

  1. if they are released tools, software or suites, they must appear with the corresponding name; e.g., tshark, silk, tcpdump.
  2. if they consist of scripts or plugins for well-known programming languages, suites, packages or environments, the name must reflect such dependency; e.g., matlab_scripts, java_scripts, python_scripts.
  3. only the top-level dependency must be shown (e.g., matlab). Additional relevant packages running under the same environment should also be added as separate tools.
  4. names start with own_ if the tools are presented in the paper or refer to previous publications by the same authors (and they do not fit case 1); e.g., own_matlab_scripts.

Example:

"name": "tshark"

3.1.5.2. detail (optional)

(string) This field expresses important details about the referred tools (e.g., version, release). If no details are required, "none" should be written in the corresponding place. Example:

"detail": "v2.0.0"

3.1.5.3. availability

(string) This field expresses the availability of the referred tool. Please, consider carefully the following default labels (values):

  • "public"
  • "private"
  • "public_on_demand"
  • "commercial"
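
The rules for a tool object can be checked mechanically. The following Python sketch shows one possible check; the function name check_tool and the lowercase snake_case pattern are assumptions inferred from the examples (tshark, own_python_scripts), not part of the specification.

```python
import re

# Availability labels listed in section 3.1.5.3.
AVAILABILITY_LABELS = {"public", "private", "public_on_demand", "commercial"}


def check_tool(tool):
    """Return a list of problems found in one tool object (empty if none)."""
    problems = []

    name = tool.get("name")
    if not isinstance(name, str) or not name:
        problems.append("name must be a non-empty string")
    elif not re.fullmatch(r"[a-z0-9_]+", name):
        # Assumption: names are lowercase snake_case, as in the examples.
        problems.append(f"name {name!r} is not lowercase snake_case")

    # detail is optional; "none" is the documented placeholder.
    detail = tool.get("detail", "none")
    if not isinstance(detail, str):
        problems.append("detail must be a string")

    availability = tool.get("availability")
    if availability not in AVAILABILITY_LABELS:
        problems.append(f"unknown availability {availability!r}")

    return problems


print(check_tool({"name": "tshark", "detail": "v2.0.0", "availability": "public"}))
```

An empty list means the object passed every check; each string in a non-empty list names one violated rule.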

Example:

"availability": "public"

3.1.6. normalization_type

(string) This field saves information about possible normalization of numerical data. "no" stands for cases where no normalization is applied but numerical attributes are used. "not_applicable" is for cases where normalization makes no sense (e.g., all analyzed fields are nominal or categories). Please, consider carefully the following default labels (values):

"no", "not_applicable", "range", "zscore", "decimal_scaling", "quartile"

Note

Do not confuse "quartile" with "quantile". "quartile" normalization uses Q1 (25th percentile) and Q3 (75th percentile) for normalization.
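
To make the labels concrete, the following Python sketch gives one common formula for each normalization type. The function names are illustrative only. The exact "quartile" formula is not given here, so (x - Q1) / (Q3 - Q1) is an assumption, and papers may define it differently (e.g., centering on the median). Constant inputs (zero range or zero deviation) are not handled.

```python
import statistics


def range_norm(xs):
    """"range": rescale linearly so that min -> 0 and max -> 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def zscore_norm(xs):
    """"zscore": subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)  # population standard deviation
    return [(x - mu) / sigma for x in xs]


def decimal_scaling_norm(xs):
    """"decimal_scaling": divide by the smallest power of 10 that
    brings every absolute value below 1."""
    peak = max(abs(x) for x in xs)
    j = 0
    while peak / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in xs]


def quartile_norm(xs):
    """"quartile": assumed here to be (x - Q1) / (Q3 - Q1)."""
    q1, _median, q3 = statistics.quantiles(xs, n=4)
    return [(x - q1) / (q3 - q1) for x in xs]


print(range_norm([0, 5, 10]))
print(decimal_scaling_norm([120, -45, 999]))
```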

Example:

"normalization_type": "range"

3.1.7. transformations

(array of strings) This field collects all transformations that are performed after the dataset retrieval and prior to the analysis phase (i.e., they are part of the data preparation). Please, consider carefully the following listed operations (values):

"sampling", "filtering", "log", "map", "graph", "feature_aggregation", "flow_extraction", "entropy", "time_series", "feature_operation", "class_separation"

Note

This field is redundant with the features in the packets/flows/flow aggregations. However, this field is mandatory while the feature fields are optional.

Example:

"transformations": ["sampling", "flow_extraction", "class_separation"]

3.1.8. final_data_format

(string) It records the format of the data after preprocessing and prior to the analysis phase. Please, consider carefully the following default labels (values):

  • "numerical_vectors"
  • "nominal_vectors"
  • "mixed_vectors"
  • "strings"
  • "time_series"
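
The label sets of normalization_type, transformations and final_data_format can be verified with a few set lookups. The Python sketch below is a minimal illustration; check_labels is a hypothetical helper, and it treats the listed labels as a closed set, which may be stricter than "default labels" is meant to imply.

```python
# Label sets copied from sections 3.1.6, 3.1.7 and 3.1.8.
NORMALIZATION_TYPES = {
    "no", "not_applicable", "range", "zscore", "decimal_scaling", "quartile",
}
TRANSFORMATIONS = {
    "sampling", "filtering", "log", "map", "graph", "feature_aggregation",
    "flow_extraction", "entropy", "time_series", "feature_operation",
    "class_separation",
}
FINAL_DATA_FORMATS = {
    "numerical_vectors", "nominal_vectors", "mixed_vectors", "strings",
    "time_series",
}


def check_labels(preprocessing):
    """Return a list of label problems in a preprocessing object."""
    problems = []

    norm = preprocessing.get("normalization_type")
    if norm not in NORMALIZATION_TYPES:
        problems.append(f"unknown normalization_type {norm!r}")

    # transformations is an array; report every label that is not listed.
    for label in preprocessing.get("transformations", []):
        if label not in TRANSFORMATIONS:
            problems.append(f"unknown transformation {label!r}")

    fmt = preprocessing.get("final_data_format")
    if fmt not in FINAL_DATA_FORMATS:
        problems.append(f"unknown final_data_format {fmt!r}")

    return problems


print(check_labels({
    "normalization_type": "quantile",  # typo for "quartile"
    "transformations": ["sampling", "flow_extraction"],
    "final_data_format": "numerical_vectors",
}))
```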

Example:

"final_data_format": "numerical_vectors"

3.1.9. feature_selections (optional)

(array of objects) feature_selections can contain several feature_selection-objects. When no feature_selection-object exists in the paper, write "none". A feature_selection-object is composed of several fields:

3.1.9.1. name

(string) The name that identifies the feature selection technique. Example:

"name": "forward_selection"

3.1.9.2. type (optional)

(string) It identifies the type of feature selection method. Please, consider carefully the following default labels (values):

  • "wrapper" when candidate feature subsets are scored by training and evaluating a classifier on them.
  • "filter" when features are ranked or selected with statistical measures computed independently of any classifier.
  • "hybrid" when filter and wrapper steps are combined.
  • "nest" when it embeds or operates at a higher level than other nested methods.
  • "feature_reduction" when it refers to methods that change the space and transform the initial set of features into a new set of features with fewer dimensions (e.g., PCA, LDA).

Example:

"type": "wrapper"

3.1.9.3. classifier (optional)

(string) It identifies the wrapped classifier that is used to evaluate the subset performance. If classifier is not applicable (e.g., for filters), write "none". Example:

"classifier": "naive_bayes"

3.1.9.4. role (optional)

(string) This field is meaningful when diverse feature selection methods are compared. Default values are: "main", when the method led to the best solutions; and "competitor" for other cases. If only one feature selection method is used, it is always "main". Example:

"role": "main"

3.1.10. packets (optional)

(array of objects) packets can contain several packet-objects. A packet-object is defined when the analyses in the paper are conducted on packets, i.e., analysis tools check packets independently and/or inspect packet contents. Use this if you have a feature-vector for each packet. When no packet-object exists in the paper, write "none". A packet-object is composed of several fields:

3.1.10.1. selection (optional)

(string) It identifies how the features extracted to analyze packets were selected. Please, consider carefully the following default labels (values):

  • "in_dataset" if the analyzed feature set is exactly the same feature set of the dataset before preprocessing.
  • "feature_selection" if a feature selection process was conducted and led to the current feature subset.
  • "study_based" if the selected features are taken from a previous study referred in the paper.
  • "tool_based" if the selected features are obtained from an extraction or preprocessing tool.
  • "expert_knowledge" if the selection of features is endorsed by reasoning and proper explanations in the paper.

Example:

"selection": "in_dataset"

3.1.10.2. role (optional)

(string) This field is meaningful when diverse preprocessing methods are compared.

Default values are:

  • "main" when the method led to the best solutions.
  • "validation" for the specific case of packets, when packet inspection is used as baseline or ground truth for validating flow-based analysis.
  • "intermediate" if this method of aggregation is only used as an intermediate step on the way to further aggregation, e.g., if flows are used only for the purpose of being aggregated into flow aggregations.
  • "competitor" otherwise.

Example:

"role": "validation"

3.1.10.3. main_goal (optional)

(string) This field saves the main goal of preparing the data according to this packet-based format. Please, consider the following possible labels (values):

  • "anomaly_detection"
  • "application_classification"
  • "attack_classification"
  • "botnet_detection"
  • "classification_for_qos"
  • "classification_of_encrypted_traffic"
  • "ddos_detection"
  • "dos_detection"
  • "http_intrusion_detection"
  • "network_properties_monitoring"
  • "p2p_botnet_detection"
  • "p2p_traffic_classification"
  • "probe_detection"
  • "remote_to_local_detection"
  • "specific_malware_detection"
  • "traffic_classification"
  • "traffic_rate_prediction"
  • "traffic_visualization"
  • "user_to_root_detection"

Example:

"main_goal": "traffic_classification"

3.1.10.4. features (optional)

(array of objects)

Describes the features used in the paper. See Features for complete information.

3.1.11. flows (optional)

(array of objects) flows can contain several flow-objects. A flow-object is defined when the analyses in the paper are conducted on flows, i.e., analysis tools check the behaviour of connections and connection attempts. Use this if you have a feature-vector for each flow. When no flow-object exists in the paper, write "none". A flow-object is composed of several fields:

3.1.11.1. selection (optional)

Like in packet-object.selection.

3.1.11.2. role (optional)

Like in packet-object.role.

3.1.11.3. main_goal (optional)

Like in packet-object.main_goal.

3.1.11.4. active_timeout (optional)

(numerical, in seconds) This field defines the maximum duration of a flow. Example:

"active_timeout": 60

3.1.11.5. idle_timeout (optional)

(numerical, in seconds) This field defines the period of inactivity after which the flow is considered finished. Example:

"idle_timeout": 5

3.1.11.6. bidirectional (optional)

(boolean) This field indicates whether transmissions between two devices A and B are considered unidirectional (false), i.e., A>B and A<B are two different flows; or bidirectional (true), i.e., A>B and A<B belong to the same flow. Example:

"bidirectional": true

3.1.11.7. features (optional)

(see features)

3.1.11.8. key_features (optional)

(array of objects)

Describes the features used to aggregate the packets. That is, packets which share these features will be put in the same flow. In case all packets should be in the same flow, use "none".

For the features, see features.
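
The interplay of key_features, active_timeout, idle_timeout and bidirectional can be illustrated with a small flow extractor. The Python sketch below is a simplified illustration, not the procedure of any particular tool. It assumes each packet is a dict with a hypothetical "time" field (in seconds) next to the key features, and that the two endpoints are named sourceIPv4Address and destinationIPv4Address.

```python
SRC, DST = "sourceIPv4Address", "destinationIPv4Address"


def flow_key(packet, key_features, bidirectional):
    """Build the grouping key of a packet from the key features."""
    if key_features == "none":
        return ()  # all packets share one key, hence one flow
    key = {k: packet[k] for k in key_features}
    if bidirectional and SRC in key and DST in key:
        # Order the endpoints so that A>B and B>A yield the same key.
        key[SRC], key[DST] = sorted((key[SRC], key[DST]))
    return tuple(key[k] for k in key_features)


def extract_flows(packets, key_features, active_timeout, idle_timeout,
                  bidirectional):
    """Group packets into flows; returns a list of packet lists."""
    open_flows = {}  # key -> packets of the currently open flow
    finished = []
    for pkt in sorted(packets, key=lambda p: p["time"]):
        key = flow_key(pkt, key_features, bidirectional)
        flow = open_flows.get(key)
        if flow is not None:
            idle = pkt["time"] - flow[-1]["time"]  # gap since last packet
            age = pkt["time"] - flow[0]["time"]    # time since flow start
            if idle > idle_timeout or age > active_timeout:
                finished.append(flow)  # close the flow, start a new one
                flow = None
        if flow is None:
            flow = open_flows[key] = []
        flow.append(pkt)
    finished.extend(open_flows.values())
    return finished


A, B = "10.0.0.1", "10.0.0.2"
packets = [
    {"time": 0, SRC: A, DST: B, "protocolIdentifier": 6},
    {"time": 1, SRC: B, DST: A, "protocolIdentifier": 6},
    {"time": 2, SRC: A, DST: B, "protocolIdentifier": 6},
    {"time": 10, SRC: A, DST: B, "protocolIdentifier": 6},
]
keys = [SRC, DST, "protocolIdentifier"]
print(len(extract_flows(packets, keys, 60, 5, bidirectional=True)))
print(len(extract_flows(packets, keys, 60, 5, bidirectional=False)))
```

With an idle_timeout of 5 seconds, the packet at time 10 starts a new flow in both calls; the bidirectional setting only decides whether the reply from B to A joins the flow opened by A.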

3.1.12. flow_aggregations (optional)

(array of objects) flow_aggregations can contain several flow_aggregation-objects. A flow_aggregation-object is defined when the analyses in the paper are conducted on aggregations of features or flows, i.e., analysis tools usually describe networks as a whole. Use this if you have a feature-vector for each set of flows. When no flow_aggregation-object exists in the paper, write "none". A flow_aggregation-object is composed of several fields:

3.1.12.1. selection (optional)

Like in packet-object.selection.

3.1.12.2. role (optional)

Like in packet-object.role.

3.1.12.3. main_goal (optional)

Like in packet-object.main_goal.

3.1.12.4. active_timeout (optional)

Like in flow-object.active_timeout.

3.1.12.5. bidirectional (optional)

Like in flow-object.bidirectional.

3.1.12.6. features (optional)

(see features)

3.1.12.7. key_features (optional)

(array of objects)

Describes the features used to aggregate the flows. That is, flows which share these features will be put in the same flow aggregation. In case all flows should be in the same flow aggregation, use "none".

For the features, see features.
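
Grouping flows by key features works like a group-by operation. The Python sketch below is a minimal illustration under stated assumptions: each flow is a dict of feature values, aggregate_flows is a hypothetical helper, and the aggregated values (a flow count named flowCount and the sum of octetTotalCount) are chosen only as examples.

```python
from collections import defaultdict


def aggregate_flows(flows, key_features):
    """Group flows by key features and compute one summary per group."""
    groups = defaultdict(list)
    for flow in flows:
        if key_features == "none":
            key = ()  # every flow lands in the same aggregation
        else:
            key = tuple(flow[k] for k in key_features)
        groups[key].append(flow)

    return {
        key: {
            "flowCount": len(members),  # illustrative aggregated value
            "octetTotalCount": sum(f["octetTotalCount"] for f in members),
        }
        for key, members in groups.items()
    }


flows = [
    {"sourceIPv4Address": "10.0.0.1", "octetTotalCount": 100},
    {"sourceIPv4Address": "10.0.0.1", "octetTotalCount": 50},
    {"sourceIPv4Address": "10.0.0.2", "octetTotalCount": 7},
]
print(aggregate_flows(flows, ["sourceIPv4Address"]))
print(aggregate_flows(flows, "none"))
```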

3.2. JSON example (preprocessing, complete)

"preprocessing": {
  "performed_feature_selection": true,
  "packet_analysis_oriented": false,
  "flow_analysis_oriented": true,
  "flow_aggregation_analysis_oriented": false,
  "tools": [
      {
          "name": "tshark",
          "detail": "v2.0.0",
          "availability": "public"
      },
      {
          "name": "own_perl_scripts",
          "detail": "none",
          "availability": "private"
      }
  ],
  "normalization_type": "range",
  "transformations": ["flow_extraction","log","time_series", "feature_operation", "class_separation"],
  "final_data_format": "numerical_vectors",
  "feature_selections": [
      {
          "name": "max-relevance min-redundancy filter (correlation and MI based)",
          "type": "filter",
          "classifier": "none",
          "role": "main"
      }
  ],
  "flows": [
      {
          "selection": "expert_knowledge",
          "role": "main",
          "main_goal": "traffic_classification",
          "active_timeout": 60,
          "idle_timeout": 60,
          "bidirectional": false,
          "features": [
              {"log": ["octetTotalCount"]},
              {"log": ["packetTotalCount"]},
              "_activeForSeconds",
              {"log": [{"divide": ["octetTotalCount", "_activeForSeconds"]}]},
              {"log": [{"divide": ["packetTotalCount", "_activeForSeconds"]}]},
              "__maximumConsecutiveSeconds",
              "__minimumConsecutiveSeconds",
              {"maximum": ["_interPacketTimeMicroseconds"]},
              {"minimum": ["_interPacketTimeMicroseconds"]},
              "__numberof_activity_intervals"
          ],
          "key_features": [
              "sourceIPv4Address",
              "destinationIPv4Address",
              "protocolIdentifier"
          ]
      },
      {
          "selection": "feature_selection",
          "role": "main",
          "main_goal": "traffic_classification",
          "active_timeout": 60,
          "idle_timeout": 60,
          "bidirectional": false,
          "features": [
              {"log": ["octetTotalCount"]},
              {"log": [{"divide": ["octetTotalCount", "_activeForSeconds"]}]},
              {"maximum": ["_interPacketTimeMicroseconds"]},
              {"minimum": ["_interPacketTimeMicroseconds"]}
          ],
          "key_features": [
              "sourceIPv4Address",
              "destinationIPv4Address",
              "protocolIdentifier"
          ]
      }
  ]
},
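
Feature entries in the example are either plain names or nested single-key objects such as {"log": [{"divide": [...]}]}. The Python sketch below walks that structure and prints each entry in function-call notation, which can help when reviewing long feature lists. It only mirrors the nesting seen in the example; the full feature grammar is defined in the Features section, and render_feature is a hypothetical helper.

```python
def render_feature(feature):
    """Render a feature entry as a readable function-call string."""
    if isinstance(feature, str):
        return feature  # a plain feature name
    if isinstance(feature, (int, float)):
        return str(feature)  # assumption: numeric constants may appear
    if isinstance(feature, dict) and len(feature) == 1:
        # A single-key object: the key is the operation, the value its
        # list of arguments, each of which may itself be nested.
        (operation, arguments), = feature.items()
        rendered = ", ".join(render_feature(arg) for arg in arguments)
        return f"{operation}({rendered})"
    raise ValueError(f"unexpected feature entry: {feature!r}")


features = [
    {"log": ["octetTotalCount"]},
    "_activeForSeconds",
    {"log": [{"divide": ["octetTotalCount", "_activeForSeconds"]}]},
]
for entry in features:
    print(render_feature(entry))
```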