4. Analysis Method

4.1. Properties

4.1.1. supervised_learning

(boolean) It marks if a classification or regression algorithm (or any technique known as supervised learning) was used during the analysis part (e.g., a decision tree). This field specifically refers to algorithms, not methodologies or frameworks. Example:

"supervised_learning": true

4.1.2. unsupervised_learning

(boolean) It marks if a clustering algorithm (or any technique known as unsupervised learning) was used during the analysis part (e.g., DBSCAN). This field specifically refers to algorithms, not methodologies or frameworks. Example:

"unsupervised_learning": false

4.1.3. semisupervised_learning

(boolean) It marks if an algorithm known as semisupervised learning was used during the analysis part (e.g., Transductive SVM). This field specifically refers to algorithms, not methodologies or frameworks. Example:

"semisupervised_learning": true

4.1.4. anomaly_detection

(boolean) It marks if an algorithm known as an anomaly detection technique was used during the analysis part (e.g., LOF). This field specifically refers to algorithms, not methodologies or frameworks. Example:

"anomaly_detection": true
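
The four boolean fields above can be checked mechanically. The following is a minimal sketch in Python, assuming that all four fields are required (none of them is marked as optional); the helper name check_learning_flags and its messages are hypothetical and not part of the format.

```python
# Hypothetical helper: names and messages are illustrative, not part of the format.
LEARNING_FLAGS = (
    "supervised_learning",
    "unsupervised_learning",
    "semisupervised_learning",
    "anomaly_detection",
)


def check_learning_flags(analysis_method):
    """Return a list of problems found in the four boolean fields."""
    problems = []
    for key in LEARNING_FLAGS:
        if key not in analysis_method:
            # Assumption: the fields are required because they are not marked optional.
            problems.append(f"missing field: {key}")
        elif not isinstance(analysis_method[key], bool):
            # A JSON true/false is parsed to a Python bool; "true" (string) is not.
            problems.append(f"{key} must be true or false")
    return problems
```

An empty list means that the four fields are present and boolean.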

4.1.5. tools

(array of objects) Here we describe the tools used for the preprocessing (data extraction, feature generation and transformations).

Example:

"tools": [
    {
        "tool": "tshark",
        "detail": "v2.0.0",
        "availability": "public"
    },
    {
        "tool": "own_python_scripts",
        "detail": "none",
        "availability": "private"
    },
    {
        "tool": "own_perl_scripts",
        "detail": "none",
        "availability": "private"
    }
]

4.1.5.1. tool

(string) We use the following keys for the nomenclature:

  1. if they are released tools, software or suites, they must appear with the corresponding name; e.g., tshark, silk, tcpdump.
  2. if they consist of scripts or plugins for well-known programming languages, suites, packages or environments, the name must reflect such dependency; e.g., matlab_scripts, java_scripts, python_scripts.
  3. only the top-dependency must be shown (e.g., matlab). Additional relevant packages running under the same environment should also be added as tools.
  4. names start with own_ if the tools are presented in the paper or refer to previous publications by the same authors (and they do not fit case 1); e.g., own_matlab_scripts.

Example:

"tool": "tshark"

4.1.5.2. detail (optional)

(string) This field expresses important details about the referred tools (e.g., version, release). If no details are required, "none" should be written in the corresponding place. Example:

"detail": "v2.0.0"

4.1.5.3. availability

(string) This field expresses the availability of the referred tool. Please consider carefully the following default labels (values):

  • "public"
  • "private"
  • "public_on_demand"
  • "commercial"

Example:

"availability": "public"
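
A tool-object as described above can be checked with a few lines of Python. This is only a sketch under stated assumptions: a missing detail field is treated like "none" (the field is optional), and an availability value outside the default labels is reported instead of rejected, since the labels are given as defaults. The function name check_tool is hypothetical.

```python
# Default availability labels as listed in 4.1.5.3.
AVAILABILITY_LABELS = {"public", "private", "public_on_demand", "commercial"}


def check_tool(tool_obj):
    """Return a list of problems found in one tool-object (hypothetical helper)."""
    problems = []

    name = tool_obj.get("tool")
    if not isinstance(name, str) or not name:
        problems.append("tool must be a non-empty string")

    # Assumption: an absent detail is equivalent to "none".
    detail = tool_obj.get("detail", "none")
    if not isinstance(detail, str):
        problems.append("detail must be a string")

    availability = tool_obj.get("availability")
    if not isinstance(availability, str):
        problems.append("availability must be a string")
    elif availability not in AVAILABILITY_LABELS:
        problems.append(f"non-default availability label: {availability}")

    return problems
```

For instance, check_tool({"tool": "tshark", "detail": "v2.0.0", "availability": "public"}) returns an empty list.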

4.1.6. algorithms (optional)

(array of objects) algorithms can contain several algorithm-objects. An algorithm-object is composed of several fields:

4.1.6.1. name

(string) The name that identifies the main family of the algorithm. Example:

"name": "fuzzy clustering"

4.1.6.2. subname (optional)

(string) A more specific subname that refers to a particular specification or subclass of the algorithm. Example:

"subname": "gustafson-kessel"

4.1.6.3. learning (optional)

(string) It identifies the learning approach of the algorithm. Please consider carefully the following default labels (values):

  • "supervised"
  • "unsupervised"
  • "semisupervised"
  • "statistics/model_fit" the method uses predefined models, distributions and statistics and tries to check how real data fit such assumed models, i.e., it finds model parameters, gives summary values or discovers outliers based on distances to models.
  • "nest" when it embeds or operates in a higher level than other nested methods.
  • "no" when the word learning cannot reasonably be applied to the algorithm used.

Example:

"learning": "supervised"

4.1.6.4. role (optional)

(string) This field is meaningful when diverse algorithms are compared. Default values are:

  • "main" the method led to the best solution.
  • "validation" the algorithm is used to establish a ground truth.
  • "competitor" for all other cases.

If only one algorithm is used, it is always "main".

Example:

"role": "main"
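
The rule for a single algorithm can be expressed as a small check. The Python sketch below assumes that role may be absent (the field is optional) and that values outside the default labels are only reported; check_roles is a hypothetical name.

```python
# Default role labels as listed in 4.1.6.4.
ROLE_LABELS = {"main", "validation", "competitor"}


def check_roles(algorithms):
    """Return a list of problems with the role fields of an algorithms array."""
    problems = []
    for i, algo in enumerate(algorithms):
        role = algo.get("role")  # optional field, so None is accepted
        if role is not None and role not in ROLE_LABELS:
            problems.append(f"algorithms[{i}]: non-default role {role!r}")

    # If only one algorithm is used, its role is always "main".
    if len(algorithms) == 1:
        role = algorithms[0].get("role")
        if role is not None and role != "main":
            problems.append("a single algorithm must have role 'main'")
    return problems
```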

4.1.6.5. type (optional)

(string) It identifies the type of algorithm with regard to the main analysis approaches. Please consider carefully the following default labels (values):

  • "classification"
  • "regression"
  • "clustering"
  • "anomaly_detection"
  • "heuristics" the algorithm is quite ad-hoc and based on rules and equations defined by the authors’ expert knowledge.
  • "statistics" the algorithm belongs to the statistics domain and uses parametric or non-parametric models to explain the data.
  • "text_matching" the algorithm bases its classification and decisions on searching for specific text strings or comparing text strings.

Example:

"type": "heuristics"

4.1.6.6. metric/decision_criteria (optional)

(string) It identifies the metric used (a similarity or dissimilarity distance), which is also the core of the decision-making criteria. Please consider carefully the following default labels (values):

  • "error/fitting_function"
  • "euclidean"
  • "mutual_information"
  • "correlation"
  • "jaccard"
  • "mahalanobis"
  • "hamming"
  • "exact_matching"
  • "manhattan"
  • "probabilistic"
  • "vote"

Example:

"metric/decision_criteria": "euclidean"

4.1.6.7. tools (optional)

(see tools)

4.1.6.8. source (optional)

(string) It identifies the origin of the algorithm. Please consider carefully the following default labels (values):

  • "own_proposed" if authors developed and present the algorithm in the paper.
  • "own_referenced" if authors developed the algorithm but presented it in a previous publication.
  • "referenced" if authors took the method from the literature or known sources.

Example:

"source": "referenced"
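
The default labels of learning, type, metric/decision_criteria and source can be collected in one table and compared against an algorithm-object. This is a sketch in Python, assuming that values outside the defaults should be listed for review and not rejected; DEFAULT_LABELS and non_default_labels are hypothetical names.

```python
# Default labels copied from the field descriptions of the algorithm-object.
DEFAULT_LABELS = {
    "learning": {
        "supervised", "unsupervised", "semisupervised",
        "statistics/model_fit", "nest", "no",
    },
    "type": {
        "classification", "regression", "clustering", "anomaly_detection",
        "heuristics", "statistics", "text_matching",
    },
    "metric/decision_criteria": {
        "error/fitting_function", "euclidean", "mutual_information",
        "correlation", "jaccard", "mahalanobis", "hamming",
        "exact_matching", "manhattan", "probabilistic", "vote",
    },
    "source": {"own_proposed", "own_referenced", "referenced"},
}


def non_default_labels(algorithm):
    """Return {field: value} for present fields whose value is not a default label."""
    found = {}
    for field, labels in DEFAULT_LABELS.items():
        if field not in algorithm:
            continue  # all four fields are optional
        value = algorithm[field]
        if not (isinstance(value, str) and value in labels):
            found[field] = value
    return found
```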

4.1.6.9. parameters_provided (optional)

(boolean or string) This field expresses if the required parameters for reproducing the analysis are provided. In addition to true and false, "partially" is also possible when authors provide some parameters but others are missing or, for any reason, the experiment does not seem to be reproducible.

Note

If the method has no parameters, use true, since you have enough information to replicate it.

Example:

"parameters_provided": "partially"
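
Since this field mixes two JSON types, a reader of the data has to accept both. A minimal Python sketch, with the hypothetical helper name is_valid_parameters_provided:

```python
import json


def is_valid_parameters_provided(value):
    """Accept JSON true/false or the string "partially"; reject anything else."""
    if isinstance(value, bool):
        return True
    return value == "partially"


# JSON true/false become Python bools; "partially" stays a string.
parsed = json.loads('{"parameters_provided": "partially"}')
print(is_valid_parameters_provided(parsed["parameters_provided"]))  # True
```

Note that the strings "true" and "false" are rejected by this sketch, since the format describes boolean values for those two cases.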

4.2. JSON example (analysis_method, complete)

"analysis_method": {
  "supervised_learning": false,
  "unsupervised_learning": true,
  "semisupervised_learning": true,
  "anomaly_detection": true,
  "tools": [
      {
          "tool": "matlab_fuzzyclusteringtoolbox",
          "detail": "none",
          "availability": "public"
      },
      {
          "tool": "own_matlab_scripts",
          "detail": "none",
          "availability": "private"
      }
  ],
  "algorithms": [
      {
          "name": "fuzzy clustering",
          "subname": "gustafson-kessel",
          "learning": "unsupervised",
          "role": "main",
          "type": "clustering",
          "metric/decision_criteria": "mahalanobis",
          "tools": [
              {
                  "tool": "matlab_fuzzyclusteringtoolbox",
                  "detail": "none",
                  "availability": "public"
              }
          ],
          "source": "referenced",
          "parameters_provided": false
      },
      {
          "name": "mad-based outlier removal",
          "subname": "double mad",
          "learning": "statistics/model_fit",
          "role": "main",
          "type": "anomaly_detection",
          "metric/decision_criteria": "mahalanobis",
          "tools": [
              {
                  "tool": "own_matlab_scripts",
                  "detail": "none",
                  "availability": "private"
              }
          ],
          "source": "referenced",
          "parameters_provided": false
      }
  ]
},
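
An analysis_method block is a fragment of a larger JSON object, so it cannot be parsed on its own. The Python sketch below wraps such a fragment in braces and then lists algorithm fields that are not described in 4.1.6, which helps to spot misspelled keys. The helper names are hypothetical, and treating undescribed fields as suspicious is an assumption of this sketch, not a rule of the format.

```python
import json

# Field names of the algorithm-object as described in 4.1.6.
ALGORITHM_FIELDS = {
    "name", "subname", "learning", "role", "type",
    "metric/decision_criteria", "tools", "source", "parameters_provided",
}


def parse_fragment(fragment):
    """Parse a '"key": {...},' fragment by wrapping it into a JSON object."""
    text = fragment.strip().rstrip(",")  # drop the trailing comma, if any
    return json.loads("{" + text + "}")


def unknown_algorithm_fields(analysis_method):
    """Map algorithm index -> sorted field names not described in 4.1.6."""
    unknown = {}
    for i, algo in enumerate(analysis_method.get("algorithms", [])):
        extra = sorted(set(algo) - ALGORITHM_FIELDS)
        if extra:
            unknown[i] = extra
    return unknown


# Illustrative fragment (not taken from a real annotation).
fragment = '"analysis_method": {"algorithms": [{"name": "k-means", "metric": "euclidean"}]},'
doc = parse_fragment(fragment)
print(unknown_algorithm_fields(doc["analysis_method"]))  # {0: ['metric']}
```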