Stages of Data Visualization: Data Exploration
Data visualization is the use of visual representations of data, such as graphs, charts, and maps, to give data analysts insight into phenomena and events, including patterns, trends, and correlations among entities. It is a method of extracting valuable knowledge from unstructured, semi-structured, and structured data. The process of data visualization consists of four stages: data exploration, data analysis, data synthesis, and data presentation. Data exploration is the initial stage, in which data is prepared for further statistical analysis.
Data exploration identifies missing values, outliers, features, and variables in the data (John & Kohli, 2016). It is the process in which the analyst cleans the data and fills in missing values using various computations and transformations. Data exploration facilitates the evaluation of the quality of the input data, which in turn improves the accuracy of the output. The analyst categorizes the variables and applies statistical techniques to determine measures of central tendency, such as the mean, median, and mode, and to assess the spread of the data through the standard deviation, minimum, maximum, and variance.
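As a minimal sketch of these computations, assuming a tabular dataset handled with the pandas library (the file name sales.csv and the column name revenue are hypothetical placeholders), the following example counts missing values, computes measures of central tendency and spread, and imputes missing entries with the median:

import pandas as pd

# Load a tabular dataset; "sales.csv" is a hypothetical file name.
df = pd.read_csv("sales.csv")

# Count missing values per column to gauge the quality of the input data.
missing_counts = df.isna().sum()

# Central tendency for a numeric column (the column name is illustrative).
mean_value = df["revenue"].mean()
median_value = df["revenue"].median()
mode_value = df["revenue"].mode().iloc[0]

# Spread of the data: standard deviation, minimum, maximum, and variance.
spread = {
    "std": df["revenue"].std(),
    "min": df["revenue"].min(),
    "max": df["revenue"].max(),
    "var": df["revenue"].var(),
}

# One simple imputation strategy: fill missing values with the column median.
df["revenue"] = df["revenue"].fillna(median_value)

print(missing_counts)
print(mean_value, median_value, mode_value, spread)

Comparing the mean and the median in such a summary already hints at skew in the column and at the possible presence of outliers.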
Examining the distribution of the data through frequency and statistical analysis provides a basic understanding of the dataset. This guides the analyst in identifying missing values and estimating values that may satisfactorily fit the dataset (Hou, Liang, Zhang, & Zhang, 2017). Exploration also helps determine extreme values and relationships in the data that guide the data mining process, and it supports the estimation of outlier values and the choice of action needed to correct their presence: the analyst may delete the outlying value or attribute, or treat it individually by identifying the conditions and factors that led to its occurrence. Exploration further involves feature engineering, whereby the analyst creates derived or dummy variables from dependencies in the data, and can correct noisy data through data smoothing.
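The following sketch illustrates these steps under simplifying assumptions; the column names and the small in-memory dataset are invented for the example, and the interquartile-range rule is only one common way to flag outliers:

import pandas as pd

# Hypothetical data frame with one numeric and one categorical column.
df = pd.DataFrame({
    "amount": [10, 12, 11, 13, 250, 12, 14, None, 11, 13],
    "region": ["north", "south", "north", "east", "south",
               "north", "east", "south", "north", "east"],
})

# Flag extreme values with the interquartile-range (IQR) rule.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)

# Treat outliers and missing values: here both are replaced with the median,
# though the analyst might instead delete them or inspect them individually.
median_amount = df.loc[~outliers, "amount"].median()
df.loc[outliers, "amount"] = median_amount
df["amount"] = df["amount"].fillna(median_amount)

# Feature engineering: create dummy variables from the categorical column.
df = pd.get_dummies(df, columns=["region"], prefix="region")

# Data smoothing: a rolling mean reduces noise in the numeric series.
df["amount_smoothed"] = df["amount"].rolling(window=3, min_periods=1).mean()

print(df.head())

Replacing flagged outliers with the median is only one possible treatment; as noted above, the choice of deletion, replacement, or individual inspection depends on the conditions that produced the outlier.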
References
John, J., & Kohli, K. (2016). U.S. Patent No. 9,298,856. Washington, DC: U.S. Patent and Trademark Office.
Hou, Z., Liang, X., Zhang, H., & Zhang, D. (2017). U.S. Patent No. 9,563,674. Washington, DC: U.S. Patent and Trademark Office.