Introduction
Data mining is a technique for extracting useful patterns and rules from huge data sets. A huge data set may be meaningless on its own: without any discernible pattern, it is just a list of data that represents nothing. Data mining and data extraction are performed to make such data useful. Data mining not only gives meaning to historical data but also predicts future outcomes from past data. Machine learning models built through data mining power modern artificial intelligence, such as search engine algorithms and recommendation systems. The insights derived from data mining play a vital role in marketing, fraud detection, scientific discovery, and more. Data mining follows a sequential process of six steps, known as the data mining implementation process.
In practice, data mining is applied in a number of fields, including credit risk management, marketing, fraud detection, spam filtering, healthcare bioinformatics, database marketing, and much more.
Literature Review
Data mining has been studied extensively, and many literature reviews have been written on it. Data mining is considered one of the most powerful of the new technologies being introduced to the world. The technology not only saves time but also produces additional insight that adds value. A number of results can be derived by using raw data; for example, answers to a survey can help the data miner understand the behavior of customers. Data mining helps in finding patterns, trends, and relations by considering the movement in data. Statistical and mathematical techniques are used to mine the data. Data does not only mean text; it is also stored in the form of images and sounds.
Knowledge discovery in databases is another way of describing data mining. It involves extracting data and patterns from an enormous collection of information in a database. The fundamental purpose of data mining is to apply various techniques to identify pieces of information and to use this information in decision-making applications. Over the last few years, data mining and its applications in finding patterns and supporting decision making have become a significant breakthrough in the world of technology, and it is now one of the most sought-after techniques for data analysis (Berry, 2000).
Data mining centers on the extraction and examination of information from large amounts of data in a database. In this review, the essential concepts, applications, challenges, and uses of data mining are examined. The review should help researchers concentrate on the challenges of data mining and their solutions. New constraints and algorithms for improved security and more accurate data retrieval could be introduced, which would help data mining establish itself as a distinct discipline where more studies could be conducted and more improvements made (Passerini, 2014). The future of data mining can extend to genetics through genetic algorithms, artificial neural networks that resemble biological neural networks, and the automated prediction of trends and behaviors. The extraction of useful if-then rules from data, known as rule induction, is another future trend in data mining. The future cannot be predicted with certainty; however, there are many challenges awaiting solutions, and data mining will be a breakthrough in the future of technology and humankind (Gorunescu, 2011).
Data Mining Procedure
The execution of data mining involves six steps, described below:
- Business Understanding:
In the first step, the goals and objectives of the process are shaped: both the data mining goals and the business and client objectives are identified. Assessment factors such as assumptions, resources, and constraints are taken into account at this stage. A data mining plan is set down that covers both the data mining goals and the business goals; the two need to coincide in order to achieve optimal results.
- Data Understanding:
To fully understand what is being done, we need to obtain as much data as possible from the sources available within the organization. Various resources are used, and it is possible that not all results match one another, so this step is complex and must be handled tactfully. The data is explored thoroughly, and any missing data is obtained in order to get the most accurate results.
- Data Preparation:
Data preparation is done in this step, and it consumes the largest share of project time in the data mining implementation process. Data from all resources is collected, scrutinized, transformed, formatted, analyzed, and constructed wherever needed. This step acts as a cleaning agent, producing clean data free from any issues.
- Modeling:
Mathematical and statistical methods are used to determine patterns in the prepared data. The modeling techniques used in this step are those that fit the business goals. The quality and validity of the model are checked, and the model is tested by running it on the prepared dataset. The results are then scrutinized by all stakeholders to confirm that they coincide with the objectives of the business.
- Evaluation:
In this step, it is determined whether the patterns obtained from modeling are consistent with the objectives for which the evaluation is done. The final decision on whether or not to deploy is taken here.
- Deployment:
At this final stage, the data mining results from the previous steps are integrated into the business system. The results are translated into simple words so that every individual connected to the business can understand what has been extracted from the available raw data. A detailed deployment plan is set out for delivering the data mining discoveries, and finally a report is created on the work carried out from the first step onward (Microstrategy.com, 2020).
Related Work
Many studies have been done in the past on the topic of data mining; in fact, there is a list of six top journals publishing on this topic. These include Information Processing and Management, International Knowledge Journal, Hi-Tech Documentation, Electronic Records, Culture and Change Management, and College and Research Libraries.
The work of Lorena Siguenza-Guzman is derived from comprehensive content associated with the application of data mining. Academic sources were used to collect data and build several narratives, and the article discusses data mining and its practical application techniques in detail. It provides an exhaustive literature review and a classification scheme for data mining techniques applied to academic libraries. To accomplish this, forty-one practical contributions over the period 1998-2014 were identified and reviewed for their direct relevance. Each article was classified by the primary data mining functions: clustering, association, classification, and regression; and by their application in the four principal library aspects: services, quality, collection, and user behavior. Findings show that both collection and usage behavior analyses have received most of the research attention, particularly in relation to collection development and the usability of websites and online services respectively. Moreover, classification and regression models are the two most widely used data mining functions in library settings (Siguenza-Guzman, 2015).
Algorithm
An algorithm is a defined set of calculations that can facilitate the whole process of knowledge discovery in databases. It is typically applied at the data preparation stage, so we can say algorithms are used for the appropriate classification of the available data.
Models Used:
The model specialises in the following:
- Use of the Seaborn library to visualize and understand the data better (correlation heatmap).
- Ordinary least squares (OLS) table: the linear least squares technique for estimating the unknown parameters of a statistical regression model (R-squared value, p-value).
Experimental Results
The data mining in this research was done on advertising data. Sales figures were analyzed against spending on three advertising media: radio, television, and newspaper. Marketing functions such as promotions and advertisements can increase customer traffic and ultimately contribute to the growth and success of a company. Customer segments can be shaped according to culture, buying behavior, and income level. Beyond these outcomes, the company is better positioned to retain and sustain its brand image and customer loyalty. A set of 200 entries was taken as raw data, which looked like this:
Data shape: (200, 4)
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3
3 151.5 41.3 58.5 18.5
4 180.8 10.8 58.4 12.9
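As a minimal sketch, such a dataset can be loaded with pandas; the file name Advertising.csv is an assumption (the paper does not state the source file), and the lower-case column names follow the correlation matrix shown later in this paper:

```python
import pandas as pd

# Load the advertising data. The file name is an assumption; the paper only
# states that 200 rows of TV/radio/newspaper/sales figures were used.
df = pd.read_csv("Advertising.csv", index_col=0)

print(df.shape)   # (200, 4)
print(df.head())  # the first five rows, as shown above
```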
Box Plot of Variables
The box plot is a standardized method of displaying the distribution of data based on a five-number summary:
- Minimum
- First quartile
- Median
- Third quartile
- Maximum
An outlier is a point that lies outside the overall distribution pattern. If an outlier exists, it may indicate issues in the data, and it may contradict the model being built. In the newspaper variable, these outliers are quite visible, and they can have an impact on the outcomes.
q1 = first quartile, q3 = third quartile, IQR = q3 - q1
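A minimal sketch of the box plots and the quartile quantities, continuing from the loading sketch above:

```python
import matplotlib.pyplot as plt

# Box plots for each advertising channel; points beyond the whiskers
# (more than 1.5 * IQR from the quartiles) are drawn as outliers.
df[["TV", "radio", "newspaper"]].plot(kind="box", subplots=True,
                                      layout=(1, 3), figsize=(12, 4))
plt.show()

# Quartile quantities for the newspaper column, where outliers are most visible.
q1 = df["newspaper"].quantile(0.25)  # first quartile
q3 = df["newspaper"].quantile(0.75)  # third quartile
iqr = q3 - q1                        # interquartile range
```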
Outlier Detection and Treatment
Outliers are extreme values that contradict the other observations. The contradiction may occur due to changes in measurement, and within a sample an outlier is hard to reconcile: the observation is simply very different from the rest.
After this analysis, an outlier treatment was applied, after which the box plot changed accordingly.
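A minimal capping sketch for the newspaper column, using the IQR bounds computed above; the paper does not specify the exact treatment, so clipping to the whisker bounds is an assumed choice:

```python
# Cap extreme values: anything beyond 1.5 * IQR from the quartiles is
# clipped to the whisker bounds (one common outlier treatment).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
df["newspaper"] = df["newspaper"].clip(lower=lower, upper=upper)
```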
Dataset For Advertising
The advertising dataset is used to analyze various data mining techniques for the variables 'TV', 'radio', 'newspaper', and 'sales'.
- Y (dependent variable) = 'sales'
- X (independent variables X1, X2, X3) = 'TV', 'radio', 'newspaper'
Correlation scatter plot
Histogram Plot of Data
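A minimal sketch of how these plots can be produced with Seaborn and pandas, continuing from the loading sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plots of each advertising channel against sales.
sns.pairplot(df, x_vars=["TV", "radio", "newspaper"], y_vars="sales",
             kind="scatter", height=4)
plt.show()

# Histograms to inspect the shape of each variable's distribution.
df.hist(figsize=(10, 6), bins=20)
plt.show()
```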
Data Transformation of Skewed Data
Skewness indicates imbalance in a distribution: the curve leans to the right or to the left rather than being symmetric, which differentiates the distribution from a normal one (Kamber, 2006). Columns with a skewness above 0.75 (data_num_skew > .75) are treated as positively skewed, and columns below -0.75 (data_num_skew < -.75) as negatively skewed.
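A minimal sketch of this check, continuing from the sketches above and assuming a log transform (np.log1p, one common choice that the paper does not name explicitly) for the positively skewed column; the resulting skewness values are listed after the code:

```python
import numpy as np

# Skewness of each predictor; only columns with |skew| above 0.75 are treated.
data_num_skew = df[["TV", "radio", "newspaper"]].skew()
print(data_num_skew)

# Log-transform the positively skewed column(s) on a copy of the data.
skewed_cols = data_num_skew[data_num_skew > 0.75].index
df_t = df.copy()
df_t[skewed_cols] = np.log1p(df_t[skewed_cols])
```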
TV -0.069328
Radio 0.093467
Newspaper 0.887996
dtype: float64
Histogram after skewness treatment
Correlation Heat map
Correlation is a method for assessing the relationship between variables; it tells whether different variables are linked or interconnected. R denotes the correlation coefficient, which indicates the strength of the linear relationship between two variables and takes values from +1 to -1. The following points can be used to interpret the correlation coefficient:
- 0 indicates that there is no linear connection or relationship.
- +1 indicates a perfect positive relationship: as one value increases, the other value increases as well.
- -1 indicates a perfect negative relationship: as one value increases, the other value decreases.
- Values between 0 and 0.3 (0 and -0.3) indicate a weak positive (negative) relationship via a shaky linear rule.
- Values between 0.3 and 0.7 (-0.3 and -0.7) indicate a moderate positive (negative) relationship via a fuzzy-firm linear rule.
- Values between 0.7 and 1.0 (-0.7 and -1.0) indicate a strong positive (negative) relationship via a firm linear rule.
TV radio newspaper
TV 1.000000 0.054809 0.056648
radio 0.054809 1.000000 0.354104
newspaper 0.056648 0.354104 1.000000
The correlation diagram is shown below:
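A minimal Seaborn sketch of how such a heatmap can be generated, continuing from the loading sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the advertising variables, as described in the
# Models Used section.
corr = df[["TV", "radio", "newspaper", "sales"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```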
VIF > 10: dangerous situation
The variance inflation factor (VIF) helps to identify multicollinearity in a statistical analysis. Multicollinearity occurs when the predictors (the independent variables) are correlated with one another, and its presence can negatively affect the results. The VIF estimates how much the variance of a coefficient is inflated because of multicollinearity; a VIF greater than 10 is considered unwanted.
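A minimal sketch of the VIF check, using the variance_inflation_factor helper from statsmodels:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF for each predictor; a value above 10 signals problematic multicollinearity.
X = add_constant(df[["TV", "radio", "newspaper"]])
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```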
Test Train Split for LR
Train = 80% Test = 20%
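A minimal scikit-learn sketch of the split and the linear regression fit, continuing from the sketches above; the random_state value is an assumption added only for reproducibility:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df[["TV", "radio", "newspaper"]]
y = df["sales"]

# 80% of the rows for training, 20% held out for testing, as stated above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

lr = LinearRegression().fit(X_train, y_train)
print(lr.intercept_, lr.coef_)  # intercept and coefficients, listed below
y_pred = lr.predict(X_test)     # predicted sales for the held-out rows
```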
Intercepts and Coefficients
Intercept: 3.254097114418883
Coefficients: [('TV', 0.04377260306304603), ('radio', 0.19343298611600768), ('newspaper', -0.002228792805605395)]
Predicted Values
      TV     radio  newspaper  Actual Sales  Predicted Sales
59    210.7  29.5   9.3        18.4          18.162530
5     8.7    48.9   75.0       7.2           12.926632
20    218.4  27.7   53.4       18.0          18.053110
198   283.6  42.0   66.2       25.5          23.644647
52    216.4  41.7   39.6       22.6          20.704384
19    147.3  23.9   19.1       14.6          14.282280
162   188.4  18.1   25.6       14.9          14.944935
55    198.9  49.4   60.0       23.7          21.382330
R² is one of the most prominent measures; it gives the proportion of the variance in the dependent variable that is explained by the independent variables. RMSE is the square root of the mean of the squared residuals and indicates how well the model fits the data: the observed data points are compared against the values predicted by the model.
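A minimal sketch of how both measures can be computed with scikit-learn for the held-out test rows from the split above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# R^2: proportion of the variance in sales explained by the predictors.
# RMSE: square root of the mean squared residual, in the units of sales.
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```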
Ordinary least squares (OLS)
OLS, or ordinary least squares regression, is used in statistical regression analysis. The analysis can be simple or complicated, depending on the number and nature of the variables. P refers to the probability value (p-value), which must also be identified. H0 is the null hypothesis, and hypothesis testing determines whether the data provide evidence against it. In this process, the p-value is used to decide whether to reject H0: a small p-value provides evidence against the null hypothesis, but it is not a direct probability that the hypothesis is true.
p-value > 0.05: accept the null hypothesis
Intercept    2.938889
TV           0.045765
radio        0.188530
newspaper   -0.001037
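A minimal statsmodels sketch of such an OLS fit; whether the paper fit the model on the transformed or the original variables is not stated, so the original data frame is assumed here:

```python
import statsmodels.formula.api as smf

# Fit OLS via a formula. The summary reports the adjusted R-squared and a
# p-value for each coefficient; estimates like those above come from such a fit.
ols_model = smf.ols("sales ~ TV + radio + newspaper", data=df).fit()
print(ols_model.summary())
```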
Techniques Used
Correlation
Correlation is an effective technique that helps in understanding the connection between variables and whether different variables are linked to each other.
We created a heatmap using the Seaborn library to find the strength of the correlation among the variables (see the sketch in the correlation heat map section above).
Regression
Regression is also one of the prominent techniques for statistical data analysis. Using regression analysis, different continuous values can be predicted. If an organization has to predict the cost of its products or services, it can use regression analysis to find an accurate value and plan further procedures (Skiadas, 2019).
Regression is a data mining technique used to predict a range of numeric values. We chose regression and outlier treatment because the advertisement data allows predictive analysis of sales depending on factors such as TV, newspaper, and radio advertising, in order to predict future sales.
Interpretation
The OLS regression results show that the adjusted R-squared value is 0.896, meaning that 89.6% of the variation in sales can be explained by the factors TV, radio, and newspaper.
Also, the p-value is less than 0.05 for all the factors. Hence all the factors (TV, Newspaper and Radio) are significant for our model.
Outlier Detection and Treatment
Outliers are extreme values that deviate from the other observations in a sample. Deviation is always possible: the measurement process may be affected by changes or variations, and experimental errors can also occur. An outlier is an observation that diverges from the overall pattern of the sample (Phung, 2018).
Discussion
Data mining techniques are used worldwide and have a number of applications. In our research, we applied the concepts of data mining along with statistical and mathematical tools and obtained results that are very useful for the business. These results were useful not only for the identification of sales but also for understanding what affects sales the most. The whole process of data mining turned meaningless raw data from an Excel sheet into meaningful results, which shows the importance of data mining in our everyday routine. Studies have predicted that the data mining of the coming age will be far better than what we use today (Piatetsky-Shapiro, 2000).
Conclusion
Data mining methods are applied in a wide range of domains where large amounts of data are available for the identification of unknown or hidden information. In our research, we explored how data mining could be used to interpret a large amount of data from three sources of advertisement with respect to sales. Sales was taken as the dependent variable and advertisement as the independent variables, and statistical and mathematical techniques were applied to obtain histograms and skewness diagrams to understand the effects of the advertising channels on sales. Data mining techniques, namely outlier imputation, correlation, and linear regression analysis, were performed on the advertising dataset. An additional OLS model was created to understand the strength of the model and the significance of the variables. Our correlation analysis shows that sales are affected most when there is a change in TV activity, which means that TV advertising affects sales the most.
Future Work Possibilities
A new data mining module has been proposed by experts, and a noticeable increase in efficiency and effectiveness is expected from it. It is intended that future study be done on this new module and its implementation in the real world (Roth, 1979). Development is a never-ending process of the human world: since the birth of humans, developments in life as a whole have been made. Technology development is part of this, affecting human society in positive as well as negative ways. Negative aspects aside, the development of technologies such as data mining has saved the time and effort that humans would otherwise have spent obtaining the intended results.
References
Berry, M., 2000. Mastering Data Mining: The Art and Science of Customer Relationship Management. Wiley & Sons.
Gorunescu, F., 2011. Data Mining: Concepts, Models, and Techniques. Springer.
Kamber, M., 2006. Data Mining, Southeast Asia Edition. Elsevier.
Microstrategy.com, 2020. Data Mining. [Online] Available at: https://www.microstrategy.com/us/resources/introductory-guides/data-mining-explained [Accessed 24 January 2020].
Passerini, A., 2014. Improving Activity Recognition by Segmental Pattern Mining.
Phung, D., 2018. Advances in Knowledge Discovery and Data Mining. 22nd ed. Springer.
Piatetsky-Shapiro, G., 2000. The Data-Mining Industry Coming of Age. IEEE Intelligent Systems.
Roth, H., 1979. A Cognitive Model of Planning. Cognitive Science, 3, pp.275-310.
Siguenza-Guzman, L., 2015. Literature Review Of Data Mining Applications In Academic Libraries. The Journal of Academic Librarianship, 41, pp.499-510.
Skiadas, C.H., 2019. Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining. John Wiley & Sons.