In the many years I have worked in analytics, the quantity and complexity of available data have done nothing but grow. This makes it harder and harder for a data scientist to quickly uncover insights hidden in the data and help reach business goals. The good news is that Artificial Intelligence can help make this process faster and more efficient, particularly in areas where traceability might only be a second priority, such as fraud prevention.
In financial processes, we use existing data to predict the future behaviour of customers by developing machine learning models. For example, in Credit Risk Management, data about past payment behaviour can be used to build a machine learning model that helps predict the probability of future payment default. The variety of features (such as past payment behaviour measured as maximum days past due or number of times overdue, specific information from the credit report, or available household income) largely determines the accuracy and success of any predictive model created with Advanced Analytics methods. Consequently, creating features from the available raw data is one of the most crucial tasks, if not the most crucial task, a data scientist faces when building a powerful model with excellent selectivity.
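To make the idea of a hand-crafted feature concrete, here is a minimal sketch of how the two payment-behaviour features mentioned above could be derived from raw payment records. The record layout (`due` and `paid` dates), the function name `payment_features`, and the sample history are assumptions made for illustration, not our production data model.

```python
from datetime import date

def payment_features(payments):
    """Derive two illustrative features from a customer's payment history.

    `payments` is a list of dicts with hypothetical keys 'due' and 'paid'.
    """
    # Days late per payment; early or on-time payments count as 0 days late.
    days_late = [max(0, (p["paid"] - p["due"]).days) for p in payments]
    return {
        "max_days_past_due": max(days_late, default=0),
        "times_overdue": sum(1 for d in days_late if d > 0),
    }

# Made-up payment history for a single customer.
history = [
    {"due": date(2023, 1, 10), "paid": date(2023, 1, 8)},
    {"due": date(2023, 2, 10), "paid": date(2023, 2, 25)},
    {"due": date(2023, 3, 10), "paid": date(2023, 3, 13)},
]
features = payment_features(history)
```

Every feature of this kind has to be thought of, defined and coded by hand, which is exactly the effort that grows with the amount of available data.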
However, with increasing amounts of data available, it becomes more and more time-consuming and complex for the data scientist to consider all variables and their possible combinations and transformations when building a machine learning model. This is especially the case when working with data from new sources in a customer-specific data structure. To overcome this challenge, we at Arvato Financial Solutions use Artificial Intelligence in the form of Automated Feature Generation.
In Automated Feature Generation or Automated Feature Engineering, software automatically combines all available variables and creates either an unlimited number of variable combinations (features) or a limited number of more specific features. Our data scientists, with their understanding of the business and the data, then explore these automatically generated features and decide which ones are both statistically relevant and relevant in the domain as drivers of the predicted target, e.g. the probability of default in credit risk management.
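The combination step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the software we use: the operator set, the column names, the naming scheme for generated features and the sample records are all hypothetical, and real tools typically offer far more transformations.

```python
from itertools import combinations

# Hypothetical set of pairwise operators used to combine two variables.
OPERATORS = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
    "ratio": lambda a, b: a / b if b != 0 else None,  # None marks an undefined ratio
}

def generate_pairwise_features(rows, columns):
    """Create one candidate feature per (column pair, operator) for every row."""
    generated = []
    for row in rows:
        features = {}
        for left, right in combinations(columns, 2):
            for op_name, op in OPERATORS.items():
                features[f"{left}_{op_name}_{right}"] = op(row[left], row[right])
        generated.append(features)
    return generated

# Made-up customer records for illustration only.
customers = [
    {"age": 30, "payments_made": 12, "max_days_past_due": 0},
    {"age": 45, "payments_made": 3, "max_days_past_due": 60},
]
candidate_features = generate_pairwise_features(
    customers, ["age", "payments_made", "max_days_past_due"]
)
```

Even this toy setup turns three variables into twelve candidate features per customer. With hundreds of raw variables and more operators the number of candidates grows very quickly, which is why the subsequent selection step matters so much.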
In other words, the question is: which automatically generated features improve the selectivity of the model enough to justify including them? Automated Feature Generation helps generate potential model features on a large scale, a task that would otherwise be time-consuming for our modellers and dependent on their individual experience. These features can be tested on a training set for their statistical relevance, measured as their information value or univariate Gini. Our data scientists then typically discard features that have no explainable relevance in the domain, in order to eliminate features that are only spuriously correlated. They then test the remaining ones statistically again on a validation or test data set.
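The two relevance measures named above can be sketched as follows. This is one common way to compute them, assuming a binary target where 1 marks a default: the information value sums (share of goods minus share of bads) times the log of their ratio over the bins of a feature, and the univariate Gini is taken as 2 × AUC − 1. The function names, the smoothing constant and the sample data are assumptions for illustration.

```python
import math
from collections import defaultdict

def information_value(bins, targets, smoothing=0.5):
    """Information value of a binned feature against a binary target (1 = bad)."""
    goods, bads = defaultdict(float), defaultdict(float)
    for label, target in zip(bins, targets):
        (bads if target == 1 else goods)[label] += 1
    labels = set(goods) | set(bads)
    # Additive smoothing (an assumption) avoids log(0) for bins without goods or bads.
    total_good = sum(goods.values()) + smoothing * len(labels)
    total_bad = sum(bads.values()) + smoothing * len(labels)
    iv = 0.0
    for label in labels:
        share_good = (goods[label] + smoothing) / total_good
        share_bad = (bads[label] + smoothing) / total_bad
        iv += (share_good - share_bad) * math.log(share_good / share_bad)
    return iv

def univariate_gini(values, targets):
    """Gini = 2 * AUC - 1, with AUC computed by comparing all bad/good pairs."""
    bad = [v for v, t in zip(values, targets) if t == 1]
    good = [v for v, t in zip(values, targets) if t == 0]
    if not bad or not good:
        raise ValueError("both target classes are required")
    wins = sum(1.0 if b > g else 0.5 if b == g else 0.0 for b in bad for g in good)
    return 2 * wins / (len(bad) * len(good)) - 1

# Made-up training sample: days past due and whether the customer defaulted.
days_past_due = [0, 5, 10, 60, 90, 30]
defaulted = [0, 0, 0, 1, 1, 0]
gini = univariate_gini(days_past_due, defaulted)
iv = information_value(["high" if d >= 60 else "low" for d in days_past_due], defaulted)
```

A Gini close to 0 or an information value close to 0 suggests that a candidate feature carries little univariate signal. The pairwise AUC loop is quadratic in the sample size, so a rank-based computation would be preferable for large data sets.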
Automated Feature Generation therefore not only saves significant time in the model-building process by reducing the manual work of our data scientists, it also reduces human subjectivity and balances out different levels of experience. Features that a data scientist might dismiss as irrelevant at first sight, or not even think of, are now suggested for their review by the automated process.
One example where we at Arvato Financial Solutions successfully used Automated Feature Engineering is a fraud machine learning model built for a large telecommunications provider. In this case, Automated Feature Generation yielded two features that could be integrated into the machine learning model: one combining the age of clients with the number of payments made, and one combining credit rating information at the time of application with the existing customer indicator. This improved the selectivity significantly.
Do you want to know more about Automated Feature Engineering? Are you looking for help with getting your data ready for modelling? Do you have any questions? We are looking forward to your email.