Have you ever needed to find that one photo from your ever-growing library of pictures taken with your digital SLR or camera phone? That “picture of the kids at the barbeque” or something similar? In years past, it would have been easy to find – you just took out the photo album and leafed to the page with the pictures from that time. Nowadays, if you posted it to Facebook or Instagram, you might find it by browsing through your posts. But what if the picture you are looking for wasn’t posted anywhere and just sits somewhere on your hard drive? Before we solve that problem with AI, let’s discuss why having a lot of data is a challenge.
Everywhere in the world we are collecting more and more data. Personally, it accumulates on our phones and hard drives. In corporations and businesses, it collects in spreadsheets and different databases. But are we making the best use of our data and using it to find the answers we are looking for?
At Fracta, we work every day with large amounts of data from water utilities in various formats. And water utilities are no different from the rest of us as data collectors – they have an ever-increasing amount of data available to them. Some of it has been collected over decades; some arrives in real time through sensors in the system. How can all that information be handled and put to use?
The first step in utilizing your data is simply to collect and organize it better – for example, making sure the data is always entered into a system of record such as GIS. Having a reliable record of your system is an important starting point, but the true benefits emerge when the data is used for analysis and insights. This can be done in multiple ways; at Fracta, we have successfully used the latest advances in data storage, computing power and algorithm implementations to apply AI (Artificial Intelligence), and specifically Machine Learning, to consume the large amounts of data utilities generate and to provide useful insights from it.
Structuring and cleaning data
Moving from a collection of various data files and formats to actually gaining insight from the data with machine learning requires a fair amount of data cleaning effort. At Fracta, when we work with water utilities, we typically start our process with a data assessment to estimate the completeness of the utility’s data. When we look at utility asset data and historical break records, we often encounter varying levels of data quality – from 99% of important attributes being present to 100% of values missing from some column, such as year of installation.
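A first-pass completeness check like the one described above can be sketched in a few lines of pandas. The column names and toy records here are assumptions for illustration, not Fracta’s actual schema:

```python
# A minimal sketch of a data completeness check, assuming a hypothetical
# asset table with columns like "material" and "install_year".
import pandas as pd

# Toy asset records; in practice these would come from a GIS export.
assets = pd.DataFrame({
    "pipe_id": [1, 2, 3, 4],
    "material": ["CI", "DI", None, "PVC"],
    "install_year": [None, None, None, None],  # entirely missing column
})

# Percentage of missing values per attribute.
completeness = assets.isna().mean() * 100
print(completeness.round(1))
```

A report like this quickly surfaces the two extremes mentioned above: attributes that are nearly complete and columns that are missing entirely.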
Even though every data issue and gap is unique, a lot of data fixing can be done with software tools. Typically, a good approach is to identify any systemic problems with the data and come up with a plan for how best to estimate the missing values. The solutions can range from researching actual historical paper records to filling in values through geospatial analysis using building tax record data. A good practice is to estimate and impute all the missing values while marking in the data set which values are not original but have been corrected. After the basic data has been cleaned and corrected, it can serve as the basis for analysis and insights.
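The impute-and-flag practice described above can be sketched as follows. Filling missing installation years from the median of pipes with the same material is just one illustrative strategy; a real workflow might instead use geospatial analysis or tax-record data, as noted earlier:

```python
# A sketch of imputing missing values while flagging them as corrected.
# Column names and the material-median strategy are illustrative assumptions.
import pandas as pd

pipes = pd.DataFrame({
    "material": ["CI", "CI", "DI", "DI"],
    "install_year": [1955, None, 1978, 1980],
})

# Record which values are estimates rather than original records.
pipes["install_year_imputed"] = pipes["install_year"].isna()

# Fill missing years with the median year for pipes of the same material.
pipes["install_year"] = pipes.groupby("material")["install_year"].transform(
    lambda s: s.fillna(s.median())
)
```

Keeping the flag column alongside the imputed values preserves the distinction between original and corrected data through the rest of the analysis.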
AI doesn’t need perfect data
A great benefit of machine learning is that the amount and variability of the data are no issue for the algorithms. As data engineers, we don’t need to tell the algorithm to weight the age of a pipe or its number of historical breaks more heavily than the soil pH or the average distance from a freeway. Machine learning can find and assign higher importance to the variables with the strongest correlations, while keeping the long-tail variables in the process to detect weaker signals as well.
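This behavior can be demonstrated with a small synthetic experiment. The sketch below is illustrative only – the feature names and generated data are assumptions, not Fracta’s model – but it shows a tree ensemble assigning high importance to the informative variables (age, break history) without anyone telling it to:

```python
# Illustrative only: a random forest learns feature importances on its own.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(0, 100, n)            # pipe age in years (strong signal)
past_breaks = rng.poisson(1.0, n)       # historical break count (signal)
soil_ph = rng.uniform(4, 9, n)          # weak/no signal in this toy data
freeway_dist = rng.uniform(0, 5000, n)  # noise in this toy data

# Synthetic failure probability driven mainly by age and break history.
p = 1 / (1 + np.exp(-(0.05 * age + 0.8 * past_breaks - 5)))
y = rng.random(n) < p

X = np.column_stack([age, past_breaks, soil_ph, freeway_dist])
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

for name, imp in zip(["age", "past_breaks", "soil_ph", "freeway_dist"],
                     model.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

The noise variables stay in the model with low importance, which is exactly what lets this approach pick up weak long-tail signals when they do exist.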
In the Fracta data collection process, we get the basic information about pipe assets and historical failures from the water utilities themselves. On top of those basic data points, Fracta has assembled a large national database with information about soil, weather, transportation, slope, elevation, population and building densities, and more. The information is available nationwide in the United States with good local granularity.
But while there is a lot of information about the environments the pipe assets sit in, sometimes that information is generic and does not fully represent the unique situations on the ground. For example, the backfill used in some decades may have been non-native soil with different corrosive properties than the soil surveys indicate. In that case, machine learning would not find a correlation between the surveyed soil properties and pipe failures. This again highlights a benefit of machine learning: the operator of the analysis does not need to pass judgment on the importance of individual attributes.
When good primary data sources for a certain behavior are lacking, other data points can act as proxies for that information. Take the corrosive backfill example: if no accurate data is available on when and where it was used, could machine learning still help identify the vulnerable pipes? Yes – pipe installation decades, historical breaks and break densities can act as proxy variables indicating where similar backfill practices were used. Likewise, if no accurate traffic load data is available across all the utility’s assets, building and population densities and the sizes of adjacent roads can act as proxies for that information.
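Deriving proxy features like these is usually simple feature engineering. The sketch below assumes hypothetical column names and shows two of the proxies mentioned above, installation decade and break density:

```python
# A sketch of deriving proxy features; columns are hypothetical examples.
import pandas as pd

pipes = pd.DataFrame({
    "install_year": [1952, 1958, 1974, 1991],
    "length_km": [1.2, 0.8, 2.0, 1.5],
    "break_count": [3, 4, 1, 0],
})

# Installation decade as a proxy for era-specific construction practices,
# such as the backfill material commonly used at the time.
pipes["install_decade"] = (pipes["install_year"] // 10) * 10

# Break density (breaks per km) as a proxy for locally degraded conditions.
pipes["break_density"] = pipes["break_count"] / pipes["length_km"]
```

Once derived, these proxy columns simply join the rest of the feature set and the algorithm decides how much weight they deserve.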
Supporting smart decision making
Machine learning algorithms can quickly consume vast amounts of data and build correlation models from it. But analysis by itself is not the main goal of all the data collection – the goal is turning the analysis into actionable insights.
The best way to put advanced data analysis methods like machine learning into practice is to use the results to aid decision making at the utility. Using machine learning does not mean replacing engineering know-how with computer analysis; instead, the predicted future behavior of the utility’s assets becomes a tool that assists utility staff in making, for example, optimal decisions on which pipes in the system to replace.
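In practice, this decision support often amounts to ranking assets by predicted risk and handing engineers a shortlist to review. The sketch below is a hedged illustration – the pipe IDs, probabilities and costs are made up, and the scores would come from whatever model the utility uses:

```python
# A sketch of risk-ranked decision support; all values are hypothetical.
import pandas as pd

pipes = pd.DataFrame({
    "pipe_id": ["P-101", "P-102", "P-103", "P-104"],
    "failure_prob": [0.72, 0.15, 0.55, 0.08],  # model output (assumed)
    "replace_cost": [40_000, 25_000, 30_000, 20_000],
})

# Rank by predicted risk and shortlist the top candidates. Engineers then
# review the list, weighing cost, criticality and local knowledge.
shortlist = pipes.sort_values("failure_prob", ascending=False).head(2)
print(shortlist["pipe_id"].tolist())
```

The point is that the model narrows the search; the final replacement decision stays with the utility’s staff.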
Utilities have an abundance of data at their disposal – use it in a smart way to get the results you need. Like finding the image mentioned at the outset: if you need to browse through every image on your hard drive when looking for a specific picture, it soon becomes a very tedious task. The implementation of AI and machine learning in photo storage applications by Apple, Google and others has added automatic image categorization as a feature, allowing, for example, the people in pictures to be identified and filtered into easily discoverable categories. Having that advanced analysis available as a simple tool makes finding the “kids at the barbeque” picture much easier.
In the same way, advances in machine learning can be used to make sense of, and create actionable insights from, the large collections of data utilities have on their hard drives. Using that data to make decisions that are smart and save money doesn’t need to be difficult or time consuming. Fracta is happy to help, and we provide a free data assessment to give you a picture of how good your data sets are and how accurate the prediction results built from your unique data can be.