Longitudinal user behavioral data are collected to track users’ interactions with digital products or information systems at different points in time. Such data are ubiquitous across a wide range of digital businesses. However, when it comes to leveraging those data, missing data problems may prove just as common.
For various reasons, such as poor data onboarding and unreliable data sources, many businesses lose data. This missing data makes it significantly harder to provide accurate insights. Moreover, the gaps often follow complex patterns, which makes them even more difficult to handle.
At Amplitude, our digital optimization system helps companies track longitudinal user activity on digital products, which is used to generate insights about product optimization, user engagement, churn prevention, and more. To ensure trustworthy and accurate results for all those tasks, it has been a long-standing mission of our machine learning team to effectively handle missing data patterns.
In an upcoming paper to be published at the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’21), we collaborated with researchers from the University of Illinois at Urbana-Champaign (UIUC) to propose a new machine learning method for this long-standing problem of missing data in data analysis. Specifically, we propose a multiresolution tensor completion method for handling missing data patterns in our event-based user behavioral data.
Tensor completion is a classical data imputation technique for multi-dimensional data. For example, given a 3D tensor of user-product-time, the tensor element (x, y, z) is a binary number indicating whether user x bought product y in month z. Often, a large portion of the input tensor is missing, and tensor completion aims to estimate those missing elements. In our proposed multiresolution tensor completion method (abbreviated to “MTC”), we tackle two missing data patterns to achieve more accurate tensor completion:
- Partial observation: Only a small subset of data elements exists in the input data tensor. For example, we only observe a small percentage of user-product relevance scores based on limited historical user transactions, while the vast majority of the user-product scores are unknown.
- Coarse observation: Some tensor dimensions only have coarse and aggregated patterns (e.g., monthly summary instead of daily reports).
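To make the two patterns concrete, here is a minimal NumPy sketch on a toy user-product-time tensor. The sizes, the 30% observation rate, and the block length of 3 are all hypothetical values chosen for illustration; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 4 users, 3 products, 6 time steps.
n_users, n_products, n_steps = 4, 3, 6
full = rng.integers(0, 2, size=(n_users, n_products, n_steps)).astype(float)

# Partial observation: a boolean mask marks the few entries we actually see.
mask = rng.random(full.shape) < 0.3
partial = np.where(mask, full, np.nan)  # NaN stands for "missing"

# Coarse observation: the time mode is only available as sums over
# blocks of 3 consecutive steps (for example, daily values rolled up).
block = 3
coarse = full.reshape(n_users, n_products, n_steps // block, block).sum(axis=3)

print(partial.shape, coarse.shape)
```

A completion method sees only `partial` and `coarse` and has to recover `full`.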
The specific test bed used in the paper comes from healthcare product analytics: spatio-temporal disease and healthcare demand prediction using historical observations of disease counts at specific locations and time points. In particular, a fine-granular observation tensor is constructed as a 3D tensor of disease code by zip code by date, while two aggregated 2D tensors (i.e., two matrices) are also present: (1) disease categories (coarse-level diseases) by county, and (2) disease categories by week. The goal is to accurately estimate all the entries in the fine-granular 3D tensor.
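The relationship between the fine-granular tensor and the two aggregated matrices can be sketched with 0/1 membership matrices. This is an illustration under assumed hierarchies: the `membership` helper, the toy sizes, and the code-to-category, zip-to-county, and date-to-week mappings are all hypothetical.

```python
import numpy as np

def membership(groups, n_groups):
    """Return an (n_groups, n_items) 0/1 matrix; row g selects the items in group g."""
    m = np.zeros((n_groups, len(groups)))
    m[groups, np.arange(len(groups))] = 1.0
    return m

rng = np.random.default_rng(1)
# Hypothetical toy sizes: 6 disease codes, 4 zip codes, 14 dates.
fine = rng.poisson(2.0, size=(6, 4, 14)).astype(float)

# Hypothetical hierarchies: index of the coarse group for each fine item.
code_to_category = np.array([0, 0, 1, 1, 2, 2])
zip_to_county = np.array([0, 0, 1, 1])
date_to_week = np.arange(14) // 7

P_cat = membership(code_to_category, 3)   # 3 x 6
P_county = membership(zip_to_county, 2)   # 2 x 4
P_week = membership(date_to_week, 2)      # 2 x 14

# Category by county: aggregate diseases and zip codes, sum out the date mode.
cat_by_county = np.einsum("ci,kj,ijt->ck", P_cat, P_county, fine)
# Category by week: aggregate diseases and dates, sum out the zip code mode.
cat_by_week = np.einsum("ci,wt,ijt->cw", P_cat, P_week, fine)

print(cat_by_county.shape, cat_by_week.shape)
```

In the real problem the direction is reversed: the aggregated matrices are known, and most of `fine` is not.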
Our MTC Method
To handle missing data patterns with the MTC method, we first subsample all accessible information (i.e., the tensors and the known aggregation matrices) down to the lowest resolution. The subsampling strategy depends on the data type. For continuous dimensions such as time, regular sampling is used, while for categorical dimensions, biased sampling is applied to focus on the feature dimensions with large values or denser observations. For example, in the spatio-temporal disease tensor presented in the paper, time is sampled at regular intervals (e.g., every t observations), and the disease dimension is sampled to keep common diseases.
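The two sampling strategies might look like the sketch below. It assumes that "biased" sampling means keeping the slices with the largest total mass, which is one simple reading of the description above and may differ from the paper's exact rule. The function names, stride, and toy rates are hypothetical.

```python
import numpy as np

def regular_sample(n, stride):
    """Indices for a continuous mode: every `stride`-th position."""
    return np.arange(0, n, stride)

def biased_sample(tensor, axis, keep):
    """Indices for a categorical mode: the `keep` slices with the largest total mass."""
    other_axes = tuple(a for a in range(tensor.ndim) if a != axis)
    mass = np.abs(tensor).sum(axis=other_axes)
    top = np.argsort(mass)[::-1][:keep]
    return np.sort(top)

rng = np.random.default_rng(2)
# Hypothetical disease x zip x date counts; per-disease rates make some diseases common.
rates = np.array([8.0, 0.5, 4.0, 0.2, 6.0, 0.1])
fine = rng.poisson(rates[:, None, None], size=(6, 4, 12)).astype(float)

disease_idx = biased_sample(fine, axis=0, keep=3)    # keep the common diseases
date_idx = regular_sample(fine.shape[2], stride=3)   # every 3rd date
low_res = fine[np.ix_(disease_idx, np.arange(fine.shape[1]), date_idx)]

print(disease_idx, date_idx, low_res.shape)
```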
Next, we solve the low-resolution problem by applying a tensor optimization solver. In the paper, we propose a constrained alternating least squares approach to solve the optimization problem efficiently.
Finally, we interpolate the solution to the next higher resolution to initialize the high-resolution factors. We repeat this process, resolution by resolution, until we obtain a good initialization for the original fine-granular problem.
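For readers unfamiliar with alternating least squares, the sketch below shows the textbook CP-ALS update on a small, fully observed 3-way tensor. It is deliberately simplified: it omits the missing-entry mask, the coupling with the aggregated matrices, and the constraints that the paper's solver handles. The function names and the `init` parameter are hypothetical.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product: (I x R), (J x R) -> (I*J x R)."""
    return (a[:, None, :] * b[None, :, :]).reshape(-1, a.shape[1])

def cp_als(x, rank, n_iter=50, seed=0, init=None):
    """Plain CP-ALS on a fully observed 3-way tensor (no mask, no coupling)."""
    rng = np.random.default_rng(seed)
    if init is not None:
        factors = [f.copy() for f in init]
    else:
        factors = [rng.standard_normal((n, rank)) for n in x.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            # Unfold so rows index `mode`; columns follow the C-order of the other two modes.
            unfolded = np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)
            kr = khatri_rao(others[0], others[1])
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            # Least-squares update of one factor with the other two held fixed.
            factors[mode] = unfolded @ kr @ np.linalg.pinv(gram)
    return factors

def rel_error(x, factors):
    approx = np.einsum("ir,jr,kr->ijk", *factors)
    return np.linalg.norm(x - approx) / np.linalg.norm(x)

# Toy rank-2 tensor built from random factors.
rng = np.random.default_rng(3)
true = [rng.standard_normal((n, 2)) for n in (5, 4, 6)]
x = np.einsum("ir,jr,kr->ijk", *true)

err_1 = rel_error(x, cp_als(x, rank=2, n_iter=1))
err_50 = rel_error(x, cp_als(x, rank=2, n_iter=50))
print(err_1, err_50)
```

Each inner step is an ordinary least-squares problem, so the reconstruction error never increases from one sweep to the next. How quickly it falls depends heavily on the starting factors, which is why a good initialization matters.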
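For a continuous mode such as time, lifting a low-resolution factor to the finer grid can be as simple as linear interpolation. The sketch below assumes that choice; the `upsample_factor` helper and the toy factor values are hypothetical, and the paper's interpolation scheme, especially for categorical modes, may differ.

```python
import numpy as np

def upsample_factor(low, sampled_idx, n_full):
    """Linearly interpolate each column of a low-resolution factor onto 0..n_full-1."""
    full_idx = np.arange(n_full)
    cols = [np.interp(full_idx, sampled_idx, low[:, r]) for r in range(low.shape[1])]
    return np.stack(cols, axis=1)

# Hypothetical rank-2 time factor solved on every 3rd date of a 12-date grid.
sampled_idx = np.array([0, 3, 6, 9])
low_time_factor = np.array([[0.0, 1.0],
                            [3.0, 1.0],
                            [6.0, 4.0],
                            [9.0, 4.0]])

init = upsample_factor(low_time_factor, sampled_idx, n_full=12)
print(init.shape)
```

The rows of `init` at the sampled positions equal the low-resolution solution. Rows in between are interpolated, and rows past the last sample repeat its value, which is how `np.interp` treats points outside the sampled range.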
In our KDD paper, MTC was evaluated on real-world spatio-temporal demand prediction scenarios in a healthcare industry setting. The experiments predict future COVID cases for each location in the United States by mining the longitudinal public health data generated during interactions between patients and healthcare systems.
We evaluated our MTC algorithm against leading tensor completion baselines, including Block Gradient Descent (BGD), B-PREMA, and CMTF-OPT, on accuracy and efficiency metrics: Percent of Fit (PoF), CPU time, and peak memory usage. MTC outperforms all baselines by a wide margin on PoF and CPU time while keeping about the same low space complexity, which shows great promise for its deployment in production.
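As a reference point for the accuracy metric, the sketch below computes PoF under the commonly used definition of one minus the relative Frobenius-norm error. That definition is an assumption here, so check the paper for the exact form used in the experiments. The function name is hypothetical.

```python
import numpy as np

def percent_of_fit(truth, estimate):
    """Assumed definition: 1 - ||truth - estimate||_F / ||truth||_F (higher is better)."""
    return 1.0 - np.linalg.norm(truth - estimate) / np.linalg.norm(truth)

truth = np.ones((2, 2, 2))
print(percent_of_fit(truth, truth))        # perfect reconstruction
print(percent_of_fit(truth, 0.5 * truth))  # every entry off by half
```

Under this definition a perfect reconstruction scores 1, and an all-zero estimate scores 0.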
Leading the Way on Handling Missing Data
At Amplitude, we strive to help all our customers obtain user and product insights that are trustworthy and consistent. The proposed MTC approach is one of our efforts toward that mission, as it translates readily into a general solution for large user-product interaction data from any industry. Generally speaking, we can create an input tensor of user-product-time, along with aggregate tensors of user-product and user-time, and apply the proposed method to accurately estimate every element in the user-product-time tensor. The downstream applications of such a task relate directly to our key products (Amplitude Analytics, Amplitude Recommend, and Amplitude Experiment), which help forecast future user behaviors, recommend what content to show users at any given moment in their journey, and impute user data to reduce bias in experimentation.
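As a rough illustration of that setup, the sketch below builds a user-product-time count tensor and its two aggregates from a tiny event log. The event rows and tensor shape are made up for the example, and this is not a description of Amplitude's production pipeline.

```python
import numpy as np

# Hypothetical event log: one row per event, (user index, product index, month index).
events = np.array([[0, 1, 0],
                   [0, 1, 0],
                   [1, 0, 2],
                   [2, 2, 1],
                   [2, 1, 1]])
shape = (3, 3, 3)  # users x products x months

counts = np.zeros(shape)
np.add.at(counts, tuple(events.T), 1.0)  # accumulates repeated events correctly

user_product = counts.sum(axis=2)  # aggregate over time
user_time = counts.sum(axis=1)     # aggregate over products

print(counts.sum(), user_product.shape, user_time.shape)
```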
This work is just one of the many ways our team is leading the way on data analysis in digital business. Interested in getting involved? Check out our careers page to learn more.
This work is in collaboration with Professors Sun and Solomonik at the University of Illinois Urbana-Champaign and with industry collaborators at IQVIA. A preprint version of the full paper can be found at this link, and the paper will be presented at the KDD conference, held August 14-18, 2021.