
Using AWS QuickSight and Athena to query DataDog Events

DataDog is an observability platform that provides a wide range of solutions for monitoring and alerting on cloud-scale applications. I have had the pleasure of working with DataDog over the past few years. I have used it for Application Performance Monitoring (APM), Database Monitoring (DBM), and Log Aggregation, and recently started using its Real User Monitoring (RUM) capabilities.

RUM is a performance monitoring process that collects detailed information about an end user's interactions with a given application. You can use it to measure how users interact with page elements, track their mouse movements, and detect signs of frustration. Combined with APM tracing, RUM gives you a complete view of user activity, service traces and database interactions.

You can use all of this collected data to create dashboards and visualizations in your DataDog console. These can alert you when something goes wrong, and you can use them to troubleshoot your applications very effectively. However, the collected data is subject to data retention policies. These vary from service to service, and I found out that RUM data is retained by default for only one month. This means your dashboards can't go further back than that, so you can't really perform historical analysis. The DataDog support team can work with you to increase this retention period to three months, but if you wanted to perform year-over-year historical analysis, you still wouldn't be able to do it.

This is where DataDog's RESTful APIs can be helpful. You can use the RUM Events API to collect and download RUM events from DataDog. You can then store this data on your own terms and perform year-over-year and other historical comparisons.

This write-up will focus on how to achieve this using AWS Lambda to execute the API calls, AWS S3 to store the RUM data, AWS Athena to query it, and AWS QuickSight to analyze and visualize it.

Here is a very basic architecture diagram explaining this setup. Here is the list of components:

  • AWS Lambda Function: a simple Lambda function triggered by a CloudWatch time-based event. You can set the frequency to whatever works for you, but you may want to work with DataDog support to ensure you won't get throttled.
  • AWS S3 Bucket: the storage mechanism for the RUM event data. I decided on a simple folder structure that organizes events by year, month, and day, so that I can easily partition the data on those three attributes and query by time.
  • AWS Secrets Manager: in order to access the DataDog API, you will need to create some secrets. Your controller logic (i.e. the Lambda function) uses them to get authorized against DataDog's REST endpoints. AWS Secrets Manager is very easy to work with and is a good place to store them. You can use Lambda environment variables to tell the function which secrets to fetch and decrypt at runtime (see the sketch after this list).
  • AWS Athena: Athena is a serverless query service (using SQL) that makes it easy to query data in S3 buckets. 
  • AWS QuickSight: QuickSight is a visualization tool that you can use to create highly interactive dashboards. QuickSight can query, analyze and visualize many different data sources, and it has native integration with AWS Athena.
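
Here is a minimal sketch of how the Lambda function might retrieve those secrets and build the DataDog client. The secret names (datadog/api-key, datadog/app-key) are hypothetical; this assumes the AWS SDK v2 Secrets Manager client and the standard key-configuration pattern from the DataDog Java client.

package com.test.config;

import java.util.HashMap;

import com.datadog.api.client.ApiClient;
import com.datadog.api.client.v2.api.RumApi;
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;

public class DataDogClientFactory {

    // Secret names are hypothetical; in practice you would read them from
    // Lambda environment variables rather than hard-coding them.
    private static final String API_KEY_SECRET = "datadog/api-key";
    private static final String APP_KEY_SECRET = "datadog/app-key";

    public static RumApi buildRumApi() {
        SecretsManagerClient secrets = SecretsManagerClient.create();

        String apiKey = secrets.getSecretValue(GetSecretValueRequest.builder()
                .secretId(API_KEY_SECRET).build()).secretString();
        String appKey = secrets.getSecretValue(GetSecretValueRequest.builder()
                .secretId(APP_KEY_SECRET).build()).secretString();

        // Standard authorization setup for the DataDog Java client
        ApiClient apiClient = ApiClient.getDefaultApiClient();
        HashMap<String, String> keys = new HashMap<>();
        keys.put("apiKeyAuth", apiKey);
        keys.put("appKeyAuth", appKey);
        apiClient.configureApiKeys(keys);

        return new RumApi(apiClient);
    }
}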


Accessing and Storing DataDog Events

I ended up using Java for my Lambda function, and I used DataDog's REST client for Java to make the API calls. Here is a quick snippet of this code.
package com.test.service;

import java.io.UnsupportedEncodingException;
import com.test.exception.ServiceException;
import com.test.RunDate;
import com.test.S3Repository;
import com.datadog.api.client.ApiException;
import com.datadog.api.client.PaginationIterable;
import com.datadog.api.client.v2.api.RumApi;
import com.datadog.api.client.v2.model.RUMEvent;
import com.datadog.api.client.v2.model.RUMSort;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

@Service
@Slf4j
@RequiredArgsConstructor
public class DataDogService {
    private final RumApi rumApi;
    private final S3Repository s3Repository;

    @Value("${app.datadog.page-limit:100}")
    private int pageLimit;

    @Value("${app.datadog.filter-query: env:dev @context.custom:whatever service:(application1 OR application2)}")
    private String filterQuery;

    public void collect() throws ServiceException {

        try {         
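            // Query window: everything since the last successful run
            // (checkpoint kept in S3) up to now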
            final var runDate = new RunDate();
            final var lastRunDate = s3Repository.getLastRunDate();

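            // Build the RUM event search: filtered, paginated, oldest first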
            RumApi.ListRUMEventsOptionalParameters parameters = new RumApi.ListRUMEventsOptionalParameters();
            parameters.filterFrom(lastRunDate);
            parameters.filterTo(runDate.getRunDate());
            parameters.filterQuery(filterQuery);
            parameters.pageLimit(pageLimit);
            parameters.sort(RUMSort.TIMESTAMP_ASCENDING);

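            // Iterate the paginated results, persisting each event as its
            // own JSON object in S3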
            PaginationIterable<RUMEvent> result = rumApi.listRUMEventsWithPagination(parameters);
            for (RUMEvent item : result) {
                s3Repository.saveRumEvent(item);
            }

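            // Checkpoint this run so the next invocation picks up where we left off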
            s3Repository.updateLastRunDate(runDate);

        } catch (ApiException | UnsupportedEncodingException e) {
            throw new ServiceException(e.getMessage(), e);
        }
    }
}

In this code, the collect function runs at a given frequency and reaches out to the DataDog API. I use a simple "filterQuery" parameter to tell DataDog which events I am interested in. DataDog provides a very capable query syntax for its metrics and event data, and you can use that to your advantage here. In this case, I am just showing that you can query by environment variables, custom context variables, and application and service names. One other thing to note is the pagination: make sure to set a sensible page limit so that the client can simply paginate through the data.

Once the data is paginated and returned, I simply loop over it and create a JSON file for each event in the result set. Here is how that looks once it is saved.
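
The S3Repository itself is not shown above, so here is a minimal sketch of what saveRumEvent might look like. It assumes the AWS SDK v2 S3 client, Jackson for serialization, and a hypothetical EVENTS_BUCKET environment variable. I am also assuming Hive-style key prefixes (year=/month=/day=), which pays off later when loading partitions; the real code might partition by the event's own timestamp rather than the ingest date.

package com.test;

import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.UUID;

import com.datadog.api.client.v2.model.RUMEvent;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.stereotype.Repository;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

@Repository
public class S3Repository {
    private final S3Client s3Client = S3Client.create();
    private final ObjectMapper objectMapper = new ObjectMapper();
    // Hypothetical environment variable holding the target bucket name
    private final String bucket = System.getenv("EVENTS_BUCKET");

    public void saveRumEvent(RUMEvent event) throws JsonProcessingException {
        OffsetDateTime now = OffsetDateTime.now(ZoneOffset.UTC);
        // Hive-style prefixes line up with the Athena partition columns
        // and let MSCK REPAIR TABLE discover the partitions automatically
        String key = String.format("events/year=%04d/month=%02d/day=%02d/%s.json",
                now.getYear(), now.getMonthValue(), now.getDayOfMonth(),
                UUID.randomUUID());
        s3Client.putObject(
                PutObjectRequest.builder().bucket(bucket).key(key).build(),
                RequestBody.fromString(objectMapper.writeValueAsString(event)));
    }
}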


Creating an Athena Table for S3 Bucket

Once the folders are created and data is being saved, the next step is to create an Athena table so that you can query the S3 data.

You first need to create a database, then create a table and point the table's data location at this particular S3 bucket.
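
In Athena, the database itself is a one-line statement; the name below is illustrative.

CREATE DATABASE IF NOT EXISTS datadog_events_db;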

Here is a great Amazon write-up that explains how this works in detail - https://aws.amazon.com/premiumsupport/knowledge-center/analyze-logs-athena/. You can also use AWS Glue Crawlers to automatically detect data types and structures for Athena, but that is not in the scope of this write-up.
CREATE EXTERNAL TABLE `data_dog_events`(
  `id` string COMMENT 'from deserializer',
  `type` string COMMENT 'from deserializer',
  `attributes` struct<...> COMMENT 'from deserializer')
PARTITIONED BY (
  `year` string,
  `month` string,
  `day` string)
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
  'case.insensitive'='true')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://your_bucket/events'
  
In this case, I left out the attributes definition, which corresponds to the JSON structure of the event data schema from the DataDog API.

Once the table is created, you will need to run some commands to load the JSON data and ensure the partitions are correctly registered. One way to do this is the "ALTER TABLE" command with the "ADD PARTITION" option. You can also use the MSCK REPAIR TABLE command. Check it out here - https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html
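
Both options look roughly like this. The bucket, prefix and dates are illustrative, and note that MSCK REPAIR TABLE only discovers partitions when the S3 keys use Hive-style year=/month=/day= prefixes like the ones sketched earlier.

-- Option 1: register one day's partition explicitly
ALTER TABLE data_dog_events ADD IF NOT EXISTS
  PARTITION (year='2023', month='01', day='15')
  LOCATION 's3://your_bucket/events/year=2023/month=01/day=15/';

-- Option 2: let Athena scan the bucket and register all
-- Hive-style partitions in one shot
MSCK REPAIR TABLE data_dog_events;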

Creating a DataSet and QuickSight Dashboard

Once the Athena table is defined and you are able to query your events data with SQL, you can easily connect AWS QuickSight to Athena.

Here is step-by-step documentation from AWS that explains how to do this - https://docs.aws.amazon.com/quicksight/latest/user/create-a-data-set-athena.html. Once the DataSet is created, it really becomes a matter of using it with QuickSight visualizations.

Here is a list of QuickSight visuals, https://docs.aws.amazon.com/quicksight/latest/user/working-with-visual-types.html, that are available at the time of writing. You can use them to build visuals that answer questions like:
  1. How many unique users do you have?
  2. How many sessions?
  3. What is the ratio of users to sessions?
  4. How are device and browser types distributed across sessions?
  5. Where in the world are users accessing your applications from (via a map visual)?
Most importantly, you can use the year, month and day partitions as dashboard controls, so that you can query aggregated data over multiple days, months and years. This is not possible with DataDog's retention limits in place.
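
For example, an Athena query behind such a dashboard might aggregate users and sessions per day. The field paths under attributes are illustrative here, since they depend on how you defined the attributes struct for your events:

SELECT year, month, day,
       COUNT(DISTINCT attributes.usr.id)     AS unique_users,
       COUNT(DISTINCT attributes.session.id) AS sessions
FROM data_dog_events
WHERE year = '2023'
GROUP BY year, month, day
ORDER BY year, month, day;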

Conclusion

S3, Athena and Lambda functions are pretty cheap to utilize as they are serverless. When I set up QuickSight I used SPICE, which is an in-memory analytics engine. It queries the Athena data (daily, in my case) and keeps the information in memory, which makes the dashboard really fast. You get 1 GB of SPICE capacity for free; at the time of writing, additional capacity was $0.38 per GB per month. My Lambda function executes 24 times a day, and S3 storage is one of the cheapest services there is.

Depending on how many users are going to use your dashboard, you will need to pay for QuickSight. Authors cost about $18 a month, and viewers are charged per 30-minute session. So the more users you have, the more it will cost, as is the case with many AWS services. You can obviously use other visualization engines and query this data with Athena SQL instead.


