Taking Tableau Further: Create Your own Content Management Tool, Part 3

16 minute read

28 Aug 2017

The Last Piece of the Puzzle

In the final part of this series, we’ll build on the data we pulled from Tableau’s Postgres DB in part two to complete our Tableau Server Content Management Tool. This post focuses on using Tableau Server’s APIs to automate the process and reclaim valuable system resources. To complete our tool, we’ll be using the REST API and the lesser-known Vizportal API. Before we begin, let’s review our feature set…

1) Check for all workbooks and datasources that have not been used in N days.

2) Be able to “Whitelist” content that should not be altered. We can do this by “tagging” content directly on server and then reading those tags via the REST API.

3) Based on periods of inactivity and current schedules, decide what course of action to take:

  • Disable the Extract Schedule
  • Remove the Workbook/Datasource from Server
  • Reschedule the Extract to a More Efficient Time Slot

4) Send the Content Owner an email explaining the removal of their content from Server and include the .twb or .tds file as an attachment. If a thumbnail image of the Workbook is available, include that as well.

Setting up the Config

In the last post I introduced a single yaml configuration file that controls the settings needed for our tool to run. To enable our remaining features, we need to add a few more settings to it:

# Location for downloading Workbooks and Datasources from Tableau
download_settings:
    download_path: '/Path/To/Where/Tableau/Content/Will/Be/Downloaded/'

# Settings for our email delivery of content messages
email_settings:
    test_email: 'test@server.com' # A valid email address you can use for testing
    smtp_server: 'smtp.host.com' # The SMTP hostname for your email provider
    smtp_user: 'name@server.com' # A valid email that you can send messages from
    smtp_pw: 'password'
    notify_users: False # We'll use this as an override flag for easily disabling emails

# These settings control how we rebalance extract schedules. I'll go into more detail on this later but with this setup we can easily change our thresholds for content management. Each top-level key is a number of days of inactivity; under it, each entry maps a current schedule frequency to an action or new frequency.
schedule_settings:
    180:
        Any: 'Disable'
    90:
        Any: 'Monthly'
    30:
        Weekly: 'Monthly'
        Daily: 'Weekly'
        Hourly: 'Weekly'
    14:
        Any: 'Weekly'
    7:
        Hourly: 'Daily'
    2:
        Any: 'Daily'

Now comes the fun part. Since most of what we’re looking to do is possible with the REST API we’ll start there.

Tableau Server REST API

Tableau Server ships with a really powerful set of API endpoints that can manage almost everything on our feature list.

Since we’re going to be using our config settings throughout, make sure you include this block at the top of your code. I also recommend using an account with Server Admin privileges so you don’t run into any permissions issues.

import os, sys
from operator import itemgetter
import requests
import xml.etree.ElementTree as ET
import zipfile
import shutil
import logging #If you choose to enable logging, this library makes it really simple
import yaml

# import settings from yaml config
with open("config.yml", 'r') as ymlfile:
    cfg = yaml.load(ymlfile)

tab_server_settings = cfg["tab_server_settings"]
download_settings = cfg["download_settings"]
email_settings = cfg["email_settings"]
schedule_settings = cfg["schedule_settings"]

tab_server_url = tab_server_settings['tab_server_url']
tableau_username = tab_server_settings['tableau_username']
tableau_user_pw = tab_server_settings['tableau_user_pw']
download_path = download_settings['download_path']

Sign In/Sign Out of Tableau Server

Before we can manage anything, we need to authenticate and sign into Tableau Server.

Detecting Your Tableau Server & API Versions

The REST API requires that we know the version of the API we’re using and that differs depending on what version of Tableau Server you are running. You could detect this programmatically or you can refer to the chart found here. To simplify things, we’ll hardcode this for now:

api_version = '2.6' # The latest version as of Tableau Server 10.3

# We'll need this to properly parse the XML responses from our API calls
if api_version <= '2.3':
    xmlns = {'t': 'http://tableausoftware.com/api'}
else:
    xmlns = {'t': 'http://tableau.com/api'}
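If you’d rather detect the version programmatically, servers on API 2.4 or later (Tableau Server 10.1+) expose an unauthenticated serverinfo endpoint you can query. Here’s a rough sketch; the function names are my own, and requests is already imported at the top of our script:

```python
import xml.etree.ElementTree as ET

def parse_api_version(xml_text):
    # The XML namespace differs across server versions, so match on the tag suffix
    root = ET.fromstring(xml_text)
    node = next(e for e in root.iter() if e.tag.endswith('restApiVersion'))
    return node.text

def detect_api_version(server_url):
    # serverinfo requires no authentication and is itself pinned at API 2.4
    response = requests.get(server_url + "/api/2.4/serverinfo")
    return parse_api_version(response.text)
```

You’d call detect_api_version(tab_server_url) once at startup instead of hardcoding api_version.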

We’ll also need to be able to parse our XML responses without running into issues with unicode characters. If you’ve seen the REST API examples from Tableau, this will look familiar:

def _encode_for_display(text):
    """
    Encodes strings so they can display as ASCII in a Windows terminal window.
    This function also encodes strings for processing by xml.etree.ElementTree functions.
    Returns an ASCII-encoded version of the text.
    Unicode characters are converted to ASCII placeholders (for example, "?").
    """
    return text.encode('ascii', errors="backslashreplace").decode('utf-8')

Now we can define our sign_in and sign_out functions:

Sign In

This will return the token, site_id, and user_id that we’ll need to make our other REST API calls.

def sign_in(name, password, site=""):
    global token
    url = tab_server_url + "api/%s/auth/signin" % (api_version)

    #Build the request
    xml_payload_for_request = ET.Element('tsRequest')
    credentials_element = ET.SubElement(xml_payload_for_request,'credentials',name=name, password=password)
    site_element = ET.SubElement(credentials_element, 'site', contentUrl=site)
    xml_payload_for_request = ET.tostring(xml_payload_for_request)

    #Sign in to Tableau Server
    print "Signing in to Tableau Server: %s" % (url)
    server_response = requests.post(url, data=xml_payload_for_request)
    if server_response.status_code != 200:
        print "Problem signing into Tableau Server: %s" % (server_response.text)
        sys.exit(1)
    xml_response = ET.fromstring(server_response.text)

    #Retrieve the token and Site ID (we'll need these for the subsequent API calls)
    token = xml_response.find('t:credentials', namespaces=xmlns).attrib.get('token')
    site_id = xml_response.find('.//t:site', namespaces=xmlns).attrib.get('id')
    user_id = xml_response.find('.//t:user', namespaces=xmlns).attrib.get('id')

    print "Sign in to Tableau Server successful!\n"
    print "token = "+token, "site_id = "+site_id, "user_id = "+user_id+"\n"
    return token, site_id, user_id

Sign Out

When we’re done working with Tableau Server we need to log out and end our session.

def sign_out():
    global token
    url = tab_server_url + "/api/%s/auth/signout" % (api_version)
    headers = {'x-tableau-auth':token}
    server_response = requests.post(url, headers=headers)
    token = None
    if server_response.status_code != 200:
        print server_response.text
    else:
        print "Signed out of Tableau Server successfully.\n"

A Note about Switching Sites

One of the most powerful features of Tableau Server is the ability to stand up separate Sites. Unfortunately, this can make managing content a bit of a bear. Workbooks and Datasources are not common across Sites and if you need to manage the content for your entire Server, you’ll probably need to do so on a Site-by-Site basis. When you sign in to Tableau Server, note how the calls require an element called the contentUrl. This is related to the Site you are logging into. In the example above, we left this blank which signs us into the Default Site. Beginning with version 10.3, Tableau has included a Switch Site endpoint which will allow you to change Sites without providing your user name and password again. If you’re running a version older than 10.3, you’ll need to repeat the Sign In step for every Site you want to manage content for.
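Here’s a sketch of what a Switch Site call could look like, following the same pattern as sign_in. The function names are mine, and it assumes the tab_server_url, api_version, token, and xmlns globals defined above:

```python
import xml.etree.ElementTree as ET

def switch_site_payload(site_content_url):
    # tsRequest -> site element carrying the target Site's contentUrl ("" = Default Site)
    xml_payload = ET.Element('tsRequest')
    ET.SubElement(xml_payload, 'site', contentUrl=site_content_url)
    return ET.tostring(xml_payload)

def switch_site(site_content_url):
    # POST /api/api-version/auth/switchSite (available from API 2.6 / Server 10.3)
    url = tab_server_url + "/api/%s/auth/switchSite" % (api_version)
    headers = {'x-tableau-auth': token}
    response = requests.post(url, data=switch_site_payload(site_content_url), headers=headers)
    xml_response = ET.fromstring(response.text)
    # The response carries a fresh token and the new site's id, just like sign_in
    new_token = xml_response.find('t:credentials', namespaces=xmlns).attrib.get('token')
    new_site_id = xml_response.find('.//t:site', namespaces=xmlns).attrib.get('id')
    return new_token, new_site_id
```

Remember to use the returned token (not the old one) for every call you make against the new Site.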

Query Content on Server

Using the token and site_id from the sign_in function, we can now request content details from the server. We’ll keep parts of this call generic so we can use it as a template for building our other functions.

Note: This function contains 2 different endpoints, one for Workbooks and one for Datasources. Make sure you specify the correct content_type when using it.

def query_content(content_type, site_id, content_luid):
    """
    Queries the server for a specific piece of content.

    'content_type'  is the type of content to be queried. Acceptable values are 'workbook' and 'datasource'.
    'site_id'       is the ID (as a string) of the site on the server where the content is located. Use the value retrieved from sign_in().
    'content_luid'  is the ID (as a string) of the content on the server to be queried.
                    Note that this is not the same as the workbook_id or datasource_id found in the Postgres database.
                    It is the unique id or luid value.

    Returns the name of the content, its owner_id (as a string), and its list of tags.
    """
    headers = {'x-tableau-auth':token}

    # Configure call for querying Workbook
    if content_type == 'workbook':
        workbook_url = tab_server_url + "/api/%s/sites/%s/workbooks/%s" % (api_version,site_id,content_luid)
        content_response = requests.get(workbook_url, headers=headers)
        if content_response.status_code != 200:
            print content_response.text

        xml_response = ET.fromstring(_encode_for_display(content_response.text))

        workbook_name = xml_response.find('.//t:workbook', namespaces=xmlns).attrib.get('name')
        workbook_owner_id = xml_response.find('.//t:owner', namespaces=xmlns).attrib.get('id')

        tags = xml_response.findall('.//t:tag', namespaces=xmlns)
        workbook_tags = []
        for t in tags:
            workbook_tags.append(t.attrib.get('label'))
        print "Workbook name = %s and is owned by %s" % (workbook_name, workbook_owner_id)
        return workbook_name, workbook_owner_id, workbook_tags

    # Configure call for querying Datasource
    elif content_type == 'datasource':
        datasource_url = tab_server_url + "/api/%s/sites/%s/datasources/%s" % (api_version,site_id,content_luid)
        content_response = requests.get(datasource_url, headers=headers)
        if content_response.status_code != 200:
            print content_response.text
        xml_response = ET.fromstring(_encode_for_display(content_response.text))

        datasource_name = xml_response.find('.//t:datasource', namespaces=xmlns).attrib.get('name')
        datasource_owner_id = xml_response.find('.//t:owner', namespaces=xmlns).attrib.get('id')

        tags = xml_response.findall('.//t:tag', namespaces=xmlns)
        datasource_tags = []
        for t in tags:
            datasource_tags.append(t.attrib.get('label'))
        print "Datasource name = %s and is owned by %s" % (datasource_name, datasource_owner_id)
        return datasource_name, datasource_owner_id, datasource_tags

Download Content from Server

Using the query_content function as a template, we can build another function to download the content which we can then archive or send as an attachment to its owner. Here are the endpoints we’ll need for downloading Workbooks and Datasources.

GET /api/api-version/sites/site-id/workbooks/workbook-id/content


GET /api/api-version/sites/site-id/datasources/datasource-id/content
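A sketch of what that download function could look like. The helper for pulling the file name out of the Content-Disposition header is my own addition, and the function assumes the token, download_path, and other globals set up earlier (requests is imported at the top of the script):

```python
def filename_from_disposition(disposition):
    # Tableau returns something like: name="tableau_workbook"; filename="Sales.twbx"
    return disposition.split('filename="')[-1].rstrip('"')

def download_content(content_type, site_id, content_luid):
    headers = {'x-tableau-auth': token}
    # content_type is 'workbook' or 'datasource'; pluralizing it yields the endpoint
    url = tab_server_url + "/api/%s/sites/%s/%ss/%s/content" % (api_version, site_id, content_type, content_luid)
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(response.text)
        return None
    # The server decides the extension (.twb/.twbx or .tds/.tdsx)
    file_name = filename_from_disposition(response.headers['Content-Disposition'])
    saved_path = download_path + file_name
    with open(saved_path, 'wb') as f:
        f.write(response.content)
    return saved_path
```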

Remove Content from Server

Use the template again to build a function for removing content, this time using the DELETE option. Here is the format of the API call from Tableau’s API Reference.

DELETE /api/api-version/sites/site-id/workbooks/workbook-id


DELETE /api/api-version/sites/site-id/datasources/datasource-id
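And a matching sketch for removal, again assuming the globals above (content_endpoint is a hypothetical helper of my own):

```python
def content_endpoint(api_version, site_id, content_type, content_luid):
    # 'workbook' -> /workbooks/<luid>, 'datasource' -> /datasources/<luid>
    return "/api/%s/sites/%s/%ss/%s" % (api_version, site_id, content_type, content_luid)

def remove_content(content_type, site_id, content_luid):
    headers = {'x-tableau-auth': token}
    url = tab_server_url + content_endpoint(api_version, site_id, content_type, content_luid)
    response = requests.delete(url, headers=headers)
    # A successful DELETE comes back as 204 No Content
    if response.status_code != 204:
        print(response.text)
    return response.status_code == 204
```

Make sure you download the content before you delete it; once removed, the .twb or .tds can only come back from your archive.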

Unzip Content (optional)

If you downloaded your Workbooks as .twbx files or Datasources as .tdsx files, you might want to consider only sending the .twb or .tds when you email them to their creators to save on space. The easy way to do this is to rename the file as a .zip file and then extract the Workbook or Datasource file.

You can skip this step if you downloaded your content without the data.

def unzip_content(content_type, file_name, content_name):
    file_path = download_path # Files were saved to the configured download_path

    # Rename the file to .zip
    base = os.path.splitext(file_name)[0]
    os.rename(file_name, base + ".zip")
    zip_file_name = file_path + content_name + ".zip"

    #Extract zip file
    zip_ref = zipfile.ZipFile(zip_file_name,'r')
    #Location where extracted files will go
    if content_type == 'workbook':
        zip_ref.extractall(file_path+"Extracted Workbooks")
        print "Extract of .twb file complete\n"

        #Delete everything that's not the .twb file
        for f in os.listdir(file_path+"Extracted Workbooks"):
            if os.path.isdir(file_path+"Extracted Workbooks/"+f):
                shutil.rmtree(file_path+"Extracted Workbooks/"+f)
            elif not f.endswith('.twb'):
                os.remove(file_path+"Extracted Workbooks/"+f)
        print "Workbook cleanup complete\n"
        new_file = base.replace(download_path,'')+".twb"

    elif content_type == 'datasource':
        zip_ref.extractall(file_path+"Extracted Datasources")
        print "Extract of .tds file complete\n"

        #Delete everything that's not the .tds file
        for f in os.listdir(file_path+"Extracted Datasources"):
            if os.path.isdir(file_path+"Extracted Datasources/"+f):
                shutil.rmtree(file_path+"Extracted Datasources/"+f)
            elif not f.endswith('.tds'):
                os.remove(file_path+"Extracted Datasources/"+f)
        print "Datasource cleanup complete\n"
        new_file = base.replace(download_path,'')+".tds"

    #Delete the zip file
    zip_ref.close()
    os.remove(zip_file_name)
    print "File "+zip_file_name+" removed\n"
    return new_file

Download Workbook Images

To make our emails look professional, we can download a thumbnail of the preview image for a Workbook and include that with the .twb file.

GET /api/api-version/sites/site-id/workbooks/workbook-id/views/view-id/previewImage
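Following the same template, here’s a sketch for saving the thumbnail. The function and helper names are mine, and it assumes the usual globals plus a view_luid for the workbook, which you can get from the Postgres data in part two or the Query Views for Workbook endpoint (requests is imported at the top of the script):

```python
def preview_image_path(download_path, content_name):
    # Where the thumbnail will be written, e.g. /downloads/Sales.png
    return download_path + content_name + ".png"

def download_preview_image(site_id, workbook_luid, view_luid, content_name):
    headers = {'x-tableau-auth': token}
    url = tab_server_url + "/api/%s/sites/%s/workbooks/%s/views/%s/previewImage" % (api_version, site_id, workbook_luid, view_luid)
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None  # No thumbnail available; the email just goes out without one
    image_file = preview_image_path(download_path, content_name)
    with open(image_file, 'wb') as f:
        f.write(response.content)
    return image_file
```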

Query Schedules on Server

And in order to modify content schedules, we’ll need to know what schedules are available on the server…

def query_schedules():
    global server_schedules
    headers = {'x-tableau-auth':token}
    schedule_url  = tab_server_url + "/api/%s/schedules" % (api_version)
    response = requests.get(schedule_url, headers=headers)
    if response.status_code != 200:
        print response.text

    xml_response = ET.fromstring(_encode_for_display(response.text))
    server_schedules = []

    schedules = xml_response.findall('.//t:schedule', namespaces=xmlns)
    for s in schedules:
        schedule_luid = s.attrib.get('id')
        schedule_name = s.attrib.get('name')
        schedule_state = s.attrib.get('state')
        schedule_frequency = s.attrib.get('frequency')
        schedule_type = s.attrib.get('type')
        schedule_details = {'schedule_luid':schedule_luid, 'schedule_name':schedule_name, 'state':schedule_state, 'frequency':schedule_frequency, 'type':schedule_type}
        server_schedules.append(schedule_details)

    return server_schedules
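With that list in hand, a small helper (the name is my own) makes it easy to look up the luid of the schedule we want to move content onto:

```python
def find_schedule(server_schedules, frequency, schedule_type='Extract'):
    # server_schedules is the list of dicts built by query_schedules();
    # return the first Active Extract schedule matching the desired frequency
    for s in server_schedules:
        if s['frequency'] == frequency and s['type'] == schedule_type and s['state'] == 'Active':
            return s['schedule_luid']
    return None
```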

Tableau Vizportal API

While most of what we want to achieve is possible via the REST API, there are a couple of tasks that currently aren’t possible with it but are crucial to our tool, namely working with Extracts. For those we’ll use the Vizportal API.

Note: When using Vizportal, the ID values are not the same luid values that we use with the REST API. With Vizportal, they are the numeric ID that is found in the relevant table of the Postgres DB. Since we’re coupling these calls with the data we retrieved from Postgres, you should have what you need but you may need to validate the IDs if your calls aren’t working.

Signing into Vizportal

Signing in to Tableau Server via Vizportal is a lengthy process. I won’t go into it here but you can read all about how to do that via one of my prior posts.

Update Extract Schedules

Using the Vizportal API we can change the schedule associated with an extract, like moving something from a Daily refresh to once a Week. Pay close attention to the format of the payload variable. The Vizportal API is very particular about how these are structured.

def updateExtractSchedule(task_id, schedule_id, xsrf_token):
    payload = "{\"method\":\"setExtractTasksSchedule\",\"params\":{\"ids\":[\"%s\"], \"scheduleId\":\"%s\"}}" % (task_id, schedule_id)
    endpoint = "setExtractTasksSchedule"
    url = tab_server_url + "/vizportal/api/web/v1/"+endpoint
    headers = {
        'content-type': "application/json;charset=UTF-8",
        'accept': "application/json, text/plain, */*",
        'cache-control': "no-cache",
        'X-XSRF-TOKEN': xsrf_token
    }
    response = session.post(url, data=payload, headers=headers)
    if response.status_code != 200:
        print "Failed to update extract schedule"
    else:
        print "Extract schedule successfully updated"
    return response

Delete Extract Tasks

We can also use Vizportal to delete an Extract refresh task. This will not remove the content from your server, just the related Extract refresh.

def deleteExtractTasks(task_id, xsrf_token):
    payload = "{\"method\":\"deleteExtractTasks\",\"params\":{\"ids\":[\"%s\"]}}" % (task_id)
    endpoint = "deleteExtractTasks"
    url = tab_server_url + "/vizportal/api/web/v1/"+endpoint
    headers = {
        'content-type': "application/json;charset=UTF-8",
        'accept': "application/json, text/plain, */*",
        'cache-control': "no-cache",
        'X-XSRF-TOKEN': xsrf_token
    }
    response = session.post(url, data=payload, headers=headers)
    if response.status_code != 200:
        print "Failed to delete extract schedule"
    else:
        print "Extract schedule successfully removed"
    return response

Emailing Users

Outside of cleaning up our server, letting our users know why we’re doing this is the next most important item. For that we’ll send them an email detailing which content is being removed, the reason why and an attachment containing the .twb or .tds file. You can also include a 2nd attachment using the thumbnail image of the Workbook (if available).

import os, sys
from os import path
import smtplib, ssl
import mimetypes
import email
import email.mime.application
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email.mime.text import MIMEText
from email.utils import formatdate
from email import encoders

run_mode = tab_server_settings['run_mode']
smtp_server = email_settings['smtp_server']
smtp_user = email_settings['smtp_user']
smtp_pw = email_settings['smtp_pw']
test_email = email_settings['test_email']

# Create a text/plain message
def send_email(From, To, Subject, Reason, FilePath="", FileName="", DaysSince=""):
    if run_mode == 'live':
        msg = MIMEMultipart()
        msg['Subject'] = Subject
        msg['From'] = From
        msg['To'] = To
        msg['Date'] = formatdate(localtime=True)

        file_path = FilePath
        file_name = FileName
        if Reason == 'Remove':
            email_body = "Dear %s,\n\n Your content %s has been removed from Tableau Server because it has not been used in %s days. If you feel this has been removed in error, please notify your Tableau Server Admin or republish your content (attached).\n\n" % (To, FileName, DaysSince)
        elif Reason == 'Disable':
            email_body = "Dear %s,\n\n Your content %s has been disabled from updating on Tableau Server because it has not been used in %s days. If you feel this has been done in error, please notify your Tableau Server Admin or reschedule your content.\n\n" % (To, FileName, DaysSince)

        # The main body is just another attachment
        body = MIMEText(email_body)
        msg.attach(body)

        if Reason == 'Remove':
            part = MIMEBase('application', "octet-stream")
            part.set_payload(open(file_path+file_name, "rb").read())
            encoders.encode_base64(part)
            part.add_header('Content-Disposition', 'attachment; filename='+file_name)
            msg.attach(part)

        # send via SMTP server
        s = smtplib.SMTP(smtp_server)
        s.login(smtp_user, smtp_pw) # Skip this line if your SMTP server does not require authentication
        s.sendmail(msg['From'], msg['To'], msg.as_string())
        s.quit()
        logging.info("Email sent to %s." % (To))
    else:
        print "DEMO MODE: Send email to %s with attached file: %s " % (To, FileName)

Connecting all of the Dots…

Now that we’ve defined all of our functions, it’s time to connect the dots and automate our process. To avoid scanning through all the content on the server and to make the script as efficient as possible, I use a Class to determine what action should be taken for each piece of content that meets the criteria we defined in tab_server_settings and schedule_settings.

Determine the Appropriate Action

class TableauContent(object):
    """
    Required for Init:
    content_name: Name of the workbook or datasource
    content_type: workbook or datasource
    content_id: the unique id associated with the workbook or datasource
    task_id: the task_id related to the Extract refresh of the workbook or datasource
    current_schedule: the current schedule type (Hourly, Daily, Weekly, Monthly) of the Extract refresh
    current_schedule_id: the id associated with the current schedule

    Optional:
    new_schedule: the schedule type to change the Extract refresh to (Hourly, Daily, Weekly, Monthly)
    new_schedule_id: the id associated with the new schedule
    days_since: the days since a workbook was last viewed/datasource was last accessed. Default value is Zero if not supplied during Init
    action_needed: the action needed based on the criteria specified in sheriff_config.schedule_settings
    """

    def __init__(self, content_name, content_type, content_id, site_id, task_id, current_schedule, current_schedule_id, new_schedule=None, new_schedule_id=None, days_since=0, action_needed=None):
        """Returns a Content object whose name is *content_name*."""
        self.content_name = content_name
        self.content_type = content_type
        self.content_id = content_id
        self.site_id = site_id
        self.task_id = task_id
        self.current_schedule = current_schedule
        self.current_schedule_id = current_schedule_id
        self.new_schedule = new_schedule #Default to None
        self.new_schedule_id = new_schedule_id #Default to None
        self.days_since = days_since #Default set to Zero
        self.action_needed = action_needed #Default to None

    def action(self):
        """Returns an expected action based on the days since a Workbook was last viewed or a Datasource was last accessed."""
        # If days_since does not exceed any limits in the config, keep the current settings and Do Nothing
        ns = 'Do Nothing'
        cs = self.current_schedule

        # Scan the schedule_settings from the imported config to find the action needed
        for k in sorted(schedule_settings.keys(),reverse=True):
            if self.days_since >= k:
                # Get the correct setting based on days_since
                ss = schedule_settings.get(k)
                # If there is more than one setting, pull the entry that matches the current_schedule
                if len(ss.keys()) > 1:
                    for key,value in ss.items():
                        if self.current_schedule == key:
                            cs = key
                            ns = value
                # Otherwise use the single entry that aligns with the days_since
                else:
                    cs = list(ss.keys())[0]
                    ns = list(ss.values())[0]
                # Stop at the first (largest) threshold that days_since exceeds
                break

        # Set the Action Needed to be Performed
        if ns == 'Disable':
            self.action_needed = 'Disable'
            self.new_schedule = None
            self.new_schedule_id = None
        elif ns == 'Do Nothing':
            self.action_needed = 'Do Nothing'
            self.new_schedule = self.current_schedule
            self.new_schedule_id = self.current_schedule_id
        else:
            self.action_needed = 'Reschedule'
            self.new_schedule = ns

        print "The action_needed for %s has been set to %s" % (self.content_name, self.action_needed)
        return self.action_needed

What this does is interpret the values from our schedule_settings to come up with an appropriate action based on the guidelines below. You can customize this by changing your yaml settings but I’ve found this to be good enough for even the largest servers.

Time Since Last Viewed/Accessed | Current Schedule | Recommended Action
--------------------------------|------------------|---------------------
6 Months+                       | Any              | Disable the Schedule
3 Months+                       | Any              | Move to Monthly
1 Month+                        | Weekly           | Move to Monthly
1 Month+                        | Hourly/Daily     | Move to Weekly
2 Weeks+                        | Any              | Move to Weekly
1 Week+                         | Hourly           | Move to Daily
2 Days+                         | Hourly           | Move to Daily

Designing the Workflow

All that’s left is to create our workflow and connect our functions. Here’s one you can follow for Workbooks.
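In outline, the Workbook loop might look like the sketch below. This is illustrative only: manage_workbooks and the stale_workbooks rows (with columns like luid, task_id, and owner_email from the part-two Postgres queries) are hypothetical names, and the whitelist tag is whatever you chose for feature 2.

```python
def is_whitelisted(tags, whitelist_tag='KEEP'):
    # Feature 2: owners opt content out by tagging it on Server
    return whitelist_tag in tags

def manage_workbooks(stale_workbooks, site_id, xsrf_token):
    # stale_workbooks: rows from the part-two Postgres queries (column names illustrative)
    for wb in stale_workbooks:
        name, owner_id, tags = query_content('workbook', site_id, wb['luid'])
        if is_whitelisted(tags):
            continue  # never touch whitelisted content

        content = TableauContent(name, 'workbook', wb['luid'], site_id,
                                 wb['task_id'], wb['schedule_frequency'],
                                 wb['schedule_id'], days_since=wb['days_since'])
        action = content.action()

        if action == 'Disable':
            # Stop the refresh and tell the owner why
            deleteExtractTasks(content.task_id, xsrf_token)
            send_email(smtp_user, wb['owner_email'], 'Tableau Server cleanup',
                       'Disable', FileName=name, DaysSince=str(content.days_since))
        elif action == 'Reschedule':
            # Find the luid of the schedule whose frequency matches
            # content.new_schedule, then move the refresh task onto it
            new_schedule_id = next(s['schedule_luid'] for s in server_schedules
                                   if s['frequency'] == content.new_schedule)
            updateExtractSchedule(content.task_id, new_schedule_id, xsrf_token)
```

Wrap this in a sign_in/sign_out pair per Site, and you have the whole pipeline.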

Let’s Recap

There’s a bit more work needed to get everything running, but you should now have all the pieces needed to build a robust Content Management Tool for Tableau Server. Let’s recap what we’ve built:

  • We can now identify Workbooks and Datasources that have not been viewed/accessed for a given number of days. In our example, 180 days (6 Months).
  • We have all of the data and functions needed to loop through the Sites on our server and take the necessary action for a Workbook or Datasource.
  • We know who all of the content owners are and the email addresses associated with their server user names and built a mechanism for emailing them their content.
  • We have the list of all schedules available on Tableau Server and a way to modify the schedules related to our extracts.

If you made it this far, thanks for following along. This was a really fun tool to build and I hope you learned as much as I did. I’m excited to hear about how you’ll use this knowledge to build your own tools or automate your own Tableau Server needs. If you have any questions you can reach me through any of the channels listed here on my blog.

Happy Coding!
