Prototyping Analytics

In addition to running transforms that process data sets at scale, Koverse also enables users to perform interactive analysis of data sets at scale via popular tools such as Apache Spark and Zeppelin notebooks, as well as the Spark shells, including pyspark, spark-shell, and sparkR. Users may want to retrieve and analyze part of a data set or a whole data set. As mentioned in the “Indexing and Search Guide”, users may query the whole system or a specific set of data sets, and this can be done from notebooks and shells alike. Many of the examples here use a data set that can be downloaded from the link below; follow the upload instructions to put it into Koverse. https://s3.amazonaws.com/koverse-datasets/financial+demo/employees.csv

Analytics Setup

Please reference the Optional Installation Guide to set up each of the following. The Python client and the Koverse Spark Data Source must be installed before performing these exercises.

  • Python Client

  • Koverse Spark Data Source

  • Zeppelin

API Token

For these examples you should also have an API Token, set up as described in the API Tokens section; it should look something like the following image. These examples will reference this token. Note the hostname and the apiToken:

  • hostname: The FQDN of the server running the koverse-server process; this could be ‘localhost’.

  • apiToken: A Koverse API Token that will have the required access to the Data Set being loaded. You can create API Tokens via the Koverse Admin UI. Use the API Token UUID, not its name.

    ../_images/zeppelinAPIToken2.png

Launching Pyspark

In this document we will use pyspark, but these examples can be performed in a notebook as well. An example of starting the Python Spark shell is seen below; note that the jar file name should match your Koverse Spark Data Source version and the path should be the location where you downloaded the Koverse Spark Data Source jar file:

$ pyspark --jars /home/koverse/<your path>/koverse-spark-datasource-3.2.6.jar

You can also run using the nexus.koverse.com repository.

$ pyspark --repositories http://nexus.koverse.com/nexus/content/groups/public/ --packages com.koverse:koverse-spark-datasource:3.2.6

Note that at this point the SparkContext sc and the SQLContext sqlContext are initialized. To load a Koverse Data Set into a DataFrame:

>>> df = sqlContext.read.format('com.koverse.spark').options(hostname='<your koverse fqdn>', apiToken='<your api token>').load('<your data set name>')
>>> df = sqlContext.read.format('com.koverse.spark').options(hostname='localhost', apiToken='11111111-11111-1111-1111111111111111').load('employees')

Now you have access to the Koverse Data Set via the Spark DataFrame API.
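
At this point standard DataFrame operations are available. The following is only an illustrative sketch; the column names used here (orgCode, name, traderId) come from the employees sample data shown later in this guide:

>>> df.printSchema()
>>> df.count()
>>> df.filter(df['orgCode'] == 'INTR-345').select('name', 'traderId').show(5)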

Connecting to the Koverse Server

Koverse ships with a Python client to allow Python scripts to access the Koverse API. The Koverse Python client uses Apache Thrift to communicate with the Koverse server. It is possible to generate clients for other languages as well. Follow the instructions in the Optional Installation Guide to get the Python client set up. The Koverse Python client can then be used in Python scripts by importing the koverse module:

>>> from koverse import client

The Python client can connect to the hostname of the Koverse Server (note: this is not the address of the Koverse Web App) once the authenticatorId and apiTokenId have been created in Koverse:

from koverse import client
from koverse.thriftgen.security.ttypes import TAuthInfo
client.connect('localhost')
client.auth = TAuthInfo()
client.auth.authenticatorId = 'zeppelin'
client.auth.apiTokenId = '11111111-11111-1111-1111111111111111'
client.auth.externalTokens = []
client.auth.externalGroups = []

If the authentication is unsuccessful an exception is raised:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/koverse/client.py", line 93, in authenticateUser
  tUser = ugClient.authenticateUser(auth, None, parameters)
File "/Library/Python/2.7/site-packages/koverse/thriftgen/usergroup/UserGroupService.py", line 782, in authenticateUser
  return self.recv_authenticateUser()
File "/Library/Python/2.7/site-packages/koverse/thriftgen/usergroup/UserGroupService.py", line 807, in recv_authenticateUser
  raise result.ke
koverse.thriftgen.ttypes.TKoverseException: TKoverseException(_message='No authenticated user found')
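
If you want a script to handle this failure rather than exit, one possible sketch is to catch TKoverseException (importable from koverse.thriftgen.ttypes, as the traceback above shows) around the first call that contacts the server, for example the query() call introduced in the next section:

from koverse.thriftgen.ttypes import TKoverseException

try:
    results = client.query({'orgCode': 'INTR-345'})
except TKoverseException as e:
    # the exception carries the server-side reason, e.g. 'No authenticated user found'
    print('Koverse request failed: {}'.format(e))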

Querying Koverse Data Sets

The Koverse Python Client can be used to interactively query data sets, fetch samples, create data sets, and run transforms. For additional information about queries, see the Indexing and Search Guide.

To query one or more data sets, use the client’s query() method. In this example, we’ll query Koverse for any data set that has the value “INTR-345” in a field named ‘orgCode’.

>>> results = client.query({'orgCode': 'INTR-345'})
>>> print("Datasets Containing query",len(results))
>>> print("Number of records in first dataset",len(list(results[0].records)))
Datasets Containing query 1
Number of records in first dataset 19

Results are returned as a list of Python dicts, each representing a record from a Koverse data set:

>>> import pprint
>>> pprint.pprint(list(results[0].records)[0])
{'businessUnit': 'IN',
 'hireDate': time.struct_time(tm_year=45199, tm_mon=12, tm_mday=31, tm_hour=0, tm_min=3, tm_sec=20, tm_wday=4, tm_yday=365, tm_isdst=0),
 'name': 'James Barlow',
 'orgCategory': 'INTR',
 'orgCategoryDescription': 'Investments - Trading',
 'orgCode': 'INTR-345',
 'traderId': 'TRD0050299'}

Koverse records contain fields and values. Values may be of a simple type such as int or date, but may also contain lists or dicts.
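
Since each record is a plain Python dict, individual fields can be accessed by name. A small sketch using the result from the query above:

>>> record = list(results[0].records)[0]
>>> record['name'], record['orgCode']
('James Barlow', 'INTR-345')
>>> sorted(record.keys())
['businessUnit', 'hireDate', 'name', 'orgCategory', 'orgCategoryDescription', 'orgCode', 'traderId']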

To query a specific set of data sets, specify an optional parameter with a list of data set names to query:

>>> results = client.query({'orgCode': 'INTR-345'}, ['employees'])

or, by using the named parameter ‘datasets’:

>>> client.query({'Close': {'$gt': 100.0}}, datasets=['stocks'])

Clients can also request that the results be limited to a set number, and can request that the Koverse server deliver results beginning at a specified offset. For example:

>>> client.query({'Close': {'$gt': 100.0}}, datasets=['stocks'], limit=10, offset=100)
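
limit and offset can be combined to page through a large result set. The loop below is only a sketch; it assumes the server returns an empty record list once the offset moves past the end of the results:

>>> offset = 0
>>> page_size = 10
>>> while True:
...     results = client.query({'Close': {'$gt': 100.0}}, datasets=['stocks'], limit=page_size, offset=offset)
...     records = list(results[0].records) if results else []
...     if not records:
...         break
...     for record in records:
...         print(record['Close'])
...     offset += page_size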

Clients can also request that the Koverse Server return only a subset of the fields in each record by specifying a list of field names to include:

>>> pprint.pprint(client.query({'Close': {'$gt': 100.0}}, datasets=['stocks'], limit=10, offset=100, fields=['Close']))
  [{'Close': 110.88},
  {'Close': 111.56},
  {'Close': 111.25},
  {'Close': 110.75},
  {'Close': 111.63},
  {'Close': 111.25},
  {'Close': 111.5},
  {'Close': 111.25},
  {'Close': 111.5},
  {'Close': 111.5}]

Fetching Data Set Samples

Koverse data sets may contain a large volume of records, so it can be useful to work with a sample of a data set’s records, especially when building statistical models designed to be trained on a representative sample. Koverse maintains representative samples for all data sets by default. These samples can be retrieved by the client using the getSamples() method:

>>> samples = client.getSamples(dataset='employees', maxRecords=100, removeByteArrays=False, maxStringLength=100)
>>> pprint.pprint(list(samples))
[{'businessUnit': 'IN',
  'hireDate': time.struct_time(tm_year=40242, tm_mon=4, tm_mday=22, tm_hour=20, tm_min=3, tm_sec=20, tm_wday=4, tm_yday=112, tm_isdst=0),
  'name': 'Charde Moore',
  'orgCategory': 'INTR',
  'orgCategoryDescription': 'Investments - Trading',
  'orgCode': 'INTR-345',
  'traderId': 'TRD0050376'},
 {'businessUnit': 'PB',
  'hireDate': time.struct_time(tm_year=40008, tm_mon=8, tm_mday=22, tm_hour=16, tm_min=20, tm_sec=0, tm_wday=4, tm_yday=235, tm_isdst=0),
  'name': 'Evelyn Carr',
  'orgCategory': 'PBCS',
  'orgCategoryDescription': 'Private Banking - Client Services',
  'orgCode': 'PBCS-67',
  'traderId': 'TRD0050130'},
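
Because the samples come back as a list of Python dicts, they can be handed directly to other Python tools for exploration. For example, here is a sketch that loads them into a pandas DataFrame; this assumes pandas is installed in your Python environment and is not part of the Koverse client:

>>> import pandas as pd
>>> samples = client.getSamples(dataset='employees', maxRecords=100, removeByteArrays=False, maxStringLength=100)
>>> sample_df = pd.DataFrame(list(samples))
>>> sample_df['orgCategory'].value_counts()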

Spark Shell (Scala)

An example of starting the Scala Spark shell is seen below:

$ spark-shell --repositories http://nexus.koverse.com/nexus/content/groups/public/ --packages com.koverse:koverse-spark-datasource:3.2.6

Like with the PySpark shell, at this point the SparkContext sc and the SQLContext sqlContext are initialized. To load a Koverse Data Set into a DataFrame:

scala> val df = sqlContext.read.format("com.koverse.spark").option("hostname", "<your koverse fqdn>").option("apiToken", "<your api token>").load("<your data set name>")

SparkR

An example of starting the R Spark shell is seen below. Note, this has the prerequisite of the R runtime already being installed:

$ sparkR --repositories http://nexus.koverse.com/nexus/content/groups/public/ --packages com.koverse:koverse-spark-datasource:3.2.6

To load a Koverse Data Set into a DataFrame:

df <- read.df(sqlContext, "com.koverse.spark", hostname="<your koverse fqdn>", apiToken="<your api token>", path="<your data set name>")

Apache Zeppelin

Apache Zeppelin is a notebook tool that allows developers to create code and comments in an interactive manner without requiring a full development environment. Refer to the Optional Installation Guide to install Zeppelin. You can now proceed with creating Zeppelin notebooks that access Koverse. Simply create a new notebook and then create a new data frame using the Koverse Spark Data Source as follows:

// Connect to the Koverse Spark Data Source on localhost and create a data frame using the data set named "employees"
val df = sqlContext.read.format("com.koverse.spark").option("hostname", "localhost").option("apiToken", "99ff62de-42ac-4b8b-b7dd-79b02bb50da2").load("employees")

Writing Data to Koverse

The examples above describe fetching data or querying a data set. Koverse also provides the ability to write back to a new data set. This allows you to fetch data, transform it, and then write it back. The paragraphs below are from a Zeppelin notebook, but they could also be run in pyspark or other shells. The first paragraph retrieves data from the employees data set the same way it was done in the examples above.

../_images/baseDataFrame.png
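
The paragraph in the image above is essentially the same load call used earlier in this guide; a sketch of what it might look like, with the hostname and apiToken as placeholders:

%pyspark
df1 = sqlContext.read.format('com.koverse.spark').options(hostname='localhost', apiToken='<your api token>').load('employees')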

The next paragraph manipulates this DataFrame, selecting a subset of the columns and adding one column.

../_images/newDataframe.png
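
The exact columns in the image are not reproduced here, but a sketch of that kind of transformation could look like the following; the column names come from the employees data set sampled earlier, and the salary column and its value are made up purely for illustration:

%pyspark
from pyspark.sql.functions import lit

# keep a subset of the columns and add one new column
df2 = df1.select('name', 'orgCode', 'traderId').withColumn('salary', lit(50000))
df2.show(5)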

Then, just as you retrieved data from Koverse, it can easily be written back to Koverse as well; note the mode:

%pyspark
df2.write.format('com.koverse.spark').options(hostname='localhost', apiToken='11111111-11111111-11111111-11111111').mode('Overwrite').save('employeesSalary')

Now this data set can be used inside of Koverse, and if indexing is enabled it can be queried like any other data set.
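
As a quick check, the new data set can also be queried by name from the Python client shown earlier; this sketch assumes the orgCode column was kept in df2 and that indexing is enabled:

>>> results = client.query({'orgCode': 'INTR-345'}, datasets=['employeesSalary'])
>>> len(list(results[0].records))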

../_images/koverseScreen2.png

SQL on Koverse

Often a user may need to run Spark SQL on a data set inside a notebook like Zeppelin. The two paragraphs below show how a data set can be registered as a temporary table and then used in a SQL query that generates a pie chart.

First register the temp table, using the df1 that was created above:

%pyspark
df1.registerTempTable("EmployeeData")

Then add a SQL query and use the built-in functionality to create a pie chart.
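
The query in the image is not reproduced here, but a pie-chart-friendly aggregation over the registered EmployeeData table could look like the sketch below (the grouping field is chosen purely for illustration); selecting the pie chart display in the Zeppelin paragraph then renders the result:

%sql
SELECT orgCategoryDescription, COUNT(*) AS employeeCount
FROM EmployeeData
GROUP BY orgCategoryDescription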

../_images/piechart.png