Exploring a Data Set

To explore summary information about a data set, click on the ‘Data’ button on the primary navigation menu on the left and select a data set from the list to the right of the navigation menu.

Viewing Data Set Attributes

The ‘Overview’ tab is shown first and displays a summary of all the attributes in the data set. If the data set has a job that is currently running, that information is displayed here as well.

../_images/datasets.png

When any data is imported, Koverse automatically profiles the incoming records and keeps track of information about individual attributes. Information about each of these attributes is displayed here including:

  • the attribute name
  • the number of records in which it is present
  • an estimate of the number of unique values found for this attribute
  • the predominant value type
  • a visualization of the distribution of values
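As a rough illustration of what such a profile contains, the sketch below computes similar statistics over a small list of records in Python. It is not Koverse's implementation: the function name `profile_attributes` and the sample records are hypothetical, the unique count here is exact rather than estimated, and Python type names such as `str` stand in for Koverse value types such as ‘Text’.

```python
from collections import Counter, defaultdict


def profile_attributes(records):
    """Summarize each attribute across a list of dict records."""
    presence = Counter()          # attribute -> number of records containing it
    uniques = defaultdict(set)    # attribute -> distinct values seen (exact count)
    types = defaultdict(Counter)  # attribute -> how often each value type occurs

    for record in records:
        for name, value in record.items():
            presence[name] += 1
            uniques[name].add(repr(value))  # repr() keeps unhashable values countable
            types[name][type(value).__name__] += 1

    total = len(records)
    profile = {}
    for name in presence:
        top_type, top_count = types[name].most_common(1)[0]
        profile[name] = {
            "present_in": presence[name],
            "total_records": total,
            "unique_values": len(uniques[name]),
            "predominant_type": top_type,
            "predominant_type_share": top_count / presence[name],
        }
    return profile


# Hypothetical records, invented for illustration only.
records = [
    {"causeType": "Infrastructure", "cost": 120},
    {"causeType": "Human", "cost": "unknown"},
    {"causeType": "Infrastructure"},
]
profile = profile_attributes(records)
```

In this sketch, `profile["cost"]["present_in"]` is lower than `total_records`, which is the kind of gap the presence count is meant to surface.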

To see the associated visualization for an attribute, click the down arrow at the right of the attribute information.

../_images/attributeVisualization.png

This information can give you a sense of what a particular data set contains, help identify opportunities to answer questions with it in searches or in analytics, and reveal any data quality issues that might exist. For example, as a data scientist I might want to find out which attributes in a data set contain text that I can process to extract a sentiment score. Or I may want to find out which fields contain customer IDs so that I can join this data set with another data set.

If I see that a field isn’t present in all the records, or if not 100% of the values are of the same type, it may be because there are data quality or consistency issues, or it may be another feature of the data that needs to be considered. For example, not all Twitter messages contain hashtags, and I can get a sense of what proportion do from the information in this overview.

For example, after loading the first example data set as described in the Importing Data section, you should be able to select the ‘Bank Security Incidents’ data set to see a list of attributes.

We may not know much about the information contained in a data set and this view helps us figure out what the likely meaning of each attribute is.

For example, the first attribute is called ‘causeType’. In the context of ‘Bank Security Incidents’ we may infer that this contains some information about the cause of each incident.

The presence count for this attribute should be 49,894 out of 49,894 records, so this attribute is present in every record.

The estimated number of unique values for this attribute is 7, so out of almost 50 thousand records we’ve only ever seen 7 unique values.

The data type is 100% Text, which means in every record the type of the value for the ‘causeType’ attribute is ‘Text’. Sometimes an attribute will not always have the same data type in every record.

Clicking on the down arrow by the ‘Visual’ column will show us a visualization of the most frequent values for this attribute. In this case Koverse automatically selected a bar chart to display a histogram of the most frequent values. For example, the ‘Infrastructure’ value showed up in this attribute 3,857 times. Placing your mouse over a bar will display the exact number of records for that value.
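The counts behind a chart like this can be sketched in a few lines of Python. This is only an illustration of counting the most frequent values of one attribute, not how Koverse builds its visualization; the helper name `top_values` and the sample records (including the values other than ‘Infrastructure’) are hypothetical.

```python
from collections import Counter


def top_values(records, attribute, n=10):
    """Return the n most frequent values of an attribute as (value, count) pairs."""
    counts = Counter(r[attribute] for r in records if attribute in r)
    return counts.most_common(n)


# Hypothetical records, invented for illustration only.
sample = (
    [{"causeType": "Infrastructure"}] * 3
    + [{"causeType": "Human"}] * 2
    + [{"causeType": "Weather"}]
)
top_two = top_values(sample, "causeType", n=2)
```

Records that lack the attribute are skipped, so the counts describe only the records in which the attribute is present.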

Clicking on the up arrow at the top of the visualization will collapse this view again. Scrolling down allows us to see other attributes.

Viewing Sample Records

To view records of a data set, click on the ‘Data’ tab. Initially, you will see a representative sample of the records in this data set. This sample is maintained as new data is added so that it represents a subset of records sampled uniformly at random.
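One standard way to keep a uniform random sample up to date as records arrive is reservoir sampling. The sketch below shows that general technique in Python; whether Koverse uses this particular algorithm is an assumption, and the class name `ReservoirSample` is hypothetical.

```python
import random


class ReservoirSample:
    """Keep a fixed-size uniform random sample of a growing stream of records."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.seen = 0        # total number of records offered so far
        self.sample = []     # the current sample, at most `capacity` records
        self._rng = random.Random(seed)

    def add(self, record):
        self.seen += 1
        if len(self.sample) < self.capacity:
            # Until the reservoir is full, keep every record.
            self.sample.append(record)
            return
        # Afterwards, keep the new record with probability capacity / seen,
        # replacing a uniformly chosen existing entry.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.sample[slot] = record


reservoir = ReservoirSample(capacity=5, seed=1)
for record_id in range(1000):
    reservoir.add(record_id)
```

With this approach every record seen so far has the same chance of being in the sample, and the full data set never needs to be re-read when new data is added.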

You can also perform a search to see records matching specific criteria.

Downloading Search Results

When viewing search results for a single data set, the full set of results can be downloaded using the ‘Download Results’ button, as either a CSV file or a JSON file.

CSV files can be loaded into many other tools, such as Microsoft Excel and Tableau, and are a good choice when records consist of simple values and don’t have nested lists or other structures. JSON is a good choice for records that have complex values such as lists and lists of field-value pairs.
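The difference between the two formats can be seen in a short Python sketch using the standard `csv` and `json` modules. The records and field names below (`tradeId`, `tags`, `counterparty`, and so on) are hypothetical and are not taken from the example data sets.

```python
import csv
import io
import json

# Flat records with simple values map directly onto CSV rows and columns.
flat_records = [
    {"tradeId": 1, "trader": "Velma", "amount": 250.0},
    {"tradeId": 2, "trader": "Velma", "amount": 975.5},
]

# A record with a list and a nested set of field-value pairs has no natural
# single-row CSV form, but JSON represents it directly.
nested_record = {
    "trader": "Velma",
    "tags": ["equity", "flagged"],
    "counterparty": {"name": "Acme", "country": "US"},
}


def to_csv(records):
    """Serialize flat dict records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(flat_records)
json_text = json.dumps(nested_record)
restored = json.loads(json_text)  # the nested structure survives a round trip
```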

../_images/downloadSearchResults.png

For example, by clicking the ‘Download Results’ button on our search of Velma’s trade transactions, we can choose to download all the results as either a CSV file or a JSON file. Choose CSV and click ‘Download’.

Your browser will start downloading a file that starts with the phrase ‘bank_trade_transactions’ and ends in ‘.csv’.

Once this is downloaded, you can open it in a third-party application such as Microsoft Excel.
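The downloaded file can also be read programmatically. The following Python sketch looks for a file whose name starts with ‘bank_trade_transactions’ and ends in ‘.csv’ in a given directory and loads its rows. The helper name `load_downloaded_csv`, the download directory, and the UTF-8 encoding are all assumptions, not details confirmed by Koverse.

```python
import csv
from pathlib import Path


def load_downloaded_csv(directory, prefix="bank_trade_transactions"):
    """Load the most recently modified '<prefix>*.csv' file in a directory."""
    matches = list(Path(directory).glob(f"{prefix}*.csv"))
    if not matches:
        raise FileNotFoundError(f"no {prefix}*.csv file found in {directory}")
    newest = max(matches, key=lambda path: path.stat().st_mtime)
    # Assumption: the file is UTF-8 encoded and its first row is a header.
    with newest.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Example call, assuming the browser saves to ~/Downloads:
# rows = load_downloaded_csv(Path.home() / "Downloads")
```

Each row is returned as a dictionary keyed by column name, with every value as a string, so numeric fields need to be converted before doing arithmetic on them.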

For more examples in working with this bank data, see the Configuring Transforms section.

Downloading an Entire Data Set

To download all the records in a data set, click on the circular download button in the upper right corner of the data set detail page.

Records can be downloaded to your browser as a CSV file or a JSON file.

Note that a data set may contain more records than can be stored on a single disk drive. For data sets with more than about a hundred million records, it may not be possible to download the entire set to a desktop or laptop machine.
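A back-of-the-envelope estimate can help decide whether a full download is practical. The Python sketch below multiplies a record count by an average record size and compares the result with free disk space. The 1,000-byte average record size is purely an assumption for illustration, since actual sizes depend on the data set and the chosen format, and the function names are hypothetical.

```python
import shutil


def estimated_download_bytes(record_count, avg_record_bytes):
    """Rough size of a full download: records times average bytes per record."""
    return record_count * avg_record_bytes


def fits_on_disk(record_count, avg_record_bytes, free_bytes):
    """True if the estimated download is no larger than the free space given."""
    return estimated_download_bytes(record_count, avg_record_bytes) <= free_bytes


# Assumed figures: one hundred million records at about 1,000 bytes each.
estimate = estimated_download_bytes(100_000_000, 1_000)  # 100 GB (10**11 bytes)

# Free space on the drive holding the current directory.
free_here = shutil.disk_usage(".").free
can_download = fits_on_disk(100_000_000, 1_000, free_here)
```

When the estimate exceeds the available space, downloading search results for a narrower query is a smaller alternative to downloading the entire data set.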

../_images/download.png