
Optical Character Recognition (OCR)

What is OCR?

Optical Character Recognition (OCR) is a technology that recognizes printed or handwritten text in scanned paper documents, PDF files, images, or video and converts it into machine-encoded text. This conversion enables digital manipulation of the content, facilitating tasks such as text searching, editing, storage, and digital archiving. In Koverse, OCR is used to store text that is not already machine-encoded in these sources into datasets, and to make that text searchable.

Usage

Koverse's OCR functionality is available only through the API and is compatible only with S3 and URL data sources. It is disabled by default because OCR adds processing overhead that can slow ingest. To enable it, set the processWithOcr boolean to true in the connectionInfo attribute of the API call, which turns on text recognition for the ingested files. Once processing is complete, searches also run against the OCR-extracted text, so you can search within images and files that would not otherwise be searchable. One option for working with the API is the Koverse Python Connector, which abstracts away some of the complexity.
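As a minimal sketch of the rule described above, the following helper switches processWithOcr on inside a dataSourceParams dictionary and rejects data source types other than S3 and URL. The function name with_ocr and the constant OCR_COMPATIBLE_TYPES are hypothetical names chosen for illustration; they are not part of the Koverse API or the Koverse Python Connector.

```python
# Data source types that the text above lists as OCR-compatible.
OCR_COMPATIBLE_TYPES = {"S3", "URL"}


def with_ocr(data_source_params: dict) -> dict:
    """Return a copy of dataSourceParams with processWithOcr set to True.

    Hypothetical helper: it only prepares the request fragment and does
    not call the API.
    """
    source_type = data_source_params.get("type")
    if source_type not in OCR_COMPATIBLE_TYPES:
        raise ValueError(
            f"OCR is not available for data source type {source_type!r}"
        )
    # Copy connectionInfo so the caller's dictionary is left untouched.
    connection_info = dict(data_source_params.get("connectionInfo", {}))
    connection_info["processWithOcr"] = True
    return {**data_source_params, "connectionInfo": connection_info}
```

Validating the type on the client side is a design choice in this sketch; the API may perform its own checks.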

Feel free to explore the API Reference for additional API details.

HTTP Response Codes

When working with the API, you will encounter HTTP response codes. The codes span a range of values, but they fall into a few categories, which are summarized in the table below:

Status Code | Category | Summary
--- | --- | ---
201 | Successful | Created: Indicates the request has succeeded and led to the creation of a new resource.
4XX | Client Error | These codes indicate an error made by the client, such as a bad request or an unauthorized attempt to access a resource, e.g. 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found.
5XX | Server Error | These codes indicate an error on the server's side, meaning the server failed to fulfill a valid request, e.g. 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout.
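The categories in the table can be expressed as a small lookup. This is an illustrative sketch only: the function name describe_status is hypothetical, and treating the whole 2XX range as successful follows general HTTP convention rather than anything specific to Koverse.

```python
def describe_status(code: int) -> str:
    """Map an HTTP status code to the category names used in the table."""
    if 200 <= code < 300:
        # Assumption: any 2XX code counts as successful, not only 201.
        return "Successful"
    if 400 <= code < 500:
        return "Client Error"
    if 500 <= code < 600:
        return "Server Error"
    return "Other"
```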

Authenticate

Email and Password Authentication

This section corresponds to the Authentication for Email and Password section of the Koverse API Reference.

To authenticate via email and password, send a properly formed request to the standard authentication API endpoint https://api.app.koverse.com/authentication. The request provides the strategy (enum or proxy), the email address, the password, and the workspaceId you want to authenticate against. A successful response returns the accessToken along with the authentication and user objects. If the request fails, the error response contains the name of the error, the specific error message, and the HTTP response status code.

Example Email Auth Request:

{
  "strategy": "local",
  "email": "string",
  "password": "string",
  "workspaceId": "string"
}

Example Error Response:

{
  "name": "string",
  "message": "string",
  "code": 0
}
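An error body with this shape can be turned into an exception on the client side. The sketch below is one possible approach, not Koverse's own client code: KoverseApiError and parse_error are hypothetical names, and only the three fields shown in the example (name, message, code) are assumed to be present.

```python
import json


class KoverseApiError(Exception):
    """Hypothetical exception type wrapping the error response fields."""

    def __init__(self, name: str, message: str, code: int):
        super().__init__(f"{code} {name}: {message}")
        self.name = name
        self.message = message
        self.code = code


def parse_error(body: str) -> KoverseApiError:
    """Build an exception from a JSON error body with name, message, code."""
    data = json.loads(body)
    return KoverseApiError(data["name"], data["message"], data["code"])
```

A caller could raise the returned exception whenever a response carries a 4XX or 5XX status code.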

Example Successful Response:

{
  "accessToken": "string",
  "authentication": {
    "accessToken": "string",
    "payload": {
      "email": "string",
      "exp": 0,
      "iat": 0,
      "iss": "string",
      "jti": "string",
      "sub": "string"
    },
    "strategy": "string"
  },
  "user": {
    "avatar": "string",
    "changeEmailTokenExpiration": "string",
    "createdAt": "string",
    "deletedAt": "string",
    "displayName": "string",
    "email": "string",
    "firstName": "string",
    "githubId": "string",
    "googleId": "string",
    "id": "string",
    "lastName": "string",
    "linkedAccounts": [
      "string"
    ],
    "microsoftId": "string",
    "oktaId": "string",
    "stripeCustomerId": "string",
    "updatedAt": "string",
    "verified": true,
    "workspaceCount": 0
  }
}
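The email and password request above could be assembled with the Python standard library roughly as follows. This is a sketch under stated assumptions: the text does not name the HTTP method or required headers, so POST and the JSON Content-Type header are assumptions, and build_email_auth_request is a hypothetical helper name. The function only builds the request; it does not send it.

```python
import json
import urllib.request

AUTH_URL = "https://api.app.koverse.com/authentication"


def build_email_auth_request(
    email: str, password: str, workspace_id: str
) -> urllib.request.Request:
    """Build (but do not send) an email/password authentication request."""
    payload = {
        "strategy": "local",  # value taken from the example request above
        "email": email,
        "password": password,
        "workspaceId": workspace_id,
    }
    return urllib.request.Request(
        AUTH_URL,
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the endpoint expects a JSON body sent via POST.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the request to urllib.request.urlopen and decode the JSON body; on success the accessToken field should be present, as in the example response.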

SSO Authentication

This section corresponds to the Authentication for SSO section of the Koverse API Reference.

To authenticate via SSO, send a properly formed request to the SSO authentication API endpoint https://api.app.koverse.com/authentication?workspaceId={workspaceId}, replacing {workspaceId} with the ID of your target workspace. Set the authentication strategy to one of keycloak, microsoft, google, github, okta, or custom, and provide the access_token retrieved by signing in to the SSO account itself. If the request is successful, you will receive both an authentication object and a user object containing the details needed to make subsequent authenticated requests against the API endpoint.

Example SSO Auth Request:

{
  "strategy": "keycloak",
  "access_token": "string"
}

Example Error Response:

{
  "name": "string",
  "message": "string",
  "code": 0
}

Example Successful Response:

{
  "accessToken": "string",
  "authentication": {
    "accessToken": "string",
    "payload": {
      "email": "string",
      "exp": 0,
      "iat": 0,
      "iss": "string",
      "jti": "string",
      "sub": "string"
    },
    "strategy": "string"
  },
  "user": {
    "avatar": "string",
    "changeEmailTokenExpiration": "string",
    "createdAt": "string",
    "deletedAt": "string",
    "displayName": "string",
    "email": "string",
    "firstName": "string",
    "githubId": "string",
    "googleId": "string",
    "id": "string",
    "lastName": "string",
    "linkedAccounts": [
      "string"
    ],
    "microsoftId": "string",
    "oktaId": "string",
    "stripeCustomerId": "string",
    "updatedAt": "string",
    "verified": true,
    "workspaceCount": 0
  }
}
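An SSO authentication request could be built in a similar way. As a hedged sketch: the HTTP method (POST) and the Content-Type header are assumptions not stated in the text, and build_sso_auth_request and SSO_STRATEGIES are hypothetical names. The workspaceId goes into the query string and the strategy list mirrors the values named above.

```python
import json
import urllib.parse
import urllib.request

AUTH_URL = "https://api.app.koverse.com/authentication"
# Strategy values listed in the SSO section above.
SSO_STRATEGIES = {"keycloak", "microsoft", "google", "github", "okta", "custom"}


def build_sso_auth_request(
    strategy: str, access_token: str, workspace_id: str
) -> urllib.request.Request:
    """Build (but do not send) an SSO authentication request."""
    if strategy not in SSO_STRATEGIES:
        raise ValueError(f"Unsupported SSO strategy: {strategy!r}")
    # URL-encode the workspace ID so special characters stay valid.
    query = urllib.parse.urlencode({"workspaceId": workspace_id})
    payload = {"strategy": strategy, "access_token": access_token}
    return urllib.request.Request(
        f"{AUTH_URL}?{query}",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the endpoint expects a JSON body sent via POST.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```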

Create an Ingest Job

This section corresponds to the Create an Ingest Job section of the Koverse API Reference.

To create an ingest job that imports data into Koverse, send a properly formed request to the ingest API endpoint https://api.app.koverse.com/ingest. Set the target datasetId, which must be available within your authenticated workspace, and set the attributes of dataSourceParams via the following elements:

  • type: the type of the data source (Enum: "URL" "JDBC" "S3" "KAFKA" "OTHER")
  • connectionInfo: contains the configuration parameters for making a connection to ingest data
  • securityLabelInfo: contains the configuration parameters for the security label parser
  • securityLabeled: boolean, set to true if ingest source data contains security labels

You d

Example Ingest Job Request:

{
  "datasetId": "6586f21b-ad4d-4d06-a309-712af47184a2",
  "dataSourceParams": {
    "type": "URL",
    "connectionInfo": {
      "urls": "- \"https://kisp-test.s3-us-west-2.amazonaws.com/nightly_tests/test.csv\"\n- \"https://kisp-test.s3-us-west-2.amazonaws.com/nightly_tests/test.xls\"\n",
      "processAsDocument": true,
      "processWithOcr": false
    },
    "securityLabelInfo": {
      "fields": [
        "string"
      ],
      "label": "string",
      "labelHandlingPolicy": "ignore",
      "parserClassName": "simple-parser",
      "replacementString": "string"
    },
    "securityLabeled": true
  }
}

Example Error Response:

{
  "name": "string",
  "message": "string",
  "code": 0
}

Example Successful Response:

{
  "name": "string",
  "message": "string",
  "code": 202
}
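Putting the pieces together, an OCR-enabled ingest request for a URL data source might be assembled as in the sketch below. Several details are assumptions rather than documented behavior: the POST method, passing the accessToken as a Bearer token in an Authorization header, omitting securityLabelInfo when securityLabeled is false, and the format of the urls string, which is inferred from the example request above. The helper name build_ingest_request is hypothetical.

```python
import json
import urllib.request

INGEST_URL = "https://api.app.koverse.com/ingest"


def build_ingest_request(
    access_token: str, dataset_id: str, urls: list[str]
) -> urllib.request.Request:
    """Build (but do not send) an ingest job request with OCR enabled."""
    # The example request encodes the URLs as one string, one
    # '- "<url>"' entry per line; this mirrors that shape.
    urls_field = "".join(f'- "{url}"\n' for url in urls)
    payload = {
        "datasetId": dataset_id,
        "dataSourceParams": {
            "type": "URL",
            "connectionInfo": {
                "urls": urls_field,
                "processAsDocument": True,
                "processWithOcr": True,  # enables OCR for this ingest
            },
            # Assumption: securityLabelInfo can be left out when the
            # source data carries no security labels.
            "securityLabeled": False,
        },
    }
    return urllib.request.Request(
        INGEST_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the token is sent as a Bearer token.
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )
```

The access_token argument would be the accessToken value returned by one of the authentication requests described earlier.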