General background information
amberSearch already supports a variety of well-known systems and data sources, which connect to amberSearch out of the box with minimal configuration effort. However, if you wish to expand the range of data sources you can search with amberSearch, you can use our generalized REST API-based approach to load the documents of your in-house managed data source into amberSearch and make them accessible to the employees of your company. Note that customers are responsible for developing and maintaining the client-side jobs that consume the APIs described in the following sections.
Supported document types
You get the best out of amberSearch when you want to intelligently search through unstructured data. The following types are currently supported:
- Regular files (PDF, Word, Excel, PowerPoint, etc.)
- Messages (such as e-mails in any format, e.g. .eml or .msg, or messages from messaging tools)
- Wikis (of any kind, e.g. GitLab Wiki, OneNote, etc.)
- Images
- Video and audio (to be done)
API specifications and data formats
Uploading documents to amberSearch
POST /upload_data
The payload must contain the following keys, regardless of the document type:
"id": String, Unique identitfier. Must begin with the name of the custom data source, followed by the identifier in the scope of this data source, e.g sql_database1_123456789.
"title": String, The title of the document,
"data_source": String, The name of the data source, should be the same for all documents coming from the same source,
"author": Strings, Name(s) of the user(s) who has/have modified the document as last one, defaults to [],
"last_modified": Date, timestamp of the last modification applied on the document, in the format %Y-%m-%dT%H:%M:%SZ
"document_type: file/message/wiki/image/vide/audio.
"allow_token_document": Strings, authorization tokens, which grant users access to the document
If present and meaningful in the context of your custom data source, it is recommended to additionally pass the following keys:
"path": String, recommened, if there is a directories structure present in your data source,
"parent_path": String, recommended if "path" is passed,
"knowledge_str_created_date": Date, indicating when the document was created
"knowledge_str_created_author": String, indicating the creator of the document
However, if custom metadata is relevant, you can create keys on your own as long as they respect the following patterns:
knowledge_str_* : For storing one single string value
knowledge_strings_* : For storing multiple string values in an array
knowledge_int_* : For storing one single integer value
knowledge_ints_* : For storing multiple integer values in an array
knowledge_float_* : For storing one single float value
knowledge_floats_* : For storing multiple float values in an array
knowledge_date_* : For storing one single date value in the format %Y-%m-%dT%H:%M:%SZ
knowledge_dates_* : For storing multiple date values in an array
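For illustration, custom metadata following these patterns could be added to the payload like this (the key suffixes and values are made up for this sketch):

# Hypothetical custom metadata keys following the patterns above;
# "payload" is the dict from the first sketch.
payload.update({
    "knowledge_str_category": "electronics",
    "knowledge_strings_tags": ["notebook", "hardware"],
    "knowledge_int_stock": 42,
    "knowledge_float_price": 1299.99,
    "knowledge_date_warranty_until": "2025-03-28T00:00:00Z",
})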
Additional key needed for documents with document_type=message
"knowledge_str_thread_id": String. This id groups messages that belong to the same conversation/thread.
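Continuing the sketch above, an e-mail message could be uploaded like this (all values are illustrative):

# BASE_URL and HEADERS as defined in the first sketch.
# The thread id groups all messages of one conversation.
message_payload = {
    "id": "mailserver1_msg_987654",
    "title": "Re: Quarterly report",
    "data_source": "mailserver1",
    "author": ["Eva Musterfrau"],
    "last_modified": "2022-04-01T09:30:00Z",
    "document_type": "message",
    "allow_token_document": ["everyone"],
    "knowledge_str_thread_id": "mailserver1_thread_555",
}
requests.post(f"{BASE_URL}/upload_data", json=message_payload, headers=HEADERS)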
How to handle text extraction from your documents:
The power of amberSearch lies in intelligently understanding the semantics of unstructured textual or visual content contained in the documents. There are two ways to provide that content:
- Recommended! If possible, pass the binary of the relevant file alongside the JSON payload from above (see the sketch after this list). The amberSearch backend will then fully take over the content extraction. Throttling limits should be discussed individually with the amberSearch IT team (it@ambersearch.de).
- Alternatively, if you prefer to handle text extraction on the client side, you can pass the extracted text in the JSON payload under the key "text". The limit of 2 million characters must be respected!
Further recommendations:
- For visual/audio content (images, videos, audio files), the binaries should be posted together with the payload if no metadata with semantic meaning is provided in the JSON payload.
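As a sketch of the recommended approach, the binary could be posted together with the JSON payload in a multipart/form-data request. The form field names below are assumptions; please confirm the exact upload format with the amberSearch IT team (it@ambersearch.de).

import json
import requests

# BASE_URL, HEADERS, and payload as in the first sketch.
# Assumption: the binary and the JSON payload are sent as multipart form
# fields named "file" and "payload" -- confirm with it@ambersearch.de.
with open("article_nr_12345.pdf", "rb") as f:
    files = {"file": ("article_nr_12345.pdf", f, "application/pdf")}
    data = {"payload": json.dumps(payload)}
    requests.post(f"{BASE_URL}/upload_data", headers=HEADERS, files=files, data=data)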
Deleting documents from amberSearch
It is the job of the client-side script to inform amberSearch about documents that are deleted from the data source. This way, we can keep the index up to date.
POST /delete_data
Payload:
"id": The unique identifier of the document
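Continuing the sketch above, a client-side job could report a deletion like this:

# BASE_URL and HEADERS as defined in the first sketch.
requests.post(
    f"{BASE_URL}/delete_data",
    json={"id": "sql_database1_123456789"},
    headers=HEADERS,
)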
Updating documents
If you want to update already indexed documents, just call the /upload_data endpoint again. The amberSearch backend overwrites the document in the index as long as the id stays the same.
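For example, re-posting the payload from the first sketch with an unchanged id and updated fields replaces the indexed version (values are illustrative):

# Same "id" as before, so the indexed document is overwritten.
payload["title"] = "Example document (revised)"
payload["last_modified"] = "2022-04-15T10:00:00Z"
requests.post(f"{BASE_URL}/upload_data", json=payload, headers=HEADERS)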
Authorization
There are two ways of handling authorization:
- If all amberSearch users should be eligible to find the documents from the data source, then set "allow_token_document" to ["everyone"]
- Setting up granular authorization levels depends on the identity providers in place in the customer's environment. amberSearch supports OAuth 2.0, OpenID Connect, SAML, and LDAP. Further details should be discussed with the IT team (it@ambersearch.de)
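For illustration, the two options map onto "allow_token_document" as follows (the granular token values are hypothetical and depend on your identity provider setup):

# Visible to all amberSearch users:
payload["allow_token_document"] = ["everyone"]

# Granular access -- illustrative token values only; the real tokens
# depend on your identity provider (discuss with it@ambersearch.de):
payload["allow_token_document"] = ["group_sales", "user_max.mustermann"]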
Examples:
Let's assume you have a SQL database that contains columns with unstructured data. In this scenario, the SQL database stores various items, many of which are electronic items. Additionally, we store a reference to a PDF file containing a further description of each item, which is stored somewhere else within the company. Below are two examples of API calls:
POST /upload_data -F "article_nr_12345.pdf"
{
"id": "sql_database_inhouse_article_nr_12345",
"title": "Notebook - DELL XPS",
"data_source": "sql_database_inhouse",
"data_sourec_sub": "sql_database_inhouse__electronic_items"
"author": ["Max Mustermann"],
"last_modified": "2022-03-28T15:14:35Z",
"document_type: "file",
"knowledge_str_created_date": "2022-01-21T15:14:35Z",
"knowledge_str_created_author": "Eva Musterfrau",
}
POST /delete_data
{
"id: "sql_database_inhouse_article_nr_12345"
}