Media set scanning [Beta]

Beta

Media set scanning in Sensitive Data Scanner is in the beta phase of development and may not be available on your enrollment. Functionality may change during active development. Contact Palantir Support to request access to media set scanning in Sensitive Data Scanner.

Sensitive Data Scanner (SDS) can scan media sets for data matching a regular expression (regex). SDS converts the media items in a media set to text, then runs the regex against the extracted text. The text extraction method depends on the type of media set being scanned.

Text extraction methods are as follows:

Media sets can only be scanned with content-only regex match conditions.

Issue match actions are aggregated automatically. For a given media set, even if multiple media items satisfy a given match condition, only a single issue is opened on the media set.
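The aggregation behavior can be illustrated with a short Python sketch. This is a model of the behavior described above, not SDS code: the function `aggregate_issues`, the data layout, the media set and item names, and the email pattern are all hypothetical. It also assumes one issue per media set and match condition pair.

```python
import re


def aggregate_issues(extracted_text_by_set, match_conditions):
    """Return one issue per (media set, match condition) pair that has any match.

    All names here are hypothetical; this only models the aggregation behavior.
    """
    issues = {}
    for media_set, items in extracted_text_by_set.items():
        for condition_name, pattern in match_conditions.items():
            matching_items = [
                item_name
                for item_name, text in items.items()
                if re.search(pattern, text)
            ]
            if matching_items:
                # A single issue is recorded no matter how many items matched.
                issues[(media_set, condition_name)] = matching_items
    return issues


# Hypothetical extracted text for one media set containing three media items.
extracted = {
    "scanned-forms": {
        "form-001.pdf": "Contact: alice@example.com",
        "form-002.pdf": "Contact: bob@example.com",
        "form-003.pdf": "No contact details provided",
    },
}
conditions = {"email": r"[\w.+-]+@[\w-]+\.[\w.-]+"}

issues = aggregate_issues(extracted, conditions)
```

In this sketch, two media items match the condition, but `issues` holds a single entry for the media set.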

Text extraction limitations

OCR and audio transcription may not produce exact replicas of the text in the original media content. For example, OCR may split a single word into two strings, capitalize letters, or incorrectly extract text from images that do not contain text. This can lead to unexpected behavior when matching against a regex, especially if the regex assumes that text will conform to certain formatting or capitalization rules.
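As a rough illustration of why strict patterns can miss OCR output, the Python sketch below compares a strict regex with a more tolerant one. The sample text and both patterns are invented for illustration, and the example uses Python's `re` module. The regex engine used by SDS may differ in syntax and supported flags, so treat the tolerant pattern as an idea to adapt, not a pattern to copy.

```python
import re

# Hypothetical OCR output: "Passport" was split into two strings and capitalized,
# and the nine-digit number was broken into groups.
ocr_text = "PASS PORT NO: 123 456 789"

# A strict pattern that assumes the original formatting and capitalization.
strict = re.compile(r"Passport No: \d{9}")

# A more tolerant pattern: case-insensitive, with optional whitespace inside
# the word and between digits.
tolerant = re.compile(r"pass\s*port\s*no\s*:?\s*(?:\d\s*){9}", re.IGNORECASE)

strict_match = strict.search(ocr_text)      # no match on the OCR output
tolerant_match = tolerant.search(ocr_text)  # matches the OCR output
```

The trade-off is that a looser pattern may also produce more false positives, so it is worth checking it against text actually extracted from your media sets.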

To see the text that SDS ran a regex against, you can create a Pipeline Builder pipeline that takes a media set as input and applies the following transforms for media set types: