Class TransformsExcelParser
- All Implemented Interfaces:
Serializable
Parses Excel files and produces a ParseResult including error details, decryption success/failure details, and one or more dataframes.
A user of this library will generally construct exactly one instance of this class as part of their transform code.
Nested Class Summary
Nested Classes
- static final class: A builder for constructing an instance of TransformsExcelParser with customized settings and/or multiple outputs.
Constructor Summary
Constructors
- TransformsExcelParser()
Method Summary
- builder(): Create a builder for constructing an instance of TransformsExcelParser with customized settings and/or multiple outputs.
- protected final void check()
- protected Boolean includeFileModifiedTimestamp(): A default-true setting that controls whether a _file_modified_timestamp column should be included in output dataframes.
- keyToParser(): A mapping between keys (arbitrary strings) and their associated Parser.
- protected Integer maxByteArraySize(): A setting that needs to be set to a large value in order to open large Excel files.
- protected Double minInflateRatio(): A setting that controls the lowest acceptable size ratio of compressed to uncompressed files when attempting to open an xlsx or xlsm file (these file types are actually zip archives).
- static TransformsExcelParser of(Parser parser): Create a TransformsExcelParser with default configuration from a single Parser.
- static TransformsExcelParser of(Parser parser, PasswordProvider passwordProvider): Create a TransformsExcelParser with default configuration from a single Parser and PasswordProvider.
- final ParseResult parse(org.apache.spark.sql.Dataset<com.palantir.spark.binarystream.data.PortableFile> files): Process the input dataset and return a ParseResult.
- protected abstract Optional<PasswordProvider> passwordProvider(): A function to provide a set of passwords to try, given a workbook.
Constructor Details
TransformsExcelParser
public TransformsExcelParser()
Method Details
keyToParser
A mapping between keys (arbitrary strings) and their associated Parser. When there is exactly one Parser, the key does not matter, because you can retrieve the result of parsing from the ParseResult.singleResult() method without considering the key.
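The single-Parser case can be sketched as follows. This is an illustration, not library documentation: the variable sheetParser is a hypothetical, already-configured Parser instance, package imports are omitted, and the assumption that singleResult() yields an Optional wrapping a Spark dataframe may not match your version of the library.

```java
// Sketch only: with exactly one Parser, the key is irrelevant.
// "sheetParser" is a hypothetical Parser configured elsewhere.
TransformsExcelParser excelParser = TransformsExcelParser.of(sheetParser);

ParseResult result = excelParser.parse(files);

// singleResult() retrieves the single parsed output without
// needing to know its key (Optional-style access is assumed here).
result.singleResult().ifPresent(dataframe -> {
    // write the dataframe to your output here
});
```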
passwordProvider
A function to provide a set of passwords to try, given a workbook.
maxByteArraySize
A setting that needs to be set to a large value in order to open large Excel files. The default value used by this library is Integer.MAX_VALUE, and consumers should almost never change it, because failing to process large files is undesirable in most pipelines.
minInflateRatio
A setting that controls the lowest acceptable size ratio of compressed to uncompressed files when attempting to open an xlsx or xlsm file (these file types are actually zip archives). This parameter is used in Apache POI to detect zip bombs (malicious files that, when uncompressed, can be much larger than they appear compressed). In practice, Excel files with a high compression ratio are rarely actual zip bombs, so this library defaults to an arbitrarily low value (0.000000000000001) instead of the usual Apache POI default of 0.01 (a 100x compression ratio).
includeFileModifiedTimestamp
A default-true setting that controls whether a _file_modified_timestamp column should be included in output dataframes. This column is useful for cases where input files can be changed and processing downstream should be incremental. The combination of _file_path and _file_modified_timestamp can then be used as a key to identify records from a unique instance of a file at a moment in time.
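For example, downstream code could treat the two columns as a composite file-instance key when reducing a dataframe to only the most recent version of each file. The sketch below uses standard Spark window functions; only the column names come from this documentation, and the variable parsed is a hypothetical dataframe produced by this library.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;

// "parsed" is a hypothetical output dataframe from this library.
// Keep only rows from the most recent instance of each file, using
// (_file_path, _file_modified_timestamp) as the file-instance key.
Dataset<Row> latestOnly = parsed
    .withColumn("_max_ts",
        max(col("_file_modified_timestamp"))
            .over(Window.partitionBy(col("_file_path"))))
    .filter(col("_file_modified_timestamp").equalTo(col("_max_ts")))
    .drop("_max_ts");
```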
check
@Check protected final void check()
of
Create a TransformsExcelParser with default configuration from a single Parser. This is a convenience method for when there is only one output and configuration options do not need to be customized. Otherwise, use builder().
of
Create a TransformsExcelParser with default configuration from a single Parser and PasswordProvider. This is a convenience method for when there is only one output and configuration options do not need to be customized. Otherwise, use builder().
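Constructing a parser for password-protected workbooks might look like the sketch below. The lambda shape shown for PasswordProvider (workbook in, set of candidate passwords out) is an assumption about that interface, not a documented signature, and sheetParser is a hypothetical Parser instance.

```java
import java.util.Set;

// Sketch only: PasswordProvider supplies a set of passwords to try for a
// given workbook; treating it as a functional interface is an assumption.
PasswordProvider passwords = workbook -> Set.of("password1", "password2");

// "sheetParser" is a hypothetical, already-configured Parser instance.
TransformsExcelParser excelParser =
    TransformsExcelParser.of(sheetParser, passwords);
```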
builder
Create a builder for constructing an instance of TransformsExcelParser with customized settings and/or multiple outputs.
parse
public final ParseResult parse(org.apache.spark.sql.Dataset<com.palantir.spark.binarystream.data.PortableFile> files)
Process the input dataset and return a ParseResult. Because this method takes a Dataset<PortableFile> and not a FoundryInput as input, it is the responsibility of the consumer to implement incremental processing as appropriate (this method is agnostic with respect to whether it is called within an incremental or a snapshot pipeline).
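Inside a transform, a call to parse might be wired up as in the sketch below. The annotation names, the getPortableFiles helper, and the output-writing step are all assumptions standing in for whatever your transforms API version provides; only parse and ParseResult come from this documentation.

```java
// Sketch only: annotations and I/O helpers below are assumptions about
// the surrounding transforms API, not documented signatures.
@Compute
public void myComputeFunction(
        @Input("/path/to/excel-files") FoundryInput myInput,
        @Output("/path/to/parsed") FoundryOutput myOutput) {
    // Hypothetical helper: obtain a Dataset<PortableFile> from the input;
    // the exact call depends on your transforms API version.
    Dataset<PortableFile> files = getPortableFiles(myInput);

    // parse() is agnostic to incremental vs. snapshot semantics, so any
    // incremental filtering of `files` must happen before this call.
    ParseResult result = excelParser.parse(files);

    result.singleResult().ifPresent(dataframe -> {
        // write the dataframe to myOutput here
    });
}
```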