Data connectivity & integrationAvailable connectorsSMB

Server Message Block (SMB)

Connect to Server Message Block (SMB) shares to sync data between folders and Foundry datasets. Common examples of SMB servers include Windows File Server and Samba File Server.

Supported capabilities

Capability	Status
Exploration	🟢 Generally available
Bulk import	🟢 Generally available
Incremental	🟢 Generally available
File exports	🟢 Generally available

The SMB connector supports SMB protocol versions 2 and 3.

Data model

The connector can transfer files of any type into Foundry datasets. File formats are preserved, and no schemas are applied during or after the transfer. Apply any necessary schema to the output dataset, or write a downstream transformation to access the data.

Performance and limitations

There is no limit to the size of transferable files. However, network issues can result in failures of large-scale transfers. In particular, direct cloud syncs that take more than two days to run will be interrupted. To avoid network issues, we recommend using smaller file sizes and limiting the number of files that are ingested in every execution of the sync. Syncs can be scheduled to run frequently.

Setup

Open the Data Connection application and select + New Source in the upper right corner of the screen.
Select SMB from the Protocol sources section.
Choose to run the source capabilities on a Foundry worker or on an agent worker.
Follow the additional configuration prompts to continue the setup of your connector using the information in the sections below.

Learn more about setting up a connector in Foundry.

Configuration options

The following configuration options are available for the SMB connector:

Option	Required?	Description
`Hostname`	Yes	The domain name pointing to the server or the IP address of the server.
`Port`	No	The port on which the SMB server is running.
`Share`	Yes	The name of the SMB share you are connecting to.
`Username`	Yes	The SMB login username.
`Password`	Yes	The SMB login password.
`Domain`	No	The Active Directory domain of the SMB login account. Leave blank if the login account is not an AD user.

Networking

The SMB connector must be able to reach Hostname on Port (445 by default). If you are using a direct connection egress policy, you must use TCP-level allowlisting.

Sync data from SMB

The SMB connector uses the file-based sync interface.

Export data to SMB

To export to an SMB share, first enable exports for your SMB connector. Then, create a new export.

Export configuration options

Option	Required?	Default	Description
`Directory path`	Yes	/	The path to the folder in the SMB share where files should be exported. The full path for an exported file is calculated as `<Share>/<Directory Path>/<Exported File Path>`

Use SMB sources in code

You can connect to SMB shares from a Python transforms code repository using external transforms.

Read files and metadata from SMB

The example below demonstrates the minimal code needed to connect to an SMB share and read files:

Copied!1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
import smbclient
import pandas as pd
from transforms.api import Output, transform
from transforms.external.systems import external_systems, Source

DOMAIN = "<smb_domain>"
USERNAME = "<smb_username>"
SMB_PATH = "<smb_path>"

@external_systems(
    smb_source=Source("ri.magritte..source.YOUR_SMB_SOURCE_RID")
)
@transform(
    file_metadata=Output("ri.foundry.main.dataset.YOUR_METADATA_OUTPUT_RID"),
    output_files_dataset=Output("ri.foundry.main.dataset.YOUR_FILES_OUTPUT_RID")

)
def read_smb_files(ctx, smb_source, file_metadata, output_files_dataset):
    """
    Read files from an SMB share and output files and metadata.
    """
    # Configure SMB client with credentials from source
    username = f"{DOMAIN}\\{USERNAME}"
    password = smb_source.get_secret("Password")
    smbclient.ClientConfig(username=username, password=password)

    # List files in a directory
    files_info = []

    for item in smbclient.scandir(SMB_PATH):
        if not item.is_dir():
            # Get file info
            stat = item.stat()

            file_path = f"{SMB_PATH}\\{item.name}"
            files_info.append({
                "filename": item.name,
                "safe_filename": create_safe_filename(item.name),
                "size_bytes": stat.st_size,
                "path": file_path
            })
            with smbclient.open_file(file_path, mode="rb") as f:
                content = f.read()
                safe_foundry_file_name = create_safe_filename(item.name)
                with output_files_dataset.filesystem().open(safe_foundry_file_name, 'w') as fileobj:
                    fileobj.write(content)

    # Write metadata to output
    if files_info:
        df = ctx.spark_session.createDataFrame(pd.DataFrame(files_info))
        file_metadata.write_dataframe(df)


def create_safe_filename(filepath: str) -> str:
    """
    Create a safe filename for storage by removing problematic characters.

    Args:
        filepath: Original file path

    Returns:
        Sanitized filename safe for storage
    """
    import os
    filename = os.path.basename(filepath)
    # Replace characters that might cause issues in storage
    safe_chars = str.maketrans({"\\": "_", "/": "_", ":": "_", "?": "_", "*": "_", "<": "_", ">": "_", "|": "_"})
    return filename.translate(safe_chars)

Read files from SMB and upload as media sets

The comprehensive example below demonstrates how to connect to an SMB share from a Python transform code repository and create media set outputs, including error handling, recursive directory scanning, and file categorization:

Copied!1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
"""
SMB File Processing Transform Template

This template demonstrates how to connect to an SMB share
from a Foundry Python transform to process files and organize them by type.

Key concepts covered:
- External systems integration with SMB sources
- Recursive directory traversal
- File categorization by extension
- MediaSet outputs for different file types
- Structured metadata collection
- Error handling and logging

Prerequisites:
- SMB source configured in Data Connection
- Input dataset with directory paths
- Output datasets/mediasets configured
- smbclient library available in the repository
"""

import logging
import os
from datetime import datetime

import pandas as pd
import smbclient
from transforms.api import Input, Output, transform
from transforms.external.systems import ResolvedSource, Source, external_systems
from transforms.mediasets import MediaSetOutput

# Configure logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)


@external_systems(
    smb_source=Source("ri.magritte..source.YOUR_SMB_SOURCE_RID")  # Replace with your SMB source RID
)
@transform(
    input_directories=Input("ri.foundry.main.dataset.YOUR_INPUT_DATASET_RID"),  # Dataset containing directory paths
    documents_mediaset=MediaSetOutput("ri.mio.main.media-set.YOUR_DOCUMENTS_MEDIASET_RID"),  # For PDF, DOC, PPT files
    images_mediaset=MediaSetOutput("ri.mio.main.media-set.YOUR_IMAGES_MEDIASET_RID"),  # For image files
    spreadsheets_mediaset=MediaSetOutput("ri.mio.main.media-set.YOUR_SPREADSHEETS_MEDIASET_RID"),  # For Excel files
    file_metadata=Output("ri.foundry.main.dataset.YOUR_METADATA_OUTPUT_RID"),  # Structured metadata output
)
def smb_file_processor(
    ctx,
    smb_source: ResolvedSource,
    input_directories,
    documents_mediaset,
    images_mediaset,
    spreadsheets_mediaset,
    file_metadata,
):
    """
    Process files from SMB directories and categorize them into different outputs.

    This transform:
    1. Reads directory paths from input dataset
    2. Connects to SMB share using configured source
    3. Recursively scans directories for files
    4. Categorizes files by extension
    5. Downloads and stores files in appropriate MediaSets
    6. Creates structured metadata for all processed files

    Args:
        ctx: Transform context
        smb_source: Configured SMB source from Data Connection
        input_directories: Dataset with 'path' column containing directory paths to process
        documents_mediaset: MediaSet for document files (PDF, DOC, PPT, TXT)
        images_mediaset: MediaSet for image files (JPG, PNG, etc.)
        spreadsheets_mediaset: MediaSet for spreadsheet files (XLSX)
        file_metadata: Structured dataset for file metadata and processing results
    """

    # Define file categories by extension
    DOCUMENT_EXTENSIONS = {".pdf", ".pptx", ".docx", ".txt", ".ppt", ".doc"}
    IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".webp", ".gif"}
    SPREADSHEET_EXTENSIONS = {".xlsx", ".xls", ".csv"}

    # SMB connection configuration
    # These values should match your SMB source configuration in Data Connection
    smb_config = {
        "hostname": "your-smb-server.domain.com",  # Replace with your SMB server hostname
        "share": "your-share-name",  # Replace with your SMB share name
        "username": "your-service-account",  # Replace with your SMB username
        "password": smb_source.get_secret("Password"),  # Password stored securely in source
        "domain": "your.domain.com",  # Replace with your domain (optional)
    }

    logger.info("=" * 80)
    logger.info("SMB FILE PROCESSOR - STARTING")
    logger.info("=" * 80)
    logger.info(f"Server: {smb_config['hostname']}")
    logger.info(f"Share: {smb_config['share']}")
    logger.info(f"Username: {smb_config['domain']}\\{smb_config['username']}")
    logger.info("=" * 80)

    def transform_directory_path(input_path: str) -> str:
        """
        Transform input directory path to SMB UNC format.

        Modify this function based on your specific path transformation needs.
        Example: Convert from local mount path to UNC path format.

        Args:
            input_path: Original directory path from input dataset

        Returns:
            SMB UNC path format (\\\\server\\share\\path)
        """
        # Example transformation - customize for your environment
        # Remove local mount prefix and convert to UNC format
        local_prefix = "/mnt/your-mount/path/"  # Replace with your local mount prefix

        if input_path.startswith(local_prefix):
            relative_path = input_path[len(local_prefix) :]
        else:
            # Handle alternative path formats
            logger.warning(f"Unexpected path format: {input_path}")
            relative_path = input_path

        # Convert to UNC format
        unc_path = f"\\\\{smb_config['hostname']}\\{smb_config['share']}\\{relative_path}"
        return unc_path.replace("/", "\\")  # Ensure Windows path separators

    def get_file_category(filename: str) -> str:
        """
        Determine file category based on file extension.

        Args:
            filename: Name of the file

        Returns:
            Category string: 'document', 'image', 'spreadsheet', or 'other'
        """
        ext = os.path.splitext(filename)[1].lower()

        if ext in DOCUMENT_EXTENSIONS:
            return "document"
        elif ext in IMAGE_EXTENSIONS:
            return "image"
        elif ext in SPREADSHEET_EXTENSIONS:
            return "spreadsheet"
        else:
            return "other"

    def create_safe_filename(filepath: str) -> str:
        """
        Create a safe filename for storage by removing problematic characters.

        Args:
            filepath: Original file path

        Returns:
            Sanitized filename safe for storage
        """
        filename = os.path.basename(filepath)
        # Replace characters that might cause issues in storage
        safe_chars = str.maketrans({"\\": "_", "/": "_", ":": "_", "?": "_", "*": "_", "<": "_", ">": "_", "|": "_"})
        return filename.translate(safe_chars)

    def get_all_files_recursive(directory_path: str) -> list:
        """
        Recursively scan directory and return all files with their metadata.

        Args:
            directory_path: SMB directory path to scan

        Returns:
            List of dictionaries containing file information
        """
        all_files = []

        try:
            # Scan directory contents
            items = list(smbclient.scandir(directory_path))

            for item in items:
                if item.is_dir():
                    # Recursively process subdirectories
                    subdirectory_path = f"{directory_path}\\{item.name}"
                    logger.info(f"Scanning subdirectory: {subdirectory_path}")
                    all_files.extend(get_all_files_recursive(subdirectory_path))
                else:
                    # Add file information
                    file_info = {"item": item, "directory_path": directory_path}
                    all_files.append(file_info)

        except Exception as e:
            logger.error(f"Error scanning directory {directory_path}: {str(e)}")

        return all_files

    # Initialize tracking variables
    metadata_records = []
    total_files_processed = 0
    files_by_category = {"document": 0, "image": 0, "spreadsheet": 0, "other": 0}

    try:
        # Configure SMB client authentication
        username = f"{smb_config['domain']}\\{smb_config['username']}"
        smbclient.ClientConfig(username=username, password=smb_config["password"])

        # Read input directories from dataset
        input_df = input_directories.dataframe()
        directory_rows = input_df.select("path").collect()
        directory_paths = [row.path for row in directory_rows]

        logger.info(f"Processing {len(directory_paths)} directories")

        # Process each directory
        for idx, directory_path in enumerate(directory_paths, 1):
            logger.info(f"[{idx}/{len(directory_paths)}] Processing: {directory_path}")

            try:
                # Transform to SMB path format
                smb_directory_path = transform_directory_path(directory_path)
                logger.info(f"SMB path: {smb_directory_path}")

                # Get all files recursively
                all_files_info = get_all_files_recursive(smb_directory_path)
                logger.info(f"Found {len(all_files_info)} files in directory tree")

                # Process each file
                for file_info in all_files_info:
                    file_item = file_info["item"]
                    file_directory_path = file_info["directory_path"]
                    filename = file_item.name
                    file_path = f"{file_directory_path}\\{filename}"

                    try:
                        # Get file statistics
                        stat_info = file_item.stat()
                        file_size = stat_info.st_size
                        created_time = datetime.fromtimestamp(stat_info.st_ctime)
                        modified_time = datetime.fromtimestamp(stat_info.st_mtime)

                        # Categorize file
                        category = get_file_category(filename)
                        files_by_category[category] += 1
                        total_files_processed += 1

                        # Create metadata record
                        metadata_record = {
                            "filename": filename,
                            "full_path": file_path,
                            "directory_path": file_directory_path,
                            "file_size_bytes": file_size,
                            "created_date": created_time,
                            "modified_date": modified_time,
                            "file_extension": os.path.splitext(filename)[1].lower(),
                            "category": category,
                            "processed_timestamp": pd.Timestamp.now(),
                        }

                        # Download and store files based on category
                        if category in ["document", "image", "spreadsheet"]:
                            try:
                                # Download file content
                                with smbclient.open_file(file_path, mode="rb") as smb_file:
                                    file_content = smb_file.read()

                                # Create safe filename for storage
                                safe_filename = create_safe_filename(file_path)

                                # Store in appropriate MediaSet
                                import io

                                file_stream = io.BytesIO(file_content)

                                if category == "document":
                                    documents_mediaset.put_media_item(file_stream, safe_filename)
                                elif category == "image":
                                    images_mediaset.put_media_item(file_stream, safe_filename)
                                elif category == "spreadsheet":
                                    spreadsheets_mediaset.put_media_item(file_stream, safe_filename)

                                logger.info(f"✓ Downloaded {category}: {filename} ({file_size:,} bytes)")

                            except Exception as download_error:
                                logger.error(f"✗ Failed to download {filename}: {str(download_error)}")
                                metadata_record["download_error"] = str(download_error)
                        else:
                            logger.info(f"⚠ Skipped {filename} (category: {category})")

                        metadata_records.append(metadata_record)

                    except Exception as file_error:
                        logger.error(f"✗ Error processing file {filename}: {str(file_error)}")
                        # Add error record
                        error_record = {
                            "filename": filename,
                            "full_path": file_path,
                            "directory_path": file_directory_path,
                            "file_size_bytes": 0,
                            "created_date": None,
                            "modified_date": None,
                            "file_extension": os.path.splitext(filename)[1].lower(),
                            "category": "error",
                            "processed_timestamp": pd.Timestamp.now(),
                            "processing_error": str(file_error),
                        }
                        metadata_records.append(error_record)

            except Exception as dir_error:
                logger.error(f"✗ Failed to process directory {directory_path}: {str(dir_error)}")
                # Add directory error record
                error_record = {
                    "filename": None,
                    "full_path": None,
                    "directory_path": directory_path,
                    "file_size_bytes": 0,
                    "created_date": None,
                    "modified_date": None,
                    "file_extension": None,
                    "category": "directory_error",
                    "processed_timestamp": pd.Timestamp.now(),
                    "processing_error": str(dir_error),
                }
                metadata_records.append(error_record)

        # Write metadata to output dataset
        if metadata_records:
            metadata_df = ctx.spark_session.createDataFrame(pd.DataFrame(metadata_records))
            file_metadata.write_dataframe(metadata_df)
            logger.info(f"✓ Written {len(metadata_records)} metadata records")
        else:
            logger.warning("No metadata records to write")

    except Exception as e:
        logger.error(f"✗ CRITICAL FAILURE: {str(e)}")
        raise

    # Log final summary
    logger.info("\n" + "=" * 80)
    logger.info("PROCESSING SUMMARY")
    logger.info("=" * 80)
    logger.info(f"Total files processed: {total_files_processed:,}")
    logger.info(f"Documents: {files_by_category['document']:,}")
    logger.info(f"Images: {files_by_category['image']:,}")
    logger.info(f"Spreadsheets: {files_by_category['spreadsheet']:,}")
    logger.info(f"Other/Skipped: {files_by_category['other']:,}")
    logger.info(f"Metadata records created: {len(metadata_records):,}")
    logger.info("=" * 80)