Introduction to {syncdr}
File Handling, Directory Comparison & Synchronization in R
Rossana Tatulli
Why {syncdr}?
{syncdr} is an R package for handling and synchronizing files and directories. Its primary objectives are:
- To provide a clear snapshot of the content and synchronization status of two directories under comparison, including their tree structure, their common files, and the files that are exclusive to either directory.
- To make file organization and management in R easier, i.e., enabling content-based and modification date-based file comparisons, as well as facilitating tasks such as duplicate identification, file copying, moving, and deletion.
💡 This article does not offer a comprehensive overview of {syncdr} functionalities. Rather, it provides a sample workflow for working with the package’s main functions. After familiarizing yourself with this general workflow, read the articles throughout the rest of this website: they explore all features of {syncdr} in a structured way.
Synchronizing with {syncdr}
Learn how to work with {syncdr} and compare and synchronize directories in R
Suppose you are working with two directories, let’s call them `left` and `right`, each containing certain files and folders/sub-folders.
Let’s first call the syncdr function `toy_dirs()`. This generates two toy directories in the `.syncdrenv` environment, say `left` and `right`, that we can use to showcase syncdr functionalities.
# Create syncdr env with left and right directories
.syncdrenv <- toy_dirs()
#> ■■■■■■■■■ 27% | ETA: 8s
#> ■■■■■■■■■■■■■■■■■■■ 60% | ETA: 5s
# Get left and right directories' paths
left <- .syncdrenv$left
right <- .syncdrenv$right
You can start by quickly comparing the two directories’ tree structure by calling `display_dir_tree()`. By default, it fully recurses, i.e., it shows the directory tree of all sub-directories. However, you can also specify the number of levels to recurse using the `recurse` argument.
# Visualize left and right directories' tree structure
display_dir_tree(path_left = left,
path_right = right)
#> (←)Left directory structure:
#> /tmp/RtmphkzTu8/left
#> ├── A
#> │   ├── A1.Rds
#> │   ├── A2.Rds
#> │   └── A3.Rds
#> ├── B
#> │   ├── B1.Rds
#> │   ├── B2.Rds
#> │   └── B3.Rds
#> ├── C
#> │   ├── C1.Rds
#> │   ├── C2.Rds
#> │   └── C3.Rds
#> ├── D
#> │   ├── D1.Rds
#> │   └── D2.Rds
#> └── E
#> (→)Right directory structure:
#> /tmp/RtmphkzTu8/right
#> ├── A
#> ├── B
#> │   ├── B1.Rds
#> │   └── B2.Rds
#> ├── C
#> │   ├── C1.Rds
#> │   ├── C1_duplicate.Rds
#> │   ├── C2.Rds
#> │   └── C3.Rds
#> ├── D
#> │   ├── D1.Rds
#> │   ├── D2.Rds
#> │   └── D3.Rds
#> └── E
#>     ├── E1.Rds
#>     ├── E2.Rds
#>     └── E3.Rds
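To build intuition for what limiting the recursion depth means, here is a minimal base R sketch. The helper `list_to_depth()` is hypothetical and not part of syncdr; whether the `recurse` argument counts levels in exactly this way is an assumption.

```r
# Hypothetical helper (not part of syncdr): list files and folders under
# `path`, keeping only entries at most `depth` levels below it.
list_to_depth <- function(path, depth = 1) {
  entries <- list.files(path, recursive = TRUE, include.dirs = TRUE)
  # Relative paths use "/" as separator, so depth = number of components
  n_levels <- lengths(strsplit(entries, "/", fixed = TRUE))
  entries[n_levels <= depth]
}

# Usage on a throwaway directory
demo_dir <- tempfile("tree_")
dir.create(file.path(demo_dir, "A"), recursive = TRUE)
dir.create(file.path(demo_dir, "E"))
file.create(file.path(demo_dir, "A", "A1.Rds"))

depth_1 <- list_to_depth(demo_dir, depth = 1)  # top-level folders only
depth_2 <- list_to_depth(demo_dir, depth = 2)  # folders plus their files
```

With `depth = 1` only the top-level folders are kept, while `depth = 2` also keeps the files inside them.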
Step 1: Compare Directories
The most important function in syncdr is `compare_directories()`. It takes the paths of the left and right directories and compares them to determine their synchronization status (see below). This function is the backbone of syncdr: you can use the `syncdr_status` object it generates both:
- to inspect the synchronization status of files present in both directories as well as those exclusive to either directory
- as the input for all other functions within syncdr that allow synchronization between the directories under comparison.
Before diving into the resulting `syncdr_status` object, note that `compare_directories()` lets you compare directories in 3 ways:
- By date only (the default): `by_date = TRUE` by default, so files in both directories are compared based on the date of last modification.
sync_status (all common files) |
---|
older in left, newer in right dir |
newer in left, older in right dir |
same date |
- By date and content. This is done by specifying `by_content = TRUE` (`by_date = TRUE` by default unless explicitly set to `FALSE`). Files are first compared by date, and then only those that are newer in either directory are compared by content.
sync_status (common files that are newer in either left or right, i.e., not of same date) |
---|
different content |
same content |
- By content only, by specifying `by_date = FALSE` and `by_content = TRUE`. This option is discouraged, however: comparing all files’ contents can be slow and computationally expensive.
sync_status (all common files) |
---|
different content |
same content |
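The two-stage idea behind the date and content comparison can be sketched in base R. This is an illustration only, not syncdr's implementation: `classify_common()` is a hypothetical helper, and the use of MD5 hashes via `tools::md5sum()` is an assumption about how contents might be compared.

```r
# Hypothetical helper (not part of syncdr): classify files common to two
# directories by modification time and, optionally, by content hash.
classify_common <- function(left, right, by_content = FALSE) {
  common <- intersect(list.files(left,  recursive = TRUE),
                      list.files(right, recursive = TRUE))

  t_left  <- file.mtime(file.path(left,  common))
  t_right <- file.mtime(file.path(right, common))

  out <- data.frame(
    path     = common,
    modified = ifelse(t_left > t_right, "left",
               ifelse(t_left < t_right, "right", "same date"))
  )

  if (by_content) {
    # Hash only the files whose dates differ (assumption: MD5 is good enough)
    differs <- out$modified != "same date"
    h_left  <- unname(tools::md5sum(file.path(left,  common[differs])))
    h_right <- unname(tools::md5sum(file.path(right, common[differs])))
    out$sync_status <- NA_character_
    out$sync_status[differs] <- ifelse(h_left == h_right,
                                       "same content", "different content")
  }
  out
}

# Usage on two throwaway directories holding the same file
demo_left  <- tempfile("left_")
demo_right <- tempfile("right_")
dir.create(demo_left)
dir.create(demo_right)
writeLines("a", file.path(demo_left,  "x.txt"))
writeLines("a", file.path(demo_right, "x.txt"))

# Make the right copy one minute newer than the left copy
t0 <- Sys.time()
Sys.setFileTime(file.path(demo_left,  "x.txt"), t0)
Sys.setFileTime(file.path(demo_right, "x.txt"), t0 + 60)

res <- classify_common(demo_left, demo_right, by_content = TRUE)
```

Here the file is newer on the right but has identical content, which is the kind of case the content check is meant to catch.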
Also, regardless of which options you choose, the sync_status of files that are exclusive to either directory is determined as:
sync_status (non common files) |
---|
only in left |
only in right |
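Conceptually, the non common files are a set difference on relative paths. A minimal base R sketch, using a hypothetical helper `exclusive_files()` that is not part of syncdr:

```r
# Hypothetical helper (not part of syncdr): relative paths that exist on
# one side only.
exclusive_files <- function(left, right) {
  rel_left  <- list.files(left,  recursive = TRUE)
  rel_right <- list.files(right, recursive = TRUE)
  list(only_in_left  = setdiff(rel_left,  rel_right),
       only_in_right = setdiff(rel_right, rel_left))
}

# Usage on two throwaway directories
ex_left  <- tempfile("left_")
ex_right <- tempfile("right_")
dir.create(ex_left)
dir.create(ex_right)
file.create(file.path(ex_left,  c("a.txt", "shared.txt")))
file.create(file.path(ex_right, c("b.txt", "shared.txt")))

excl <- exclusive_files(ex_left, ex_right)
```

Files present on both sides (here `shared.txt`) appear in neither element.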
Let’s now take a closer look at the output of `compare_directories()`, which is intended to contain comprehensive information on the directories under comparison. This is a list of class `syncdr_status`, containing 4 elements: (1) common files, (2) non common files, (3) left path and (4) right path.
1. Comparing by date
# Compare by date only (the default)
sync_status_date <- compare_directories(left,
right)
sync_status_date
#>
#> ── Synchronization Summary ─────────────────────────────────────────────────────
#> • Left Directory: /tmp/RtmphkzTu8/left
#> • Right Directory: /tmp/RtmphkzTu8/right
#> • Total Common Files: 7
#> • Total Non-common Files: 9
#> • Compare files by: date
#>
#> ── Common files ────────────────────────────────────────────────────────────────
#> path modification_time_left modification_time_right modified
#> 1 /left/B/B1.Rds 2024-11-04 20:18:10 2024-11-04 20:18:11 right
#> 2 /left/B/B2.Rds 2024-11-04 20:18:13 2024-11-04 20:18:14 right
#> 3 /left/C/C1.Rds 2024-11-04 20:18:11 2024-11-04 20:18:17 right
#> 4 /left/C/C2.Rds 2024-11-04 20:18:14 2024-11-04 20:18:15 right
#> 5 /left/C/C3.Rds 2024-11-04 20:18:16 2024-11-04 20:18:17 right
#> 6 /left/D/D1.Rds 2024-11-04 20:18:13 2024-11-04 20:18:12 left
#> 7 /left/D/D2.Rds 2024-11-04 20:18:16 2024-11-04 20:18:15 left
#>
#> ── Non-common files ────────────────────────────────────────────────────────────
#>
#> ── Only in left ──
#>
#> # A tibble: 4 × 1
#> path_left
#> <fs::path>
#> 1 /left/A/A1.Rds
#> 2 /left/A/A2.Rds
#> 3 /left/A/A3.Rds
#> 4 /left/B/B3.Rds
#> ── Only in right ──
#> # A tibble: 5 × 1
#> path_right
#> <fs::path>
#> 1 /right/C/C1_duplicate.Rds
#> 2 /right/D/D3.Rds
#> 3 /right/E/E1.Rds
#> 4 /right/E/E2.Rds
#> 5 /right/E/E3.Rds
2. Comparing by date and content
# Compare by date and content
sync_status_date_content <- compare_directories(left,
right,
by_content = TRUE)
sync_status_date_content
#>
#> ── Synchronization Summary ─────────────────────────────────────────────────────
#> • Left Directory: /tmp/RtmphkzTu8/left
#> • Right Directory: /tmp/RtmphkzTu8/right
#> • Total Common Files: 7
#> • Total Non-common Files: 9
#> • Compare files by: date & content
#>
#> ── Common files ────────────────────────────────────────────────────────────────
#> path modification_time_left modification_time_right modified
#> 1 /left/B/B1.Rds 2024-11-04 20:18:10 2024-11-04 20:18:11 right
#> 2 /left/B/B2.Rds 2024-11-04 20:18:13 2024-11-04 20:18:14 right
#> 3 /left/C/C1.Rds 2024-11-04 20:18:11 2024-11-04 20:18:17 right
#> 4 /left/C/C2.Rds 2024-11-04 20:18:14 2024-11-04 20:18:15 right
#> 5 /left/C/C3.Rds 2024-11-04 20:18:16 2024-11-04 20:18:17 right
#> 6 /left/D/D1.Rds 2024-11-04 20:18:13 2024-11-04 20:18:12 left
#> 7 /left/D/D2.Rds 2024-11-04 20:18:16 2024-11-04 20:18:15 left
#> sync_status
#> 1 different content
#> 2 different content
#> 3 same content
#> 4 different content
#> 5 different content
#> 6 different content
#> 7 different content
#>
#> ── Non-common files ────────────────────────────────────────────────────────────
#>
#> ── Only in left ──
#>
#> # A tibble: 4 × 1
#> path_left
#> <fs::path>
#> 1 /left/A/A1.Rds
#> 2 /left/A/A2.Rds
#> 3 /left/A/A3.Rds
#> 4 /left/B/B3.Rds
#> ── Only in right ──
#> # A tibble: 5 × 1
#> path_right
#> <fs::path>
#> 1 /right/C/C1_duplicate.Rds
#> 2 /right/D/D3.Rds
#> 3 /right/E/E1.Rds
#> 4 /right/E/E2.Rds
#> 5 /right/E/E3.Rds
3. Comparing by content only
# Compare by content only
sync_status_content <- compare_directories(left,
right,
by_date = FALSE,
by_content = TRUE)
sync_status_content
#>
#> ── Synchronization Summary ─────────────────────────────────────────────────────
#> • Left Directory: /tmp/RtmphkzTu8/left
#> • Right Directory: /tmp/RtmphkzTu8/right
#> • Total Common Files: 7
#> • Total Non-common Files: 9
#> • Compare files by: content
#>
#> ── Common files ────────────────────────────────────────────────────────────────
#> path sync_status
#> 1 /left/B/B1.Rds different content
#> 2 /left/B/B2.Rds different content
#> 3 /left/C/C1.Rds same content
#> 4 /left/C/C2.Rds different content
#> 5 /left/C/C3.Rds different content
#> 6 /left/D/D1.Rds different content
#> 7 /left/D/D2.Rds different content
#>
#> ── Non-common files ────────────────────────────────────────────────────────────
#>
#> ── Only in left ──
#>
#> # A tibble: 4 × 1
#> path_left
#> <fs::path>
#> 1 /left/A/A1.Rds
#> 2 /left/A/A2.Rds
#> 3 /left/A/A3.Rds
#> 4 /left/B/B3.Rds
#> ── Only in right ──
#> # A tibble: 5 × 1
#> path_right
#> <fs::path>
#> 1 /right/C/C1_duplicate.Rds
#> 2 /right/D/D3.Rds
#> 3 /right/E/E1.Rds
#> 4 /right/E/E2.Rds
#> 5 /right/E/E3.Rds
*️⃣ Comparing directories with `verbose = TRUE`
When calling `compare_directories()`, you have the option to enable verbose mode by setting `verbose = TRUE`. This will display both directories’ tree structure and, when comparing files by content, provide progress updates including the time spent hashing the files.
compare_directories(left,
right,
by_date = FALSE,
by_content = TRUE,
verbose = TRUE)
#> ✔ cli-147-153
#> ✔ B1.Rds [5ms]
#>
#> ✔ cli-147-153
#> ✔ B2.Rds [4ms]
#>
#> ✔ cli-147-153
#> ✔ C1.Rds [4ms]
#>
#> ✔ cli-147-153
#> ✔ C2.Rds [4ms]
#>
#> ✔ cli-147-153
#> ✔ C3.Rds [4ms]
#>
#> ✔ cli-147-153
#> ✔ D1.Rds [4ms]
#>
#> ✔ cli-147-153
#> ✔ D2.Rds [4ms]
#>
#> ── Hashing completed! Total time spent: 0.08494329 secs ──
#>
#> ✔ cli-147-199
#> ✔ B1.Rds [4ms]
#>
#> ✔ cli-147-199
#> ✔ B2.Rds [4ms]
#>
#> ✔ cli-147-199
#> ✔ C1.Rds [4ms]
#>
#> ✔ cli-147-199
#> ✔ C2.Rds [4ms]
#>
#> ✔ cli-147-199
#> ✔ C3.Rds [7ms]
#>
#> ✔ cli-147-199
#> ✔ D1.Rds [4ms]
#>
#> ✔ cli-147-199
#> ✔ D2.Rds [4ms]
#>
#> ── Hashing completed! Total time spent: 0.07256794 secs ──
#>
#> (←)Left directory structure:
#> /tmp/RtmphkzTu8/left
#> ├── A
#> │   ├── A1.Rds
#> │   ├── A2.Rds
#> │   └── A3.Rds
#> ├── B
#> │   ├── B1.Rds
#> │   ├── B2.Rds
#> │   └── B3.Rds
#> ├── C
#> │   ├── C1.Rds
#> │   ├── C2.Rds
#> │   └── C3.Rds
#> ├── D
#> │   ├── D1.Rds
#> │   └── D2.Rds
#> └── E
#> (→)Right directory structure:
#> /tmp/RtmphkzTu8/right
#> ├── A
#> ├── B
#> │   ├── B1.Rds
#> │   └── B2.Rds
#> ├── C
#> │   ├── C1.Rds
#> │   ├── C1_duplicate.Rds
#> │   ├── C2.Rds
#> │   └── C3.Rds
#> ├── D
#> │   ├── D1.Rds
#> │   ├── D2.Rds
#> │   └── D3.Rds
#> └── E
#>     ├── E1.Rds
#>     ├── E2.Rds
#>     └── E3.Rds
#> ── Synchronization Summary ─────────────────────────────────────────────────────
#> • Left Directory: /tmp/RtmphkzTu8/left
#> • Right Directory: /tmp/RtmphkzTu8/right
#> • Total Common Files: 7
#> • Total Non-common Files: 9
#> • Compare files by: content
#>
#> ── Common files ────────────────────────────────────────────────────────────────
#> path sync_status
#> 1 /left/B/B1.Rds different content
#> 2 /left/B/B2.Rds different content
#> 3 /left/C/C1.Rds same content
#> 4 /left/C/C2.Rds different content
#> 5 /left/C/C3.Rds different content
#> 6 /left/D/D1.Rds different content
#> 7 /left/D/D2.Rds different content
#>
#> ── Non-common files ────────────────────────────────────────────────────────────
#>
#> ── Only in left ──
#>
#> # A tibble: 4 × 1
#> path_left
#> <fs::path>
#> 1 /left/A/A1.Rds
#> 2 /left/A/A2.Rds
#> 3 /left/A/A3.Rds
#> 4 /left/B/B3.Rds
#> ── Only in right ──
#> # A tibble: 5 × 1
#> path_right
#> <fs::path>
#> 1 /right/C/C1_duplicate.Rds
#> 2 /right/D/D3.Rds
#> 3 /right/E/E1.Rds
#> 4 /right/E/E2.Rds
#> 5 /right/E/E3.Rds
Step 2: Visualize Synchronization Status
The best way to read through the output of `compare_directories()` is by visualizing it with the `display_sync_status()` function.
For example, let’s visualize the sync status of common files in the left and right directories, when compared by date:
display_sync_status(sync_status_date$common_files,
left_path = left,
right_path = right)
Or, let’s display the sync status of non common files:
display_sync_status(sync_status_date$non_common_files,
left_path = left,
right_path = right)
Step 3: Synchronize directories
syncdr enables users to perform different actions such as copying, moving, and deleting files using specific synchronization functions. Refer to the `vignette("asymmetric-synchronization")` and `vignette("symmetric-synchronization")` articles for detailed information.
For the purpose of this general demonstration, we will perform a “full asymmetric synchronization to right”. This specific function executes the following:
- On common files:
  - If by date only (`by_date = TRUE`): copy files that are newer in the left directory to the right directory.
  - If by date and content (`by_date = TRUE` and `by_content = TRUE`): copy files that are newer and different in the left directory to the right directory.
  - If by content only (`by_date = FALSE` and `by_content = TRUE`): copy files that are different in the left directory to the right directory.
- On non common files:
  - Copy to the right directory those files that exist only in the left directory.
  - Delete from the right directory those files that are exclusive to the right directory (i.e., missing in the left directory).
# Compare directories
sync_status <- compare_directories(left,
right,
by_date = TRUE)
# Synchronize directories
full_asym_sync_to_right(sync_status = sync_status)
#> ✔ synchronized
#>
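To make the by-date steps listed above concrete, here is a rough base R sketch of the same logic. It is not how `full_asym_sync_to_right()` is implemented: `asym_sync_to_right()` is a hypothetical helper written only for illustration, and it covers the by-date case alone.

```r
# Hypothetical sketch (not syncdr's implementation) of a by-date
# "full asymmetric synchronization to right".
asym_sync_to_right <- function(left, right) {
  rel_left  <- list.files(left,  recursive = TRUE)
  rel_right <- list.files(right, recursive = TRUE)

  common     <- intersect(rel_left, rel_right)
  newer_left <- common[file.mtime(file.path(left,  common)) >
                       file.mtime(file.path(right, common))]
  only_left  <- setdiff(rel_left,  rel_right)
  only_right <- setdiff(rel_right, rel_left)

  # Copy files that are newer in left, plus files that exist only in left
  to_copy <- c(newer_left, only_left)
  if (length(to_copy) > 0) {
    # Create any missing sub-folders on the right before copying
    for (d in unique(dirname(file.path(right, to_copy)))) {
      dir.create(d, recursive = TRUE, showWarnings = FALSE)
    }
    file.copy(file.path(left, to_copy), file.path(right, to_copy),
              overwrite = TRUE, copy.date = TRUE)
  }

  # Delete files that exist only in right
  if (length(only_right) > 0) {
    file.remove(file.path(right, only_right))
  }

  invisible(list(copied = to_copy, deleted = only_right))
}

# Usage on two throwaway directories
s_left  <- tempfile("left_")
s_right <- tempfile("right_")
dir.create(s_left)
dir.create(s_right)
writeLines("new", file.path(s_left,  "a.txt"))
writeLines("old", file.path(s_right, "a.txt"))
writeLines("b",   file.path(s_left,  "b.txt"))
writeLines("c",   file.path(s_right, "c.txt"))

# Make the left copy of a.txt one minute newer than the right copy
t0 <- Sys.time()
Sys.setFileTime(file.path(s_right, "a.txt"), t0)
Sys.setFileTime(file.path(s_left,  "a.txt"), t0 + 60)

actions <- asym_sync_to_right(s_left, s_right)
```

After the call, the right directory mirrors the left one: `a.txt` is overwritten with the newer left version, `b.txt` is copied over, and `c.txt` is deleted. Because this kind of synchronization deletes files, it is worth trying it on copies or toy directories first.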