Thursday, May 1, 2025

A Tale of Computer Programming: LLM to generate application

MODEL:

Gemini 2.5 Pro Preview 03-25 - Google AI Studio

PROMPTS:

<prompt 1>

As an expert Java developer, generate complete app to look for files with the same size and SHA sum, in two different folders of a windows disk. Consider that the files could have different names, and/or extensions, and/or sizes, and/or different dates. The output should show what are these files, and all their locations.

</prompt 1>

<prompt 2>

Generate a new version of the code to improve usability by adding the following options to the command line:
--type or -t : Check only files of this type. Examples: -t PDF or --type CSV
--size or -s: Check only files with size greater than specified value. Use "M" or "m" to indicate megabytes. Examples: --size 10M -s 10m
In addition, and if the typed command does not include the directories to be compared, add the code to ask for both directories

</prompt 2>

THOUGHTS 1

Here's a breakdown of the thought process to generate the Java application for finding duplicate files based on size and SHA-256 hash:

  1. Understand the Core Requirement: The goal is to find files that are identical in content across two specified directories, even if their names, extensions, or timestamps differ. The primary identifiers for identical content will be file size and a cryptographic hash (SHA-256 is a good choice).

  2. Identify Key Tasks:

    • Get the two directory paths from the user.

    • Recursively scan the first directory.

    • For each file in the first directory, store its size and calculate its SHA-256 hash.

    • Store this information efficiently (size -> hash -> list of file paths).

    • Recursively scan the second directory.

    • For each file in the second directory, get its size and calculate its SHA-256 hash.

    • Check if a file with the same size and same hash exists in the data collected from the first directory.

    • If a match is found, record it as a duplicate pair (original file path, matching file path).

    • Present the results clearly to the user.

    • Handle potential errors (invalid paths, I/O errors, security exceptions).

  3. Choose the Right Tools/Classes:

    • File System Interaction: java.nio.file.Path, java.nio.file.Paths, java.nio.file.Files, java.nio.file.DirectoryStream, and java.io.File (though Path is generally preferred for modern Java). Files.walkFileTree is excellent for recursive directory traversal.

    • Hashing: java.security.MessageDigest for SHA-256 calculation. java.io.InputStream / java.io.FileInputStream to read file content for hashing.

    • Data Structures:

      • Need a way to quickly look up files based on size first (as it's cheaper to compute than the hash). A Map<Long, ...> seems appropriate, where the key is the file size.

      • For files with the same size, we need to store their hashes and corresponding paths. Inside the size map, another Map<String, List<Path>> seems suitable, where the key is the SHA-256 hash (as a String) and the value is a list of Path objects having that size and hash. This handles potential hash collisions (though unlikely with SHA-256) and multiple files with the same content within the first directory itself.

      • To store the final results (duplicates found), a List<DuplicateInfo> or similar structure would work. A simple Map<Path, Path> or List<Pair<Path, Path>> could also work, but a dedicated class might be clearer. Let's go with List<Pair<Path, Path>> for simplicity initially, representing (file_in_dir1, file_in_dir2).

    • User Input: java.util.Scanner for reading directory paths from the console.

    • Error Handling: try-catch blocks for IOException, NoSuchAlgorithmException, and SecurityException.
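
A minimal sketch of that two-level index, using a hypothetical indexFile helper (calculateSHA256 is planned further below):

      // Nested index: file size -> SHA-256 hex string -> all paths with that size and hash.
      // Needs java.nio.file.*, java.util.*, java.io.IOException, java.security.NoSuchAlgorithmException.
      Map<Long, Map<String, List<Path>>> filesBySizethenHash = new HashMap<>();

      void indexFile(Path file) throws IOException, NoSuchAlgorithmException {
          long size = Files.size(file);
          String hash = calculateSHA256(file); // sketched later in this plan
          filesBySizethenHash
              .computeIfAbsent(size, s -> new HashMap<>())
              .computeIfAbsent(hash, h -> new ArrayList<>())
              .add(file);
      }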

  4. Develop the Algorithm (Step-by-Step):

    •  main method:

      • Create a Scanner to get input.

      • Prompt for and read the first directory path. Validate it (exists, is a directory).

      • Prompt for and read the second directory path. Validate it.

      • Instantiate the main logic class (e.g., DuplicateFileFinder).

      • Call the method to find duplicates, passing the paths.

      • Print the results.

      • Close the scanner.

    •  findDuplicates method:

      • Initialize the primary data structure: Map<Long, Map<String, List<Path>>> filesBySizethenHash = new HashMap<>();

      • Call a helper method scanDirectory(dir1Path, filesBySizethenHash) to populate the map from the first directory.

      • Initialize a result list: List<Pair<Path, Path>> duplicates = new ArrayList<>();

      • Call another helper method findMatches(dir2Path, filesBySizethenHash, duplicates) to scan the second directory and find matches.

      • Return the duplicates list.
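
Put together, the orchestration could be sketched like this (method names as planned above; not the final generated code):

      List<Pair<Path, Path>> findDuplicates(Path dir1Path, Path dir2Path) throws IOException {
          // Index every file in the first directory by (size, hash).
          Map<Long, Map<String, List<Path>>> filesBySizethenHash = new HashMap<>();
          scanDirectory(dir1Path, filesBySizethenHash);

          // Walk the second directory and collect (dir1 file, dir2 file) matches.
          List<Pair<Path, Path>> duplicates = new ArrayList<>();
          findMatches(dir2Path, filesBySizethenHash, duplicates);
          return duplicates;
      }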

    •  scanDirectory method (or use Files.walkFileTree):  

      • Use Files.walkFileTree with a custom SimpleFileVisitor.

      • Inside visitFile:

        • Check if it's a regular file (Files.isRegularFile).

        • Get the file size (Files.size).

        • Calculate the SHA-256 hash (call calculateSHA256). Handle potential exceptions during hashing.

        • If hashing is successful:

          • Use computeIfAbsent on the outer map (size) to get or create the inner map.

          • Use computeIfAbsent on the inner map (hash) to get or create the list of paths.

          • Add the current file's Path to the list.

      • Handle visitFileFailed appropriately (e.g., print a warning).
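
Such a visitor might look roughly like this (reusing the hypothetical indexFile helper sketched earlier):

      Files.walkFileTree(dir1Path, new SimpleFileVisitor<Path>() {
          @Override
          public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
              if (attrs.isRegularFile()) {
                  try {
                      indexFile(file); // size + SHA-256 into the nested map
                  } catch (Exception e) {
                      System.err.println("WARNING: could not hash " + file + ": " + e.getMessage());
                  }
              }
              return FileVisitResult.CONTINUE;
          }

          @Override
          public FileVisitResult visitFileFailed(Path file, IOException exc) {
              System.err.println("WARNING: could not read " + file + ": " + exc.getMessage());
              return FileVisitResult.CONTINUE;
          }
      });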

    •  findMatches method:

      • Use Files.walkFileTree with a custom SimpleFileVisitor for the second directory.

      • Inside visitFile:

        • Check if it's a regular file.

        • Get the file size.

        • Check if the filesBySizethenHash map contains this size as a key (containsKey(size)).

        • If the size exists:

          • Calculate the SHA-256 hash for the current file.

          • Get the inner map for this size: Map<String, List<Path>> hashesForSize = filesBySizethenHash.get(size);

          • Check if the inner map contains the calculated hash as a key (containsKey(hash)).

          • If the hash exists:

            • Get the list of original paths: List<Path> originalPaths = hashesForSize.get(hash);

            • For each originalPath in originalPaths:

              • Add a new Pair(originalPath, currentFilePath) to the duplicates list.

      • Handle visitFileFailed.
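
The size-then-hash lookup inside that second visitor could be sketched as follows (null checks stand in for the containsKey calls; exceptions handled as in the first visitor):

      // Inside visitFile for the second directory:
      long size = attrs.size();
      Map<String, List<Path>> hashesForSize = filesBySizethenHash.get(size);
      if (hashesForSize != null) {                       // only hash when a size match exists
          String hash = calculateSHA256(file);
          List<Path> originalPaths = hashesForSize.get(hash);
          if (originalPaths != null) {
              for (Path originalPath : originalPaths) {
                  duplicates.add(new Pair<>(originalPath, file));
              }
          }
      }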

    •  calculateSHA256 method:

      • Takes a Path as input.

      • Creates MessageDigest.getInstance("SHA-256").

      • Opens a FileInputStream for the path (use try-with-resources).

      • Reads the file in chunks (e.g., 8192 bytes) using a byte[] buffer.

      • Updates the MessageDigest with each chunk (md.update(buffer, 0, bytesRead)).

      • After reading, gets the digest bytes (md.digest()).

      • Converts the byte array to a hexadecimal String representation (important for using as a map key). A helper method bytesToHex is needed.

      • Returns the hex string.

      • Handles IOException and NoSuchAlgorithmException. Returns null or throws a custom exception on error.

    •  bytesToHex method:

      • Takes byte[] as input.

      • Iterates through the bytes.

      • Formats each byte as a two-digit hex string (e.g., using String.format("%02x", b)).

      • Appends to a StringBuilder.

      • Returns the final string.
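
Both helpers together, as one possible sketch (java.nio's Files.newInputStream is used here in place of FileInputStream; either works):

      static String calculateSHA256(Path file) throws IOException, NoSuchAlgorithmException {
          MessageDigest md = MessageDigest.getInstance("SHA-256");
          try (InputStream in = Files.newInputStream(file)) { // try-with-resources closes the stream
              byte[] buffer = new byte[8192];
              int bytesRead;
              while ((bytesRead = in.read(buffer)) != -1) {
                  md.update(buffer, 0, bytesRead);
              }
          }
          return bytesToHex(md.digest());
      }

      static String bytesToHex(byte[] bytes) {
          StringBuilder sb = new StringBuilder(bytes.length * 2);
          for (byte b : bytes) {
              sb.append(String.format("%02x", b)); // two lowercase hex digits per byte
          }
          return sb.toString();
      }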

    • Helper Pair class:

      • A generic class Pair<L, R> with fields left and right.

      • Constructor, getters, potentially equals and hashCode. A record is perfect for this in modern Java.
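
With records (Java 16+), that helper is a one-liner:

      record Pair<L, R>(L left, R right) { }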

    • Output Formatting:

      • In main, iterate through the duplicates list.

      • Print each pair clearly, indicating which file is from which original directory. Grouping by original file might be nice but adds complexity; let's start simple.

  5. Refinement and Edge Cases:

    • Large Files: The current approach reads the entire file to hash it. This is necessary for correctness but can be memory/time-intensive for huge files. No easy way around this if full content comparison is required. Using buffered reads helps.

    • Permissions: The application needs read permissions for all files and directories being scanned. SecurityException should be caught.

    • Symbolic Links: Files.walkFileTree by default does not follow symbolic links. This is usually desired behavior to avoid infinite loops and scanning outside the target directories. If following links is needed, FileVisitOption.FOLLOW_LINKS can be added. Need to be careful about cycles. Files.isRegularFile should handle links appropriately depending on options. Let's stick to the default (no link following) for safety.

    • Empty Files: Empty files will have size 0 and the same hash. The code should handle this correctly.

    • Performance: Hashing is the bottleneck. Pre-checking size drastically reduces the number of hashes needed. Using Files.walkFileTree is efficient for traversal. The map lookups are fast (O(1) average).

    • Clarity: Use meaningful variable names. Add comments explaining complex parts. Structure the code logically into methods.

    • User Experience: Provide progress indication (e.g., print dots or currently scanned directory) for long operations (maybe add later if needed). Clear error messages.

    • Dependencies: Keep dependencies minimal (only standard Java SE library).

  6. Code Implementation: Translate the refined algorithm and data structures into Java code, creating the necessary classes and methods. Add imports. Test with sample directories.

  7. Final Review: Check for correctness, completeness, error handling, style, and comments. Ensure the output format is clear. Make sure resources like InputStream and Scanner are closed properly (try-with-resources helps).

This structured thinking process, moving from requirements to tools, algorithm design, implementation details, and refinement, helps ensure all aspects are considered and leads to a robust and correct solution.

RESULTS 1

Features:

  1. Command-line execution: Takes two directory paths as arguments.

  2. Recursive Scan: Scans both specified directories and their subdirectories.

  3. Size Check First: Optimizes by only calculating hashes for files with matching sizes.

  4. SHA-256 Hashing: Uses SHA-256 for robust content comparison.

  5. Handles Different Names/Extensions: Focuses solely on content identity.

  6. Clear Output: Lists pairs of identical files found in the two different base directories.

  7. Error Handling: Basic checks for directory validity and handles potential I/O or hashing errors.

  8. Modern Java: Uses java.nio.file API and try-with-resources.

How to Compile and Run:

  1. Save: Save the code as DuplicateFileFinder.java inside the directory structure mentioned (Example: src/main/java/com/example/duplicatefinder/).

  2. Compile (using command line):

    • Navigate to the src/main/java directory in your terminal.

    • Compile:

      javac com/example/duplicatefinder/DuplicateFileFinder.java
    • This will create DuplicateFileFinder.class and DuplicateFileFinder$DuplicatePair.class inside com/example/duplicatefinder.

  3. Run:

    • Stay in the src/main/java directory (or navigate back to the project root if using Maven structure).

    • Execute, replacing <path_to_dir_1> and <path_to_dir_2> with the actual absolute or relative paths to the folders you want to compare on your Windows machine:

      java com.example.duplicatefinder.DuplicateFileFinder "C:\Users\YourUser\Documents\FolderA" "D:\Backups\FolderB_copy"
      • Important: Use quotes around paths, especially if they contain spaces.

Example Output:

Starting duplicate file search...
Directory 1: C:\Users\TestUser\Documents\SourceFolder
Directory 2: C:\Users\TestUser\Desktop\CompareFolder
Comparison based on Size and SHA-256 hash.
--------------------------------------------------
Scanning Directory 1: C:\Users\TestUser\Documents\SourceFolder ...
Directory 1 scan complete. Indexed 5 files.

Scanning Directory 2: C:\Users\TestUser\Desktop\CompareFolder and comparing...
WARNING: Could not read or hash file: C:\Users\TestUser\Desktop\CompareFolder\~$locked.docx (The process cannot access the file because it is being used by another process.)
Directory 2 scan and comparison complete.

--------------------------------------------------
Search finished in 0.85 seconds.
Found 2 pairs of identical files:
--------------------------------------------------
File in Dir 1: C:\Users\TestUser\Documents\SourceFolder\report.pdf
  Identical file(s) in Dir 2:
    - C:\Users\TestUser\Desktop\CompareFolder\final_report_copy.pdf

File in Dir 1: C:\Users\TestUser\Documents\SourceFolder\images\logo.png
  Identical file(s) in Dir 2:
    - C:\Users\TestUser\Desktop\CompareFolder\Graphics\company_logo.png

TIME: 69 seconds

THOUGHTS 2

Here's a breakdown of the thought process to modify the DuplicateFileFinder code:

  1. Understand the Requirements:

    • Keep the core functionality: Find duplicate files (size + SHA hash) in two directories.

    • Add command-line options using a library for easier parsing (like Apache Commons CLI or picocli). Picocli is generally more modern and annotation-based, making it cleaner. Let's go with picocli.

    • New options:

      • --type or -t: Filter by file extension (case-insensitive). Takes one argument (e.g., "PDF", "csv").

      • --size or -s: Filter by minimum size. Takes one argument (e.g., "1024", "10M", "2g"). Need to parse the unit (B, K, M, G).

    • Handle missing directory arguments: If the two directory paths are not provided as positional parameters, prompt the user interactively to enter them.

  2. Choose a Command-Line Parsing Library:

    • Manual parsing (args[] loop) is tedious and error-prone for options.

    • Apache Commons CLI is established but slightly more verbose.

    • Picocli is annotation-driven, provides excellent usage help generation, and is well-suited for this. Decision: Use Picocli.

  3. Integrate Picocli:

    • Add the Picocli dependency (if using Maven/Gradle) or ensure the JAR is on the classpath.

    • Annotate the main class (DuplicateFileFinder) with @CommandLine.Command.

    • Define fields for the options and positional parameters using Picocli annotations:

      • @CommandLine.Option for --type (-t) and --size (-s).

        • Specify names (names), description (description).

        • type: String for --type.

        • size: String for --size (will parse manually later).

      • @CommandLine.Parameters for the directory paths.

        • Use index = "0..1" or similar if we always expect them, or make them optional if prompting is the primary fallback. Correction: The requirement is to prompt if they are missing. So, make the parameters optional (arity = "0..2") or handle the args length before parsing if Picocli doesn't elegantly handle "required or prompt". Let's try making them regular parameters (index = "0", index = "1") and check after parsing if they are null. Refinement: A cleaner Picocli way is to make them @Options as well, or use a dedicated List<Path> parameter. An even better Picocli approach is to make the main class implement Runnable or Callable, and Picocli will handle the execution flow. Let's make the class implement Callable<Integer> (to return an exit code). Define the directory paths as @CommandLine.Parameters.

    • Modify the main method:

      • Instantiate the annotated class.

      • Create a CommandLine object with the instance.

      • Call execute(args). Picocli will parse args, populate the annotated fields, and call the call() method (because we implement Callable).

      • Handle exit codes.
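
One possible skeleton of that wiring, assuming picocli 4.x on the classpath (names follow this plan; descriptions shortened and the search logic elided):

      import java.nio.file.Path;
      import java.util.concurrent.Callable;
      import picocli.CommandLine;
      import picocli.CommandLine.Command;
      import picocli.CommandLine.Option;
      import picocli.CommandLine.Parameters;

      @Command(name = "duplicate-file-finder", mixinStandardHelpOptions = true,
               description = "Finds identical files (size + SHA-256) in two directories.")
      public class DuplicateFileFinder implements Callable<Integer> {

          @Option(names = {"-t", "--type"}, description = "Check only files of this type, e.g. -t PDF")
          String fileTypeFilter;

          @Option(names = {"-s", "--size"}, description = "Check only files larger than this, e.g. --size 10M")
          String minSizeStr;

          @Parameters(index = "0", arity = "0..1", description = "First directory to compare")
          Path dir1Path;

          @Parameters(index = "1", arity = "0..1", description = "Second directory to compare")
          Path dir2Path;

          @Override
          public Integer call() {
              // Prompt for missing directories, parse the filters, then run the search.
              return 0;
          }

          public static void main(String[] args) {
              System.exit(new CommandLine(new DuplicateFileFinder()).execute(args));
          }
      }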

  4. Implement the call() Method (the main logic):

    • This method replaces the logic previously in main.

    • Directory Handling:

      • Check if dir1Path and dir2Path (the @Parameters fields) were populated by Picocli.

      • If either is null, call a new method promptForDirectories() to get them interactively using System.console() or Scanner. Store the results back into dir1Path and dir2Path.

      • Validate the obtained directories (exist, are directories, are different, check for overlap) as before.

    • Filtering Logic:

      • Parse the --size option string (minSizeStr). Create a helper method parseSize(String sizeStr) that handles "1024", "10M", "2G", etc., returning the size in bytes (long). Handle potential NumberFormatException or invalid unit errors. Store the result in a minSizeBytes field (defaulting to 0 or -1 if not specified).

      • Normalize the --type option (fileTypeFilter). Convert it to lowercase for case-insensitive comparison. Store it in a field (e.g., normalizedFileTypeFilter).

      • Modify the SimpleFileVisitor logic (in both buildFileIndex and findMatchesInSecondDirectory):

        • Inside visitFile, before processing the file, add checks:

          • Size Check: if (minSizeBytes > 0 && attrs.size() < minSizeBytes) { return FileVisitResult.CONTINUE; }

          • Type Check: if (normalizedFileTypeFilter != null && !getFileExtension(file).equalsIgnoreCase(normalizedFileTypeFilter)) { return FileVisitResult.CONTINUE; }. Need a helper getFileExtension(Path file) method.

    • Core Logic: The rest of the logic (building the index, comparing, hashing) remains largely the same, but now it operates only on the files that pass the filters.

    • Output: Print the applied filters at the beginning for clarity. The final duplicate reporting remains the same.

    • Return an exit code (0 for success, non-zero for errors).

  5. Helper Methods:

    • promptForDirectories(): Uses Scanner to read paths from standard input (System.console() can return null when no terminal is attached, so Scanner is the more robust choice). Includes basic retry logic if a path is invalid.

    • parseSize(String sizeStr): Parses strings like "10M", "2G". Handles units (K, M, G, case-insensitive). Returns size in bytes (long). Throws IllegalArgumentException on error.

    • getFileExtension(Path file): Extracts the file extension (text after the last dot). Handles cases with no extension or leading dots. Returns an empty string or the extension (lowercase). Refinement: Make the comparison case-insensitive later, so returning the original case extension is fine, but converting the filter to lowercase is essential. Let's return the lowercase extension for consistency.
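
Possible sketches of the two parsing helpers (illustrative only; the generated code may differ in details):

      static long parseSize(String sizeStr) {
          String s = sizeStr.trim().toUpperCase();
          long multiplier = 1;
          if (!s.isEmpty()) {
              char unit = s.charAt(s.length() - 1);
              if (unit == 'K' || unit == 'M' || unit == 'G') {
                  multiplier = unit == 'K' ? 1024L
                             : unit == 'M' ? 1024L * 1024
                             : 1024L * 1024 * 1024;
                  s = s.substring(0, s.length() - 1);
              }
          }
          try {
              return Long.parseLong(s) * multiplier; // plain numbers are taken as bytes
          } catch (NumberFormatException e) {
              throw new IllegalArgumentException("Invalid size value: " + sizeStr, e);
          }
      }

      static String getFileExtension(Path file) {
          String name = file.getFileName().toString();
          int dot = name.lastIndexOf('.');
          // No dot, a leading dot only (e.g. ".profile"), or a trailing dot means "no extension".
          return (dot <= 0 || dot == name.length() - 1) ? "" : name.substring(dot + 1).toLowerCase();
      }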

  6. Error Handling and Refinements:

    • Add try-catch around parseSize call.

    • Improve error messages (e.g., invalid size format, invalid directory input).

    • Ensure Scanner is closed in promptForDirectories.

    • Handle potential NullPointerException if file.getFileName() is null (unlikely for regular files but possible).

    • Make sure Picocli's help message (--help) is generated automatically. Add descriptions to options and parameters.

    • Consider edge cases: Empty directories, files with no extension, very large files (memory usage for index – current approach is okay for moderate use, but for massive datasets, more advanced techniques might be needed).

  7. Code Structure and Final Review:

    • Organize methods logically.

    • Add comments where necessary.

    • Ensure imports are correct.

    • Test with various command-line combinations:

      • No args (prompting).

      • Both dirs.

      • Dirs + type.

      • Dirs + size.

      • Dirs + type + size.

      • Invalid size format.

      • Invalid type format (though it's just a string).

      • Invalid directory paths.

      • Using short options (-t, -s).

      • Requesting help (--help, -h).

This detailed thought process, including choosing the right tools, planning the structure, implementing features incrementally, and considering edge cases, leads to the improved version of the code.

RESULTS 2

How to Compile and Run

  1. Ensure Picocli is Available:

    • If using Maven/Gradle, build your project (mvn package or gradle build). This usually creates a JAR with dependencies or puts them in a lib folder.

    • If manual, make sure picocli-x.y.z.jar is available.

  2. Compile:

    • Navigate to src/main/java (or your source root).

    • Compile (including picocli in the classpath):

      # If picocli JAR is in a 'lib' folder relative to project root
      javac -cp ../../../lib/picocli-4.7.6.jar com/example/duplicatefinder/DuplicateFileFinder.java
      
      # Or if picocli JAR is in the current directory (less common). Note that Windows
      # uses ";" as the classpath separator; Linux/macOS use ":".
      # javac -cp "picocli-4.7.6.jar;." com/example/duplicatefinder/DuplicateFileFinder.java
      
      # Simpler if using IDE or build tool
  3. Run:

    • Navigate to the directory containing the com folder (e.g., target/classes if using Maven, build/classes/java/main if using Gradle, or src/main/java if you compiled manually there and the JAR is accessible).

    • Execute using the java command, including picocli in the classpath.

    Example Usages:

    • Prompt for directories, no filters:

      java -cp <classpath_including_picocli> com.example.duplicatefinder.DuplicateFileFinder


    • Specify directories, no filters:

      java -cp <...> com.example.duplicatefinder.DuplicateFileFinder "C:\FolderA" "D:\FolderB"
    • Specify directories, filter by type (PDF):

      java -cp <...> com.example.duplicatefinder.DuplicateFileFinder --type PDF "C:\FolderA" "D:\FolderB"
      # or
      java -cp <...> com.example.duplicatefinder.DuplicateFileFinder -t pdf "C:\FolderA" "D:\FolderB"
      # or
      java -cp <...> com.example.duplicatefinder.DuplicateFileFinder "C:\FolderA" "D:\FolderB" -t PDF
    • Specify directories, filter by size (greater than 10 MB):

      java -cp <...> com.example.duplicatefinder.DuplicateFileFinder --size 10M "C:\FolderA" "D:\FolderB"
      # or
      java -cp <...> com.example.duplicatefinder.DuplicateFileFinder -s 10m "C:\FolderA" "D:\FolderB"
    • Specify directories, filter by type AND size:

      java -cp <...> com.example.duplicatefinder.DuplicateFileFinder -t TXT -s 1K "C:\FolderA" "D:\FolderB"
    • Get Help:

      java -cp <...> com.example.duplicatefinder.DuplicateFileFinder --help

This version is much more user-friendly due to the command-line options and interactive prompting provided by Picocli. Remember to adjust the classpath (-cp) argument according to how you are managing the Picocli dependency.

TIME: 79 seconds.

TIME TOTAL: 148 seconds. 

FINAL TOKEN COUNT: 17235 / 1048576

FINAL CODE: https://github.com/juanjosearanda/DuplicateFilesFinder 
