Show HN: Version code, models, & datasets together in GitHub
15 by skadamat | 5 comments on Hacker News.
Hi HN! We just launched a GitHub integration that scales your Git repos to handle 100 terabytes of files in a single repo. XetData enables data scientists and machine learning engineers to version code, models, and datasets together. Most teams have glued together clunky workflows from S3, DVC, Git, Git LFS, and other tools, which makes true reproducibility difficult: https://ift.tt/n4oqxgX

We instead embrace and extend Git so end users don’t need to learn a new tool and a new set of commands. Our implementation is similar to Git LFS: we take over the .gitattributes file, push pointers to large files to GitHub, and push the raw, large files to us.

We have a few distinct features that we’re proud of that improve the user experience:

- Our XetData bot comments on your pull requests to provide links to useful dataset views and model diffs. We’re working on rendering these inside GitHub itself using browser extensions.
- Git LFS and similar tools only implement file-level deduplication. We created a new technique called block-based deduplication (published at the CIDR ’23 conference) specifically for data and ML workflows. The ML lifecycle consists of making lots of iterative changes, and our technique saves storage as well as time spent downloading and uploading changes.
- You can mount large repos to your local machine using git-xet mount for exploratory work. Individual files are streamed in just in time, behind the scenes, as they are needed. We open sourced our implementation of mount and it was well received here on HN: https://ift.tt/KlwtBcQ
- To give more users access to your data, just add them to your GitHub repo.

This is a beta product and we would love all of your feedback. You can find all instructions to try this out here: https://ift.tt/mHyhAbS

While we’re in beta, our product is completely free to use. We have a Slack you can join and a GitHub issue tracker.

- Slack: https://ift.tt/cRyfeB6
- GitHub: https://ift.tt/wpd2br3
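
To make the file-level versus block-level deduplication point concrete, here is a minimal Python sketch. It is not XetData's implementation: the post does not describe the chunking scheme, so this assumes fixed-size blocks and SHA-256 hashes purely for illustration, and `BLOCK_SIZE`, `file_level_new_bytes`, `block_level_new_bytes`, and `demo` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def file_level_new_bytes(store: set, data: bytes) -> int:
    """Bytes that must be stored when whole files are the dedup unit."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in store:
        return 0
    store.add(digest)
    return len(data)


def block_level_new_bytes(store: set, data: bytes) -> int:
    """Bytes that must be stored when fixed-size blocks are the dedup unit."""
    new_bytes = 0
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store.add(digest)
            new_bytes += len(block)
    return new_bytes


def make_block(i: int) -> bytes:
    """Deterministic 4096-byte block that differs for each index i."""
    return hashlib.sha256(str(i).encode()).digest() * (BLOCK_SIZE // 32)


def demo() -> dict:
    # Version 1: 100 distinct blocks standing in for a dataset file.
    v1 = b"".join(make_block(i) for i in range(100))
    # Version 2: the same file with four bytes overwritten inside block 10.
    v2 = bytearray(v1)
    v2[10 * BLOCK_SIZE:10 * BLOCK_SIZE + 4] = b"edit"
    v2 = bytes(v2)

    file_store, block_store = set(), set()
    file_level_new_bytes(file_store, v1)
    block_level_new_bytes(block_store, v1)
    # Report how many extra bytes the second version costs in each scheme.
    return {
        "file_level": file_level_new_bytes(file_store, v2),
        "block_level": block_level_new_bytes(block_store, v2),
    }


if __name__ == "__main__":
    print(demo())
```

In this toy setup a four-byte edit forces the file-level store to keep the whole second version again, while the block-level store keeps only the one changed block. Fixed-size blocks stop lining up as soon as bytes are inserted or deleted, so a real system would likely need content-defined boundaries; that detail is outside what the post states.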