A machine learning researcher has released a massive dataset containing one million public posts from leftist social media echo chamber Bluesky, raising questions about data privacy and consent. The data could be used to train AI to be even more woke than notoriously left-leaning AI chatbots like ChatGPT.
404 Media reports that Daniel van Strien, a machine learning librarian at AI community platform Hugging Face, released a dataset composed of one million Bluesky posts, a move that has raised concerns about user privacy and consent. The dataset, intended for machine learning research, includes the text content of each post along with metadata such as the time of posting and the user’s decentralized identifier (DID).
Van Strien announced the dataset on Bluesky last week, stating, “This dataset contains 1 million public posts collected from Bluesky Social’s firehose API, intended for machine learning research and experimentation with social media data. Each post contains text content, metadata, and information about media attachments and reply relationships.”
While the data was collected from Bluesky’s public firehose API, which aggregates all public data updates on the platform in real time, the inclusion of user DIDs has raised privacy concerns. Because each post carries its author’s DID, the dataset is not anonymous, and van Strien also created a search tool for finding users based on their DID, which he published on Hugging Face.
A quick review of the dataset reveals that it contains a wide range of content, from political discussions and concert chatter to pornography. Notably, the dataset is a snapshot of Bluesky at a specific point in time, meaning it may include posts that have since been deleted by users. […]
— Read More: www.breitbart.com