r/learnprogramming Jun 21 '24

Is a web scraping project too advanced for a complete beginner? How illegal is it?

I’m just starting to get into coding (I’m a complete beginner and have only taken the Intro to Programming course on Codecademy), and the first project I want to work on is a program that notifies me when a new product I’m interested in, in my size, is posted on Buyee (a Japanese second-hand seller website). I started researching how this might work using ChatGPT and Perplexity, and both gave me similar answers, suggesting I use Python libraries such as BeautifulSoup and requests for web scraping.

I’m aware the website already has a function that does a similar thing, but I wanted to make it myself to understand how it works. Since I don’t know too much about this, would you say this is too advanced for a beginner to start with?

Also, I researched whether there are any legal issues with scraping Buyee, and its terms of service don’t mention anything, BUT the website is basically a proxy service for Yahoo Japan Shopping, Mercari, and Rakuten (second-hand online marketplaces), which do have rules against scraping. How illegal is this really if it’s for personal use only?

Any other suggestions and advice are gladly welcomed. Thanks!

1 Upvotes


2

u/plastikmissile Jun 21 '24

I don't think it's too advanced.
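For a sense of scale, the project in the post comes down to three steps: fetch a page, pull the listings out of the HTML, and compare them against the ones already seen. Here is a rough sketch of that idea, not a working Buyee scraper. It uses only Python's standard library (`urllib.request` and `html.parser`) in place of requests and BeautifulSoup so it runs without installing anything. The `item-link` class name, the sample markup, and the function names are all made up for illustration; the real page structure would need to be inspected in the browser first.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ListingParser(HTMLParser):
    """Collect link text from <a class="item-link" href="..."> tags.

    "item-link" is a hypothetical placeholder, not the real site's markup.
    """

    def __init__(self):
        super().__init__()
        self.listings = {}        # url -> raw title text
        self._current_url = None  # set while inside a matching <a> tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "item-link" in classes and attrs.get("href"):
            self._current_url = attrs["href"]
            self.listings[self._current_url] = ""

    def handle_data(self, data):
        if self._current_url is not None:
            self.listings[self._current_url] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_url = None


def parse_listings(html):
    """Return {url: title} for every matching link in the HTML."""
    parser = ListingParser()
    parser.feed(html)
    # Collapse runs of whitespace inside each title.
    return {url: " ".join(title.split()) for url, title in parser.listings.items()}


def find_new(listings, seen_urls, size_keyword):
    """Return listings not seen before whose title mentions the wanted size."""
    return {
        url: title
        for url, title in listings.items()
        if url not in seen_urls and size_keyword.lower() in title.lower()
    }


def fetch(url):
    """Download a page as text; the User-Agent string is an arbitrary example."""
    request = Request(url, headers={"User-Agent": "personal-learning-script"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

A full version would call `fetch` on the search page every so often (with a generous pause between requests), pass the result through `parse_listings` and `find_new`, send a notification for anything returned, and then add those URLs to the seen set. With BeautifulSoup, the parser class would shrink to a single CSS-selector lookup.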

As for legality, we're not lawyers, so we're not qualified to give legal advice.

1

u/Clueless_Otter Jun 21 '24

Web scraping is not illegal. At most, it might be against a site's ToS, but breaking a ToS is not illegal. You'll just get banned from the site, not in any real trouble.

What you do with the data you scraped might be illegal depending on what it is, though. Anything a website publishes that has any sort of creative element to it is automatically copyrighted, unless it is specifically noted to be copyright-free. What does "creative element" mean? Well, that's a gray area. If it's purely just some numbers with nothing added to them, it's probably not copyrighted; pure "data" can't be copyrighted. There was a case relatively recently involving airline web scraping, where someone was scraping a website to get airline ticket prices and the court ruled that a ticket price was just data, not something that can be copyrighted. On the other end, if you scraped, for example, a cooking recipe or a movie review and then pasted it onto your own site, that's going to be copyright infringement. Stuff in between, where it's kind of data-y but the site might have added some extra value of its own? Again, gray area.

It also depends what exactly you're scraping it for. A student project just to learn that you're not even going to host anywhere public? Extremely unlikely to be an issue, even if it might technically be illegal. Trying to scrape a website to start up your own commercial competitor? Definitely more likely to be an issue.