Super simple python web scraper/file downloader

After waddling my way through some python learning courses, I finally stumbled into an excellent “next step” programming challenge. It had the ideal combination of a connection to my real life, super straightforward goals, and a handful of moving parts that I was pretty sure I could figure out (but that I would, in fact, have to figure out). The project was to download an image of the front page of every People’s Daily back to 1993. This post walks through how to build the python script in the way I wish someone had done when I was trying to figure this out. That means that instead of snippets pulled out of a longer unified program, it has a series of discrete programs that build on one another. For me, that makes it easier to figure out how each part works.

The “why” I did this is not particularly important, but the “how” is. The excellent news is that these images are all stored on a server in a standard way. For example, the image for the cover on April 1, 2000 lives here:

http://58.68.146.102/pic/101p/2000/04/2000040101.jpg

After the /101p/, the pattern is simple: year/month/YearMonthDayPage.jpg
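To make the pattern concrete, here is a minimal sketch that takes the example URL apart again. It assumes Python 3, and the function name parse_image_url is made up for this illustration. It relies only on the fixed character positions in the file name: four for the year, then two each for the month, day, and page.

```python
def parse_image_url(url):
    """Split an image URL into the date parts encoded in its file name.

    parse_image_url is a hypothetical helper, not part of the scripts below.
    """
    filename = url.rsplit("/", 1)[-1]    # "2000040101.jpg"
    stem = filename[:-len(".jpg")]       # "2000040101"
    return {
        "year": stem[0:4],
        "month": stem[4:6],
        "day": stem[6:8],
        "page": stem[8:10],
    }


parts = parse_image_url("http://58.68.146.102/pic/101p/2000/04/2000040101.jpg")
print(parts["year"], parts["month"], parts["day"], parts["page"])
```

Everything comes back as a string, leading zeros included, which matters for the two-digit issue that comes up in Step 2.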

That means I didn’t have to mess with creating a BeautifulSoup object or any real web scraper. All I needed to do was to create a script that would download the file, move on to the next day, and download that file.

While that is simple from a programming standpoint, it still involves concrete steps that you have to code. The rest of this post walks through each step with a script you can download and break however you want. All of the code, which is hosted at this github repository, is licensed under a CC0 license (more on why I decided to do that at the end).

Step 1: Download & save a picture

The first thing I decided I needed to figure out was how to have python download a picture and save it. While I started with the template from the excellent Automate the Boring Stuff chapter on web scraping (specifically the exercise that shows you how to download XKCD strips), I quickly realized that it was overkill. I didn’t need to download and parse a page to find the URL - I already had it. After a few dead ends using requests, I ended up using the urllib library. The syntax was fairly easy: call it and then pass it the URL you are downloading from and the name you want to save it as. Here is a fully functioning script using urllib:

import urllib

urllib.urlretrieve('http://58.68.146.102/pic/101p/2000/04/2000040101.jpg', 'test.jpg')

The first line imports the urllib library. The second line calls the urlretrieve function from urllib, tells it to download the file at http://58.68.146.102/pic/101p/2000/04/2000040101.jpg and save it to a file called test.jpg. Note that this will save the file in the directory you run the script from, which is usually the same directory as the python script. You can call the output file whatever you want.
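The script above is written for Python 2. As a hedged sketch for anyone on Python 3, the same function is available as urllib.request.urlretrieve. The wrapper name download is just an illustration, and the example call is commented out so that nothing is fetched by accident.

```python
import urllib.request


def download(url, filename):
    """Fetch url, save it as filename, and return the path it was saved to.

    download is a hypothetical wrapper used only for this example.
    """
    # In Python 3, urlretrieve lives in urllib.request rather than urllib.
    # It returns a (local_filename, headers) pair.
    saved_path, _headers = urllib.request.urlretrieve(url, filename)
    return saved_path


# Example call:
# download("http://58.68.146.102/pic/101p/2000/04/2000040101.jpg", "test.jpg")
```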

Step 2: Count and Correct Numbers

Next, I needed to remind myself/confirm that I understood how to add 1 to a variable a certain number of times. That’s a fairly straightforward concept.

I also needed to figure out how to make sure the numbers that represented the date in the program would work when applied to the URL. The target URL always uses two-digit dates. That means that April is not represented as “4” but rather as “04”. In order to be able to find the right file I had to figure out how to turn any integer into a string with two characters no matter what.

Here’s the script that does both of those things:

day = 5
for i in range(0, 15):
    print day
    day_fixed = str(day).zfill(2)
    print day_fixed
    day += 1
    print "I've added 1"

The first line sets day to a starting integer. Note that this is an actual number, not a string.

The rest of the script is a loop that will run a set number of times. To pick the number of times it will run, change the second argument passed to range (currently that is the 15) to whatever number you want.

All of the “print” commands are just for debugging. The real action is in the two other lines.

day_fixed is where I hold the version of the day that is turned into a two character string (this becomes important later). “str(day)” turns the integer day into a string. .zfill(2) forces that string to be two characters, and will add a 0 to the front if need be.

day += 1 just takes day and adds 1 to it. That way the days advance every time the loop runs.
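Here is a small Python 3 sketch of the padding step on its own. The helper name two_digits is made up for illustration. It also checks the result against the format spec "{:02d}", which is another way to get the same padding.

```python
def two_digits(number):
    """Pad an integer to at least two characters, as day_fixed does above.

    two_digits is a hypothetical helper name.
    """
    return str(number).zfill(2)


for day in (4, 9, 10, 31):
    # "{:02d}".format(day) is an equivalent spelling using a format spec.
    assert two_digits(day) == "{:02d}".format(day)
    print(day, "->", two_digits(day))
```

Note that zfill only ever adds zeros. A number that already has two or more digits comes back unchanged, so it never truncates anything.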

Step 3: Iterate

The next step was to combine the ability to advance the dates with the ability to download the file. I knew that I would eventually have to make all of the date variables advance (day, month, and year), but I decided to make sure I could do just one before tackling all of them. Because of that, the script uses strings for the year, month, and page parts of the URL. That lets it focus on changing just the day:

import urllib
core = "http://58.68.146.102/pic/101p/"
#these all have to be numbers not strings - need to change them
year = "2000"
month = "04"
day = 2
page = "02"
for i in range(0,6):
    day += 1
    #turns day into a string
    day_fixed = str(day).zfill(2)
    urllib.urlretrieve(core+year+"/"+month+"/"+year+month+day_fixed+page+".jpg", year+"-"+month+"-"+day_fixed+"-"+page+".jpg")

Again, the first line imports the urllib that is used to actually download the image file.

The “core” variable is the part of the URL that stays constant. I didn’t have to make it a variable, but since the target URL is a bit complicated, turning it into a variable makes it a little bit easier to work with.

After the note to myself that I’ll need to change the strings to numbers, there is the list of variables that represent the date. Then there is a loop that is identical to the last script without the print lines for troubleshooting (it works….).

The last line is the command to download the file. It is a single line, although it might wrap in your browser. The first part just calls the urllib.urlretrieve() function. This function will download whatever URL is in the ().

Inside the () is the URL broken up into pieces. These need to be broken up because while the relative position of each variable is the same - the file name is always the year followed by the month, the day, and the page - the actual value of the variables will change as the script works its way through the process. That’s kind of the point. For elements that do not have a variable, I just used strings surrounded by quotation marks.

Each time the loop runs the day is advanced 1, the two character version of the day is created, and then the file at a URL that includes that new day is downloaded and assigned a file name that is also based on the date.

While the first argument passed to the urlretrieve function has to be what it is, you could change the second argument to whatever you want. I decided to name each file year-month-day-page.jpg.
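When something goes wrong, it can help to look at the URL and file name pairs before downloading anything. The sketch below (Python 3) builds the same strings as the Step 3 script and prints them instead of fetching them. The name day_targets and its parameters are assumptions made for this example.

```python
CORE = "http://58.68.146.102/pic/101p/"


def day_targets(year, month, first_day, count, page):
    """Return (url, filename) pairs for `count` consecutive days.

    year, month, and page are already-padded strings; first_day is an int.
    day_targets is a hypothetical helper, not part of the original script.
    """
    targets = []
    for day in range(first_day, first_day + count):
        day_fixed = str(day).zfill(2)
        url = CORE + year + "/" + month + "/" + year + month + day_fixed + page + ".jpg"
        filename = year + "-" + month + "-" + day_fixed + "-" + page + ".jpg"
        targets.append((url, filename))
    return targets


# Same range as the script above: day starts at 2 and is bumped before use,
# so the six downloads cover days 3 through 8.
for url, filename in day_targets("2000", "04", 3, 6, "02"):
    print(url, "->", filename)
```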

Step 4: Multiple Variables

The next step is to expand the script so it iterates through all of the variables (day, month, and year), not just day. Of course, it has to do this in a systematic way so that every combination that corresponds to a date is actually downloaded. In order to do that, I nested loops within loops. I wasn’t sure it would work, so I started with just two variables (day and month) to see what would happen:

import urllib
core = "http://58.68.146.102/pic/101p/"
#these all have to be numbers not strings - need to change them
year = "2000"
month = 4
day = 2
page = "01"
for i in range(0,6):
    day += 1
    #turns day into a string
    day_fixed = str(day).zfill(2)

    for i in range(0,4):
        month += 1
        month_fixed = str(month).zfill(2)
        urllib.urlretrieve(core+year+"/"+month_fixed+"/"+year+month_fixed+day_fixed+page+".jpg", year+"-"+month_fixed+"-"+day_fixed+"-"+page+".jpg")
    month = 4

This is just the previous script with an additional for loop related to the month. There are two important things that need to be done in order for this to work correctly. First, the urllib function must be inside the deepest loop. The most nested loop runs all the way through before jumping up a level, so if urllib is bumped out a few levels you will miss a lot of the files.

Second, you need to reset the variable when its loop is done. The month loop starts at 4 (because that’s what I set it at in the start of the script) and works its way through 4 times until month is 8 (I limited the number of iterations for testing). Once it hits 8, the loop exits, the day variable is moved one day forward, and the month loop starts over again.

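However, if that is all you do, the month loop will start where it left off last time - at 8 instead of at 4. In this script, that means the second pass would ask for months 9 through 12 and the third for months 13 through 16. The same problem shows up with any nested counter. If a day loop ran 31 times for every month without a reset, by the second month you would be trying to download files for something like the 42nd day of February.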

In order to avoid that, I reset the month variable to its original value outside of the month loop. Now every time it runs it starts from the same place.
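To see the effect of the reset without downloading anything, here is a small Python 3 sketch that replays the loop structure from the script above and records which month values get used. The function name months_visited is made up for this example, and the outer loop is cut down to two passes to keep the output short.

```python
def months_visited(reset):
    """Replay the Step 4 loops and record every month value that gets used.

    months_visited is a hypothetical helper for illustration only.
    """
    visited = []
    month = 4
    for _ in range(0, 2):      # two passes of the outer (day) loop
        for _ in range(0, 4):  # four passes of the inner (month) loop
            month += 1
            visited.append(month)
        if reset:
            month = 4          # put the counter back before the next day
    return visited


print(months_visited(reset=True))   # [5, 6, 7, 8, 5, 6, 7, 8]
print(months_visited(reset=False))  # [5, 6, 7, 8, 9, 10, 11, 12]
```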

Step 5: Bringing it All Together

Now that I had figured out how all of the pieces worked, I was ready for the final version:

import urllib
core = "http://58.68.146.102/pic/101p/"
#start all of these one below where you want to start because step 1 is +=1
#DON'T FORGET IF YOU CHANGE A VALUE HERE TO ALSO CHANGE IT AT THE END OF THE FOR
year = 2002
month = 0
day = 0
page = "01"
for i in range(0,31):
    day += 1
    #turns day into a string
    day_fixed = str(day).zfill(2)

    for i in range(0,12):
        month += 1
        month_fixed = str(month).zfill(2)
        for i in range(0,6):

            year += 1
            year_fixed = str(year)
            #this needs to go at the bottom of the nest
            urllib.urlretrieve(core+year_fixed+"/"+month_fixed+"/"+year_fixed+month_fixed+day_fixed+page+".jpg", year_fixed+"-"+month_fixed+"-"+day_fixed+"-"+page+".jpg")
        #this resets the year to the starting point once the year loop is done
        year = 2002
    #this resets the month to the starting point so it stays in the right range
    month = 0

This uses year, month, and day as variables. Note that it doesn’t work through the pages. That variable is necessary for the download URL, but for this first version I didn’t need every page of the paper. As the comments at the top suggest, you need to set each variable to one less than its starting value, since the first thing that happens is that 1 is added to it. You can also manipulate how many years of covers you get by changing the second argument in the year range() function (currently it is set at 6).
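One alternative to the counters and manual resets, sketched here in Python 3, is to let each for loop supply its own value through range(). This is not how the script above does it. The name all_targets is an assumption for this example, and it only builds the list of URL and file name pairs rather than downloading them.

```python
CORE = "http://58.68.146.102/pic/101p/"


def all_targets(first_year, last_year, page="01"):
    """Build every (url, filename) pair in the same order as the final script.

    all_targets is a hypothetical helper; the nesting (day, then month,
    then year) mirrors the script above.
    """
    targets = []
    for day in range(1, 32):
        day_fixed = str(day).zfill(2)
        for month in range(1, 13):
            month_fixed = str(month).zfill(2)
            for year in range(first_year, last_year + 1):
                year_fixed = str(year)
                stem = year_fixed + month_fixed + day_fixed + page
                url = CORE + year_fixed + "/" + month_fixed + "/" + stem + ".jpg"
                filename = year_fixed + "-" + month_fixed + "-" + day_fixed + "-" + page + ".jpg"
                targets.append((url, filename))
    return targets


targets = all_targets(2003, 2008)
print(len(targets))   # 31 * 12 * 6 = 2232
print(targets[0][1])  # 2003-01-01-01.jpg
print(targets[1][1])  # 2004-01-01-01.jpg
```

Because range() starts fresh every time a loop is entered, there is nothing to reset and no need to start each value one below where you want it.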

Improvements

This script works, which is exciting. However, there are a few things I might improve going forward:

Be smart about dates. Right now it downloads 31 days of covers for every month. Obviously that means that some of the files are junk because there is no Feb 31st. The way to fix this would be to add an if statement to the day loop that changes the number of iterations based on the value of month. Since I didn’t really mind having a few garbage files I didn’t worry too much about this.
Reorder the nesting. Right now the script will download the January 1 cover for each year, then the February 1 cover for each year, and so on through December before moving on to January 2. This is fine, but if the download is interrupted for some reason it makes it a bit harder to understand where to restart, and it makes it a bit harder to get a sense of how far into the process you are by looking at what has been downloaded. By reordering the nesting, I could make it download the covers in chronological order.
Handle interruptions. The first two fixes are fairly easy, but this one would require actual additional research. I was testing this on a few wonky wifi connections, and sometimes I would lose the connection in the middle. This would cause the script to crash and stop. It would be great to learn how urllib.urlretrieve handles these sorts of problems, and make the script robust enough to recover. It can take a few minutes for the script to run, and it was a shame when I came back only to see that it had crashed three minutes in.
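The three ideas above could be combined roughly as follows. This is an untested-against-the-real-server sketch in Python 3, not the script from the repository. The names real_dates, target_for, and download_range are all made up for illustration, and it assumes that catching URLError and OSError covers the connection failures described above, which may not hold for every kind of dropped connection.

```python
import datetime
import os
import urllib.error
import urllib.request

CORE = "http://58.68.146.102/pic/101p/"


def real_dates(start, end):
    """Yield every calendar date from start to end, inclusive, in order."""
    current = start
    while current <= end:
        yield current
        current += datetime.timedelta(days=1)


def target_for(date, page="01"):
    """Return the (url, filename) pair for one date."""
    stamp = date.strftime("%Y%m%d")  # e.g. "20000401"
    url = CORE + date.strftime("%Y/%m/") + stamp + page + ".jpg"
    filename = date.strftime("%Y-%m-%d-") + page + ".jpg"
    return url, filename


def download_range(start, end, fetch=urllib.request.urlretrieve):
    """Download each cover in date order; return the filenames that failed."""
    failed = []
    for date in real_dates(start, end):
        url, filename = target_for(date)
        if os.path.exists(filename):
            continue  # already saved on an earlier run, so a restart skips it
        try:
            fetch(url, filename)
        except (urllib.error.URLError, OSError) as error:
            # Note the failure and keep going instead of crashing.
            print("could not fetch", url, "-", error)
            if os.path.exists(filename):
                os.remove(filename)  # drop any partial file so a rerun retries it
            failed.append(filename)
    return failed


# Example call:
# download_range(datetime.date(1993, 1, 1), datetime.date(1993, 12, 31))
```

Stepping through real dates with datetime means there is never a request for February 31st, the files arrive in chronological order, and rerunning the script after a crash picks up with the files that are still missing.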

The License

Why a CC0 license? First, these scripts are so simple that one could probably argue that they are devoid of copyrightable content (except for the comments). But second, and more importantly, to the extent they are useful to anyone they will be useful for hacking around with trying to learn python. Even an MIT license would require people to keep a license intact, which seemed too burdensome for this purpose. CC0 clarifies - to the extent that anyone is worrying about it - that you can do whatever you want with these scripts without worrying about copyright.

That’s the show. Hopefully this is helpful to someone else trying to teach themselves how to do this.

April 30, 2016