Anja's IMDb Web Scraper

1: Importing correct libraries

Just like with the last web scraping project, we must import our libraries for web scraping (requests and BeautifulSoup):

import requests
from bs4 import BeautifulSoup

Next we will want to install our data packages. In this project we will install Pandas, which is used to assemble the data into a data frame and also allows us to clean and analyze the data.

pip install pandas

Don’t forget that we need to import the library into our program:

import pandas as pd

Since we are scraping a source where the language of the movie titles can change based on location, we need to create a headers variable to ensure the page comes back in English:

headers = {"Accept-Language": "en-US, en;q=0.5"}

Just like before, we need to specify the URL and store the information from the webpage, so we will use the requests.get() method. This time we pass an extra parameter, headers = headers, so that the request carries the headers variable we created above and the page comes back in English:

URL = 'https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv'
results = requests.get(URL, headers = headers)

We will also have to parse our html, using BeautifulSoup:

soup = BeautifulSoup(results.text, 'html.parser')

To make sure you have actually collected data, you can print out the parsed HTML:

print(soup.prettify())

So here is the code we get for the first part:
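
The pieces of step one can be put together roughly as follows. This is only a minimal sketch: the helper names fetch_html and make_soup are my own hypothetical wrappers (they are not part of requests or BeautifulSoup), and the 10 second timeout is an assumption. The check at the bottom parses a tiny hand-written page, so nothing is downloaded when the block runs.

```python
import requests
from bs4 import BeautifulSoup

# Ask for English-language content, as in the headers variable above.
HEADERS = {"Accept-Language": "en-US, en;q=0.5"}
URL = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"


def fetch_html(url, headers=HEADERS, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # stop early on a 4xx/5xx response
    return response.text


def make_soup(html):
    """Parse HTML text with Python's built-in parser (hypothetical helper)."""
    return BeautifulSoup(html, "html.parser")


# Offline check with a hand-written page, so no request is sent here.
sample_soup = make_soup("<html><body><h3><a>Example</a></h3></body></html>")
print(sample_soup.h3.a.text)
```

To scrape the real page you would call make_soup(fetch_html(URL)) and then continue with step two.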

2: Collecting and storing data

Now we will create lists to store the information we collect, rather than just printing it out. First we initialize our lists. All you need to do to make an empty list is write “[]”. I wanted to get the following information: movie title, year released, length of the movie, which genres it belongs to, the IMDb rating, the metascore and how much the movie made.

titles = []
years = []
length = []
genre = []
ratings = []
metascores = []
us_gross = []

Now we want to look at the element source code of our page (remember: you can access it by right-clicking the page and selecting Inspect):

As we can see, each movie sits inside a ‘div’ whose class is ‘lister-item’, so to get all the movie elements we use find_all() with ‘div’ and class_ = 'lister-item':

movie_elems = soup.find_all('div', class_ = 'lister-item')

Now, in order to collect the specific data for each movie, we will use a for-loop. Every snippet in the rest of this step belongs inside the body of this loop, indented one level:

for i, movie_elem in enumerate(movie_elems):

Title

To extract the title element, we look at the element source code and see that it is contained within the h3 section:

So our code would be:

title_elem = movie_elem.find('h3', class_ = 'lister-item-header').text.strip()

But we can actually simplify it. Looking at the nesting, we can see that inside ‘div class="lister-item-content"’ we have ‘h3’ and then ‘a’, which contains the information we need. So we can simply specify that the title element comes from movie_elem.h3.a and that we want to extract just its text:

title_elem = movie_elem.h3.a.text

If the ‘a’ tag is missing, movie_elem.h3.a is None, and asking None for .text raises an AttributeError. So we can error check right away:

if movie_elem.h3.a:
    title_elem = movie_elem.h3.a.text
else:
    title_elem = 'No title found'

Since this is very simple code, we can further simplify it by writing it on one line:

title_elem = movie_elem.h3.a.text if movie_elem.h3.a else 'No title found'

Then we want to add the title element to our titles list by using .append() method:

titles.append(title_elem)
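
To see the None check in action without touching the real site, here is a small sketch on hand-written markup. The HTML below is an assumption made up for illustration, not IMDb's actual page. The second item has no ‘a’ tag, so it falls back to the default text.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption), not copied from IMDb.
html = """
<div class="lister-item"><h3><a>First Movie</a></h3></div>
<div class="lister-item"><h3>Heading without a link</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

titles = []
for movie_elem in soup.find_all("div", class_="lister-item"):
    link = movie_elem.h3.a  # None when the h3 has no <a> tag inside it
    titles.append(link.text if link else "No title found")

print(titles)  # ['First Movie', 'No title found']
```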

Year

Next, we want to pull the year the movie was released:

Since we can see that there is more than one ‘span’ found in the nesting, we must specify which one needs to be found.

year_elem = movie_elem.h3.find('span', class_ = 'lister-item-year').text if movie_elem.h3.find('span', class_ = 'lister-item-year') else 'No year found'

Then we also .append() it to the years list.

years.append(year_elem)

Runtime

To pull the runtime of the movie:

We want to go through the ‘p’ subsection of the ‘div’ before looking for the ‘span’ that contains the runtime (you don’t strictly need .p, but narrowing the search helps make sure the data is pulled from the right place):

length_elem = movie_elem.p.find('span', class_ = 'runtime').text if movie_elem.p.find('span', class_ = 'runtime') else 'No movie length found'
length.append(length_elem)

Genre

To pull the genre:

We want to use .strip() because the formatting of the string has spaces that we do not want.

genre_elem = movie_elem.p.find('span', class_ = 'genre').text.strip()
genre.append(genre_elem)

Ratings

To pull the ratings:

When collecting data we want to make sure that each variable has the correct data type. The extracted text is a decimal number, so we can convert the rating element to a float right away (using the float() function). Since within the ‘div’ section there is only one subsection called ‘strong’, we are able to use the same method we used for the movie title: movie_elem.strong.text.

rating_elem = float(movie_elem.strong.text)
ratings.append(rating_elem)

Metascore

To pull the metascore:

We will notice that the metascore is found within a span container, but the class attribute holds two strings: “metascore” plus either “favorable” or “mixed”. Since the second string changes, we have the .find() method look for the class ‘metascore’, which matches every metascore regardless of the second string. Similarly to above, the metascore text is always a whole number, so we can convert it with int() right away.

metascore_elem = int(movie_elem.find('span', class_ = 'metascore').text)

We also want to check whether a metascore exists before we try to convert it:

metascore_elem = int(movie_elem.find('span', class_ = 'metascore').text) if movie_elem.find('span', class_ = 'metascore') else "No metascore found"
metascores.append(metascore_elem)
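
Here is a small sketch of how matching on just one of the two class strings behaves. The markup is hand-written for illustration (an assumption, not IMDb's page): two spans carry “metascore” plus a second class, and the third movie has no metascore at all.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption), not copied from IMDb.
html = """
<div class="lister-item"><span class="metascore favorable">80</span></div>
<div class="lister-item"><span class="metascore mixed">55</span></div>
<div class="lister-item"><span class="genre">Drama</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

metascores = []
for movie_elem in soup.find_all("div", class_="lister-item"):
    # class_ = 'metascore' matches any span that has 'metascore' among
    # its classes, whatever the second class happens to be.
    span = movie_elem.find("span", class_="metascore")
    metascores.append(int(span.text) if span else "No metascore found")

print(metascores)  # [80, 55, 'No metascore found']
```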

US Gross Worth

Now when we look at the US gross worth, we can see that only some of the movies have it:

However, we can see that for the movies that do have the gross, there is more than one span container whose name attribute is ‘nv’. This means we have to specify which one we want to pull out. The first thing we will do is .find_all() with the correct attributes, but if you put

nv = movie_elem.find_all('span', name = 'nv')

you will get a TypeError, because name is already the name of find_all()’s first parameter (the tag to search for, here 'span'), so Python receives two values for it. Instead, we pass the HTML attribute through the attrs dictionary, with ‘name’ as the key and ‘nv’ as the value:

nv = movie_elem.find_all('span', attrs = {'name' : 'nv'})
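
The two calls can be compared on a small piece of hand-written markup (an assumption for illustration, not IMDb's page). The keyword version fails because find_all() already uses name for the tag being searched, while the attrs dictionary works.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption), not copied from IMDb.
html = """
<p>
  <span name="nv">1,000</span>
  <span name="nv">$28.34M</span>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# 'span' is already bound to find_all()'s first parameter, called name,
# so passing name='nv' as a keyword gives that parameter two values.
try:
    soup.find_all("span", name="nv")
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# Passing the HTML attribute through attrs avoids the clash.
nv = soup.find_all("span", attrs={"name": "nv"})
nv_texts = [span.text for span in nv]

print(error_name)  # TypeError
print(nv_texts)    # ['1,000', '$28.34M']
```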

Now that we have collected all the spans whose name attribute is ‘nv’, we need to specify which one we will use for the US gross. We can do this by checking how many ‘nv’ spans were collected by the .find_all() method. Using the len() function, we check whether the length of nv is greater than 1 (meaning more than one data point was collected). If there is more than one entry, we want to pull just the second one. Remember that indices start at 0, so the second element is denoted with a 1: nv[1].

if len(nv) > 1:
    us_gross_elem = nv[1].text

Now if only one ‘nv’ was pulled, we do not want it, because it means that specific movie does NOT have the US gross listed.

else:
    us_gross_elem = "-"

Then we can append it to the list:

us_gross.append(us_gross_elem)

Here is the code for step two:
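
The snippets from this step can be put together roughly as follows. This is a sketch under stated assumptions, not the exact program: the HTML is hand-written to imitate the structure described above, text_or is a hypothetical helper of my own that wraps the “if found, else default” pattern, and the 'No genre found' default is my addition.

```python
from bs4 import BeautifulSoup

# Hand-written markup imitating the structure described in this step;
# it is an assumption for illustration, not a copy of IMDb's page.
html = """
<div class="lister-item"><div class="lister-item-content">
  <h3 class="lister-item-header">
    <span class="lister-item-index">1.</span>
    <a href="#">Movie A</a>
    <span class="lister-item-year">(1994)</span>
  </h3>
  <p><span class="runtime">142 min</span>
     <span class="genre">
Drama            </span></p>
  <div class="ratings-bar"><strong>9.3</strong>
    <span class="metascore favorable">80        </span></div>
  <p><span name="nv">2,000,000</span><span name="nv">$28.34M</span></p>
</div></div>
<div class="lister-item"><div class="lister-item-content">
  <h3 class="lister-item-header">
    <a href="#">Movie B</a>
    <span class="lister-item-year">(2019)</span>
  </h3>
  <p><span class="runtime">95 min</span>
     <span class="genre">Comedy, Drama</span></p>
  <div class="ratings-bar"><strong>8.1</strong></div>
  <p><span name="nv">500,000</span></p>
</div></div>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or(elem, default):
    """Return elem's stripped text, or default when elem is None."""
    return elem.text.strip() if elem else default


titles, years, length, genre = [], [], [], []
ratings, metascores, us_gross = [], [], []

for movie_elem in soup.find_all("div", class_="lister-item"):
    header = movie_elem.h3
    titles.append(text_or(header.a, "No title found"))
    years.append(text_or(header.find("span", class_="lister-item-year"),
                         "No year found"))
    length.append(text_or(movie_elem.p.find("span", class_="runtime"),
                          "No movie length found"))
    genre.append(text_or(movie_elem.p.find("span", class_="genre"),
                         "No genre found"))
    ratings.append(float(movie_elem.strong.text))

    metascore = movie_elem.find("span", class_="metascore")
    metascores.append(int(metascore.text) if metascore else "No metascore found")

    # The second 'nv' span holds the gross; a single span means no gross.
    nv = movie_elem.find_all("span", attrs={"name": "nv"})
    us_gross.append(nv[1].text if len(nv) > 1 else "-")

print(titles, years, length, genre, ratings, metascores, us_gross)
```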

3: Data frames and cleaning up the data

Now that we have the data stored in lists, we want to make a data frame so that we can edit the data and then manipulate it further. We will use Pandas to do this. Remember that we imported Pandas at the start under the alias pd, meaning when we reference it we just need to write “pd.” to access functions from the Pandas library. I’ll quickly go over the format of a Pandas DataFrame:

pd.DataFrame({
    'title of column 1' : [what you want column 1 to contain],
    'title of column 2' : [what you want column 2 to contain],
    ...
})

So in our example we want to title the columns based on what the data displayed will be and we will connect it to the corresponding lists:

movies = pd.DataFrame({
    'movie' : titles,
    'year' : years,
    'length (min)' : length,
    'genre' : genre,
    'IMDb ratings' : ratings,
    'Metascore' : metascores,
    'US gross (millions)' : us_gross,
})

You can check that it was done properly by:

print(movies)

In this post it is not that important that the data types are correct as the data will not be used/manipulated. However, it is important to get into a good habit of making sure the data types are correct because you never know when you might go back to your data and need to manipulate and use it.

So first thing we will do is check which data types we have:

print(movies.dtypes)

Here are the results I got:

movie object
year object
length object
genre object
IMDb ratings float64 *because I classified it as float() earlier
Metascore int64 *because I classified it as int() earlier
US gross object

As we can see we need to change some of these:

Year: should be integer
Length: should be integer (since it is in minutes)
US gross: should be float (as there are decimals)

Let’s start with the year. We want to remove the parentheses that come with the year, so from the string we extract just the number. In the pattern ‘(\d+)’, \d matches a digit and "+" means one or more, so the pattern captures the first run of digits in the string. If we had written ‘(\d*)’ instead, "*" would mean zero or more digits, so the pattern would be allowed to match nothing at all. Next, after we have extracted the digits from the string, we use the .astype() method to convert the string to an integer. We also want to overwrite the data we previously had in the column with our newly converted integers, so we assign the result back to the same column (the r prefix keeps Python from treating the backslash as an escape, and expand = False returns a single column rather than a data frame):

movies['year'] = movies['year'].str.extract(r'(\d+)', expand = False).astype(int)

Note: .str is an attribute of Pandas series (which is a column of a data frame)
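
To make the difference between "+" and "*" concrete, here is a small sketch using Python's built-in re module on a single string. It assumes that str.extract() captures from the first match of the pattern, the same way re.search() does. The sample string is made up.

```python
import re

year_text = "(1994)"

# \d+ needs at least one digit, so the search skips "(" and captures "1994".
one_or_more = re.search(r"(\d+)", year_text).group(1)

# \d* also accepts zero digits, so it matches the empty string right at the
# start of the text, before the "(", and never reaches the digits.
zero_or_more = re.search(r"(\d*)", year_text).group(1)

print(repr(one_or_more))   # '1994'
print(repr(zero_or_more))  # ''
print(int(one_or_more))    # 1994
```

This is why "+" is the right choice here: "*" would leave us with an empty string that cannot be converted to an integer.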

Now for movie length, we will do the same thing because we want to extract just the number and leave out the “min”:

movies['length (min)'] = movies['length (min)'].str.extract(r'(\d+)', expand = False).astype(int)

For the US gross, the number we want sits between $ and M and contains a decimal point, so we cannot use the same \d+ pattern that we used for the other two cases (it would stop at the decimal point). When I researched how to remove the $ and M, I found that the .str.replace() method can be chained as many times as needed, in the form .str.replace(‘what you want to remove’, ‘what you want to put in instead’), so I tried:

movies['US gross (millions)'] = movies['US gross (millions)'].str.replace('M', '', regex = False).str.replace('$', '', regex = False)

Passing regex = False makes sure that ‘$’ is treated as a literal character rather than as a regular expression anchor. Although this works, it is not as clean as it could be with a function. It also passes over the column once per replacement, which could slow down our program if we have a long list of replacements. This is where a lambda function can be used instead. I went down a rabbit hole researching lambda and wrote about it on my Fun Things Learnt page.
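
One way the lambda version might look is sketched below. This is an illustration under my own assumptions, not necessarily the code used in the full project: the sample values are made up, and .map() with .lstrip() and .rstrip() is just one of several ways to do it.

```python
import pandas as pd

# Made-up sample values in the same shape as the scraped US gross column.
gross = pd.Series(["$28.34M", "-", "$134.97M"])

# One pass over the column: strip a leading "$" and a trailing "M".
# The "-" placeholder has neither, so it comes through unchanged.
cleaned = gross.map(lambda value: value.lstrip("$").rstrip("M"))

print(cleaned.tolist())  # ['28.34', '-', '134.97']
```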

Then we want to make sure that the US gross is a float, which we use Pandas for:

movies['US gross (millions)'] = pd.to_numeric(movies['US gross (millions)'])

However, this will raise an error, because the movies without a gross hold “-”, which cannot be converted to a number. So we must ask the function to “coerce” the errors, which turns every value it cannot convert into NaN:

movies['US gross (millions)'] = pd.to_numeric(movies['US gross (millions)'], errors='coerce')
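
The effect of errors='coerce' can be checked on a few made-up values (an assumption for illustration, not real scraped data). Without it, the “-” placeholder stops the conversion; with it, the placeholder becomes NaN and the other values become floats.

```python
import pandas as pd

# Made-up cleaned values; "-" stands for a movie with no US gross listed.
cleaned = pd.Series(["28.34", "-", "134.97"])

# Without errors='coerce', the "-" cannot be parsed and an error is raised.
try:
    pd.to_numeric(cleaned)
    raised = False
except ValueError:
    raised = True

# With errors='coerce', anything that cannot be parsed becomes NaN.
gross = pd.to_numeric(cleaned, errors="coerce")

print(raised)
print(gross)
```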

Now that that has been completed, the last thing we need to do is save it to a CSV file, so we can access it later and use the data (should we choose to).

movies.to_csv('IMDbmovies.csv')

Now it is better to have an absolute path to make sure that it gets saved to the correct folder you want:

movies.to_csv('/Users/anjawu/Code/imdb-web-scraping/IMDbmovies.csv')

Full code found here!

Anja Wu

Web Scraping Top 50 Movies: IMDb