Webscraping is fun and can sometimes be an important part of a business analysis. However, sometimes you would wish Alteryx had a built-in function to easily identify HTML-tables on a given webpage and then be able to select the desired table. Instead you often have to Reg-Ex your way out of it to reach your goal.
In this post I’ll show you how 2 lines of code in the Python tool can aid this issue by using the built-in Pandas module to get a list of tables and then selecting the desired table for your analysis.
I’ll demonstrate this by scraping a table containing data from bold.dk on the current leaderboard in the Danish football league, Superligaen.
To reach my goal, I need as little as 4 tools and 2 lines of (additional) code. My workflow looks as follows:
- The input tool just contains the URL-string of the site. https://www.bold.dk/fodbold/danmark/superligaen/
- The download tool downloads the content of the page using the following configurations:
- The python tool is configured in the following way:
I know… This is not only 2 lines of code, but the first two lines are required in order to make the Python tool work with Alteryx, so they don’t count. The third line reads in the data from the download tool and the last line outputs the data, so that doesn’t count either.
The remaining 2 lines marked in the red box are the only lines that actually parses the table, where the first line just imports the Pandas module. This leaves us with one line to do the job.
The argument in brackets, r[‘DownloadData’][0]
simply selects the first line in the DownloadData-column coming from the Download tool. Python uses zero-based indexing which is why I put 0 in the square-brackets. This returns a large text string of HTML code in python which is used by the line pd.read_html(*)
. This returns a list of DataFrames or tables. Lastly, the argument in square brackets, [1]
selects the desired table from the list of tables. In my case, the table I’m interested in was the 2. table in the list of tables (remember that python uses zero-based indexing).
The last line outside the red box simply outputs the table back to the canvas, which lets you continue your analysis with the well-known Alteryx tool pallet.
- The select- and select records tools are just used for the final ordering of the table. In the end we get this beautiful result: