Quick recipes to extract URLs — JavaScript, Python, Bash
JavaScript (browser or Node)
- Use a robust regex to find http/https URLs:
```javascript
const text = "Visit https://example.com/page?x=1 and http://sub.example.org.";
const urls = [...text.matchAll(/https?:\/\/[^\s"'<>]+/gi)].map(m => m[0]);
console.log(urls);
```
- To extract hrefs from HTML in browser:
```javascript
const anchors = Array.from(document.querySelectorAll('a[href]'));
const hrefs = anchors.map(a => a.href);
```
Python
- Simple regex to extract full URLs:
```python
import re

text = "See https://example.com and http://sub.example.org/page"
pattern = r'https?://[^\s"\'<>]+'
urls = re.findall(pattern, text)
print(urls)
```
- Use urllib/BeautifulSoup for HTML-safe extraction:
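BeautifulSoup (the third-party beautifulsoup4 package) is the usual choice for this. As a dependency-free sketch of the same idea, the standard library's HTMLParser can collect hrefs from anchor tags:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to it."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

html = '<p><a href="https://example.com">Example</a> <a href="/rel">Rel</a></p>'
parser = HrefCollector()
parser.feed(html)
print(parser.hrefs)  # ['https://example.com', '/rel']
```

Unlike the regex recipes, this also picks up relative hrefs, which you can resolve against a base URL with urllib.parse.urljoin.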
Bash
- Using grep (PCRE) to extract http/https links from a file or stdin:
```bash
# GNU grep with -P (PCRE) and -o to print only matches
grep -oP 'https?://[^\s"'\''<>]+' file.txt
```
- Using awk (portable-ish):
```bash
awk '{
  while (match($0, /https?:\/\/[^ \t"'\''<>]+/)) {
    print substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
  }
}' file.txt
```
Notes (concise)
- The regexes above work for common cases but can miss edge cases, e.g. URLs wrapped in parentheses or followed by sentence punctuation, which the broad character class sweeps into the match.
- For HTML, prefer HTML parsers (DOM in JS, BeautifulSoup in Python) over regex.
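One cheap mitigation for the punctuation edge case: match broadly, then strip trailing punctuation from each hit. A minimal sketch in Python:

```python
import re

text = "See (https://example.com/page), or https://example.org."
raw = re.findall(r'https?://[^\s"\'<>]+', text)
# Strip common trailing punctuation that the broad character class picks up.
urls = [u.rstrip('.,);:') for u in raw]
print(urls)  # ['https://example.com/page', 'https://example.org']
```

This still isn't perfect (a URL that legitimately ends in a stripped character will be truncated), but it handles the most frequent prose cases.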