How to extract the location of a shorter sequence based on a longer sequence using Python?
I have a file containing sequence IDs and binding-site location information. I want to extract the location information without the a, t, c, g sequence itself. Each shorter sequence is shown above a longer sequence to indicate its location, and the number on the left of each longer sequence is a location value; in the example file, the value 451 is such a location. The shorter sequence starts at 453 on the longer sequence (the start site). Its length is 21, so adding that to 453 gives the end site, 474. Can anyone help me?
File a.txt:

    chr1:152806601-152807450
           ttcagcaccatggacagcgcc
    451 ggcttcagcaccacggacagcgccccacccgcggccctccccccggcggcgcgctccagccggtgtaggcgaggc
               ttcagcaccatggacagcgcc
    751 agagccccccgggactgcagagagcacctgggaggctggactgggaacgagacatactcgaaggagtaagtgaag
    chr10:125364276-125364825
                          ttcagcaccatggacagcgcc
    301 cagtaatgtggggttgtggtcagcaccatggacagctcccctgttgcttcatattgaggaataggaaagcgccgc
           ttcagcaccatggacagcgcc
    376 tatctccggatcctggctagctccagccactgcaggtaactgtcttgaatgggcttagaaacatggtgatgtctg
Desired output:

    chr1:152806601-152807450 453 474
    chr1:152806601-152807450 757 778
    chr10:125364276-125364825 318 339
    chr10:125364276-125364825 378 399
Example code:

    import re

    with open("a.txt", "r") as f:
        lines = f.readlines()

    label_ptrn = re.compile("")  # insert regular expression for the sequence id
    line_ptrn = re.compile("")   # insert regular expression for the start site
    inner_ptrn = re.compile("")  # insert regular expression for the end site

    all_matches = []
    for line in lines:
        m = label_ptrn.match(line)
        if m:
            label = m.groupdict().get("label")
            continue
        m = line_ptrn.match(line)
        if m:
            start = m.groupdict().get("start_value")
            sequence = m.groupdict().get("sequence")
            mi = inner_ptrn.search(sequence)
            if not mi:
                continue
            span = mi.span()
            all_matches.append((label, int(start) + span[0], int(start) + span[1]))

    # Text mode, because the lines written below are str, not bytes.
    with open("a_ouput.bed", "w") as f:
        for m in all_matches:
            f.write('%s\t%i\t%i\n' % m)
It looks to me like the start positions in your desired output are off by one.
           ttcagcaccatggacagcgcc
    451 ggcttcagcaccacggacagcgccccacccgcggccctccccccggcggcgcgctccagccggtgtaggcgaggc
The first t in the shorter sequence sits above the fourth character of the longer sequence. If the first character of the longer sequence is at position 451, that puts the first character of the shorter sequence at position 454.
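To make that arithmetic concrete, here is a minimal sketch for the first pair of lines. The exact number of spaces in the original file is an assumption (seven before the short sequence, one after 451), and `first_letter_index` is a hypothetical helper name, not something from the question.

```python
# Assumed layout: 7 spaces before the short sequence, one space after "451".
short_line = " " * 7 + "ttcagcaccatggacagcgcc"
long_line = "451 ggcttcagcaccacggacagcgcc"  # long sequence truncated for brevity


def first_letter_index(line):
    """Return the column of the first alphabetic character in the line."""
    for i, c in enumerate(line):
        if c.isalpha():
            return i
    raise ValueError("no sequence characters found")


long_start = int(long_line.split()[0])  # position of the first long-sequence character
offset = first_letter_index(short_line) - first_letter_index(long_line)
short_start = long_start + offset

print(offset, short_start)  # prints "3 454"
```

The offset of 3 is what places the short sequence at 454 rather than 453 when 451 is taken as the position of the first character of the longer sequence.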
If the file structure is constant, here is a non-regex solution.

    result = []
    with open('file.txt') as f:
        for line in f:
            if line.startswith('chr'):
                label = line.strip()
            elif line[0] == ' ':
                # short sequence
                length = len(line.strip())
                # find the index of the beginning of the short sequence
                for i, c in enumerate(line):
                    if c.isalpha():
                        short_index = i
                        break
            elif line[0].isdigit():
                # long sequence
                n = line.split(' ')[0]
                # or
                # n = line[:line.index(' ')]
                # find the index of the beginning of the long sequence
                for i, c in enumerate(line):
                    if c.isalpha():
                        long_index = i
                        break
                start = int(n) + short_index - long_index
                # start -= 1
                end = start + length
                result.append('{} {} {}'.format(label, start, end))
                offset, n, start, length = 0, 0, 0, 0
result
['chr1:152806601-152807450 454 475', 'chr1:152806601-152807450 758 779', 'chr10:125364276-125364825 319 340', 'chr10:125364276-125364825 379 400']
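The strings in `result` are space-separated, while the example code in the question writes tab-separated lines. If tab-separated output is what you need, a minimal sketch of the conversion could look like this; `to_tab_separated` is a hypothetical helper name, and the list is copied from the result above.

```python
# Result list copied from the answer's output.
result = [
    'chr1:152806601-152807450 454 475',
    'chr1:152806601-152807450 758 779',
    'chr10:125364276-125364825 319 340',
    'chr10:125364276-125364825 379 400',
]


def to_tab_separated(entries):
    """Turn 'label start end' strings into tab-separated lines."""
    rows = []
    for entry in entries:
        # Split from the right so only the last two fields are peeled off.
        label, start, end = entry.rsplit(' ', 2)
        rows.append('{}\t{}\t{}\n'.format(label, int(start), int(end)))
    return rows


rows = to_tab_separated(result)
print(''.join(rows), end='')
```

The rows can then be written with `f.writelines(rows)` to a file opened in text mode.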
If I have misinterpreted the example data, uncomment start -= 1.
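For easier testing, the same idea can be wrapped in a function that takes lines from memory instead of a file. This is only a sketch under the same assumptions about the file layout: `locate_short_sequences` and its `shift` parameter are hypothetical names, the sample indentation is reconstructed, and the long sequences are truncated.

```python
def first_letter_index(line):
    """Return the column of the first alphabetic character in the line."""
    return next(i for i, c in enumerate(line) if c.isalpha())


def locate_short_sequences(lines, shift=0):
    """Return (label, start, end) tuples; shift=-1 mirrors 'start -= 1'."""
    results = []
    label = short_index = length = None
    for raw in lines:
        line = raw.rstrip('\n')
        if line.startswith('chr'):
            label = line.strip()
        elif line[:1].isspace() and line.strip():
            # Short sequence line: remember its length and starting column.
            length = len(line.strip())
            short_index = first_letter_index(line)
        elif line[:1].isdigit():
            # Long sequence line: the leading number is its first position.
            long_start = int(line.split()[0])
            start = long_start + short_index - first_letter_index(line) + shift
            results.append((label, start, start + length))
    return results


# Assumed sample: indentation of 7 and 22 spaces, long sequences truncated.
sample = [
    'chr1:152806601-152807450\n',
    ' ' * 7 + 'ttcagcaccatggacagcgcc\n',
    '451 ggcttcagcaccacggacagcgcc\n',
    'chr10:125364276-125364825\n',
    ' ' * 22 + 'ttcagcaccatggacagcgcc\n',
    '301 cagtaatgtggggttgtggtcagcaccatggacagctcc\n',
]

print(locate_short_sequences(sample))
print(locate_short_sequences(sample, shift=-1))
```

With `shift=-1` the sample yields 453 to 474 and 318 to 339, which are the numbers in the desired output.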